Kragen Javier Sitaker | 31 Jul 2012 01:00

language-model dimensionality reduction with Markov-chain distribution medians

In my kragen-tol post earlier this month about [predictive text input
methods][0], I wrote:

> You could do *much better* than SwiftKey with dimensionality reduction
> techniques, which could allow much more context to be brought to bear
> on the prediction problem using a much smaller model.  SwiftKey often
> makes predictions that yield clearly ungrammatical 3-grams, which even
> a simple 3-gram model over part-of-speech tags ought to be able to
> reject.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-July/000961.html

But part-of-speech tagging is sort of hard, especially on incomplete and
possibly erroneous sentences containing words that aren't in the dictionary.
But it seems like you ought to be able to do a good enough job with simple
statistics to be useful.

Now, the rest of this post may be a case of "a few hours of hacking can save
you several minutes in the library", so take that under advisement.  I haven't
checked out the literature enough to know if this idea is already known, let
alone an improvement over what already exists.

One simple approach is to consider a Markov-chain model of text, and cluster
together words whose transition probabilities are similar, either for
in-transitions (what they tend to follow) or out-transitions (what they tend to
precede).

However, this runs into a practical problem immediately: even with a
first-order Markov model, you're trying to perform unsupervised classification
on thousands of vectors in a space with thousands of dimensions.  This is, as
(Continue reading)


Gmane