Hi Frank, we are using a very similar system in production: hashing
text-like data to a 50,000-dimensional vector with two probes, and then
applying tf-idf weighting.
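In Mahout terms the encoding step looks roughly like this. A minimal
sketch, not our exact code; "tokens" stands for our tokenizer output and
weightFor() is a placeholder for the tf-idf weight described below:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // One 50,000-dimensional sparse vector per training example,
    // each token hashed to two probe locations.
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);

    Vector doc = new RandomAccessSparseVector(50000);
    for (String token : tokens) {
      // weightFor(token): hypothetical helper returning the tf-idf weight
      encoder.addToVector(token, weightFor(token), doc);
    }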

For IDF we don't keep a separate weight dictionary; we just count the
distinct training examples ("documents") that have a non-null value per
column, so there is a full IDF vector that can be used. Instead of
Euclidean distance we use cosine, for performance reasons.
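Concretely, the IDF counting is a single pass over the training data. This
sketch assumes examples arrive as column-to-value maps; the names are
placeholders, not a fixed API:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;

    // Document frequency per column: count examples with a non-null value.
    long numDocs = 0;
    Map<String, Long> docFreq = new HashMap<String, Long>();
    for (Map<String, String> example : trainingExamples) {
      numDocs++;
      for (Map.Entry<String, String> field : example.entrySet()) {
        if (field.getValue() != null) {
          Long old = docFreq.get(field.getKey());
          docFreq.put(field.getKey(), old == null ? 1L : old + 1L);
        }
      }
    }
    // IDF for one column; the +1 smoothing is a common choice, not mandated.
    double idf = Math.log((double) numDocs / (1.0 + docFreq.get(column)));

    // Cosine instead of Euclidean, via Mahout's distance measures:
    CosineDistanceMeasure cosine = new CosineDistanceMeasure();
    double d = cosine.distance(docA, docB);  // docA, docB: two encoded vectors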

The results are very good, and building such a system is easy, so it may
be worth a try.

For representing the clusters we have a separate job that assigns users
("documents") to clusters and shows the most discriminating words for each
cluster via the LogLikelihood class. The results are then visualized using
http://wordcram.org/ for the whoah effect.
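That label selection is a four-cell log-likelihood ratio test per word. A
minimal sketch against Mahout's LogLikelihood class; the count variables
are placeholders to be filled in from the assignment job:

    import org.apache.mahout.math.stats.LogLikelihood;

    long k11 = wordInCluster;                 // occurrences of the word in the cluster
    long k12 = wordOutsideCluster;            // occurrences of the word elsewhere
    long k21 = clusterTotal - wordInCluster;  // all other words in the cluster
    long k22 = corpusTotal - clusterTotal - wordOutsideCluster;  // other words elsewhere
    double score = LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);
    // Rank the cluster's words by this score and show the top few.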

Cheers,

Johannes


On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <[email protected]> wrote:

> On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <[email protected]> wrote:
>
> > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Yes.  Hashing vector encoders will preserve distances when used with
> > > multiple probes.
> > >
> >
> > So if a token occurs two times in a document, the first token will be
> > mapped to a given location, and when the token is hashed the second time
> > it will be mapped to a different location, right?
> >
>
> No.  The same token will always hash to the same location(s).
>
>
> > I am wondering if, when many probes and a large enough vector are used,
> > this process mimics TF weighting, since documents that have a high TF of
> > a given token will have the same positions marked in the vector. As
> > Suneel said, when we then use the Hamming distance, the vectors that are
> > close to each other should be in the same cluster.
> >
>
> Hamming distance doesn't quite work because you want collisions to go to
> a sum rather than an OR.  Also, if you apply weights to the words, these
> weights will be added to all of the probe locations for the words.  This
> means we still need a plus/times/L2 dot product rather than a plus/AND/L1
> dot product like the Hamming distance uses.
>
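To make the sum-versus-OR point concrete: repeated occurrences of a token
accumulate at the same probe positions rather than flipping a bit once. A
small sketch, using the same Mahout encoder API as above:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    StaticWordValueEncoder enc = new StaticWordValueEncoder("text");
    enc.setProbes(2);
    Vector v = new RandomAccessSparseVector(50000);
    enc.addToVector("hadoop", v);  // default weight 1.0 at two probe positions
    enc.addToVector("hadoop", v);
    // The two probe positions for "hadoop" now hold 2.0 each; an OR/Hamming
    // view would only have recorded "set", losing the term frequency.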
> >
> > > Interpretation becomes somewhat difficult, but there is code available
> > > to reverse-engineer labels on hashed vectors.
> >
> >
> > I saw that AdaptiveWordEncoder has a built-in dictionary, so I can see
> > which words it has seen, but I don't see how to go from a position or
> > several positions in the vector to labels. Is there an example in the
> > code I can look at?
> >
>
> Yes.  The newsgroups example applies.
>
> The AdaptiveWordEncoder counts the word occurrences that it sees and uses
> an IDF based on the resulting counts.  This assumes that, in order to
> work, all instances of the AWE see roughly the same distribution of
> words.  It is fine for lots of applications and not fine for lots of
> others.
>
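For what it's worth, a minimal usage sketch, assuming the class meant here
is org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

    // The adaptive encoder keeps its own running word counts and derives an
    // IDF-style weight from them, which is why every instance needs to see
    // roughly the same word distribution.
    AdaptiveWordValueEncoder enc = new AdaptiveWordValueEncoder("text");
    Vector v = new RandomAccessSparseVector(50000);
    for (String token : tokens) {  // tokens: placeholder for tokenizer output
      enc.addToVector(token, v);   // updates counts, then adds weighted probes
    }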
>
> >
> >
> > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > dictionary of, say, the most common 50-200 thousand words and assume
> > > all others have constant and equal frequency.
> > >
> >
> > How would IDF weighting work in conjunction with hashing? First build up
> > a dictionary of 50-200 thousand words and pass that into the vector
> > encoders? The drawback of this is that you have another pass through the
> > data and another 'input' to keep track of and configure. But maybe it
> > has to be like that.
>
>
> With hashing, you still have the option of applying a weight to the hashed
> representation of each word.  The question is what weight.
>
> To build a small dictionary, you don't have to go through all of the
> data.  Just enough to get reasonably accurate weights for most words.
> All words not yet seen can be assumed to be rare and thus get the nominal
> "rare-word" weight.
>
> Keeping track of the dictionary of weights is, indeed, a pain.
>
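A sketch of that weight lookup with a bounded dictionary; docFreq,
numDocs, and RARE_WORD_WEIGHT are placeholder names, not a fixed API:

    // Words missing from the bounded dictionary are assumed rare and share
    // one nominal weight instead of an exact IDF.
    Long df = docFreq.get(word);
    double w = (df == null)
        ? RARE_WORD_WEIGHT                   // nominal weight for unseen words
        : Math.log((double) numDocs / df);   // exact IDF for dictionary words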
>
>
> > The
> > reason I like the hashed encoders is that vectorizing can be done in a
> > streaming manner at the last possible moment. With the current tools you
> > have to do: data -> data2seq -> seq2sparse -> kmeans.
> >
>
> Indeed.  That is the great virtue.
>
>
> >
> > If this approach is doable, I would like to code up a Java non-Hadoop
> > example using the Reuters dataset that vectorizes each doc using the
> > hashing encoders and configures KMeans with Hamming distance, and then
> > write some code to get the labels.
> >
>
> Use Euclidean distance, not Hamming.
>
> You can definitely use the AWE here if you randomize document ordering.
>
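Putting Ted's advice together, the vectorization half of that experiment
could look like this. A rough in-memory sketch: shuffledTokenizedDocs is a
placeholder for the randomized Reuters documents, and the actual k-means
call is left to whichever in-memory entry point you prefer:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

    // One pass, no Hadoop: encode each shuffled document into a hashed
    // vector, then cluster the list with Euclidean distance (not Hamming).
    AdaptiveWordValueEncoder enc = new AdaptiveWordValueEncoder("reuters");
    enc.setProbes(2);
    List<Vector> docs = new ArrayList<Vector>();
    for (List<String> tokens : shuffledTokenizedDocs) {
      Vector v = new RandomAccessSparseVector(50000);
      for (String token : tokens) {
        enc.addToVector(token, v);
      }
      docs.add(v);
    }
    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();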
