Hi Johannes,

Sounds good.

The step for finding labels is still unclear to me. Do you use the
LogLikelihood class on the original documents? How? Or do you mean the
collocation job?

Cheers,

Frank

On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
[email protected]> wrote:

> Hi Frank, we are using a very similar system in production.
> We hash text-like data to a 50000-dimensional vector with two probes, and
> then apply tf-idf weighting.
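A minimal sketch of that encoding step (my own illustration, not the actual Mahout encoder API; the class name, hash function, and constants are assumptions): each token hashes to two probe columns of a 50000-dimensional vector, and repeated occurrences of a token sum into those columns.

```java
// Sketch of multi-probe feature hashing: every token is hashed to
// PROBES columns of a fixed-size vector, and repeated occurrences of a
// token add up, giving raw term frequencies per hashed column.
public class HashedEncoder {
    static final int DIM = 50_000;  // dimensionality from the thread
    static final int PROBES = 2;    // probe locations per token

    // Deterministic probe location: the same token + probe index always
    // hashes to the same column.
    static int probe(String token, int p) {
        return Math.floorMod((token + "#" + p).hashCode(), DIM);
    }

    static double[] encode(String[] tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            for (int p = 0; p < PROBES; p++) {
                v[probe(t, p)] += 1.0;  // sum, not OR: collisions accumulate
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = encode(new String[] {"mahout", "kmeans", "mahout"});
        double sum = 0;
        for (double x : v) sum += x;
        System.out.println(sum);  // 6.0 = 3 tokens * 2 probes
    }
}
```

Because probes accumulate by addition, a token that occurs twice contributes twice the mass at the same columns, which is what makes the hashed vector behave like a TF vector.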
>
> For IDF we don't keep a separate weight dictionary but just count the
> distinct training examples ("documents") that have a non-null value per
> column, so there is a full IDF vector that can be used.
> Instead of Euclidean distance we use cosine (for performance reasons).
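The IDF bookkeeping described above can be sketched like this (hypothetical code, not the production job): count, per hashed column, how many documents have a non-null value there, derive a full IDF vector from those counts, and compare weighted vectors with cosine similarity.

```java
// Sketch: per-column document frequencies over hashed vectors -> IDF
// vector, plus cosine similarity for comparing weighted vectors.
public class HashedIdf {
    static double[] idfVector(double[][] docs) {
        int dim = docs[0].length;
        int n = docs.length;
        double[] idf = new double[dim];
        for (int j = 0; j < dim; j++) {
            int df = 0;
            for (double[] d : docs) {
                if (d[j] != 0.0) df++;  // distinct docs with a value here
            }
            idf[j] = Math.log((double) (n + 1) / (df + 1));  // smoothed IDF
        }
        return idf;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[][] docs = {
            {2, 0, 1, 0},
            {1, 1, 0, 0},
            {0, 1, 0, 3},
        };
        double[] idf = idfVector(docs);
        // column 0 appears in 2 of 3 docs, column 3 in only 1,
        // so the rarer column gets the larger IDF weight
        System.out.println(idf[3] > idf[0]);  // true
    }
}
```

The smoothing constant is an assumption on my part; the point is only that the whole IDF vector falls out of per-column counts, with no word dictionary needed.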
>
> The results are very good, building such a system is easy and maybe it's
> worth a try.
>
> For representing the clusters we have a separate job that assigns users
> ("documents") to clusters and shows the most discriminating words for each
> cluster via the LogLikelihood class. The results are then visualized using
> http://wordcram.org/ for the whoah effect.
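Mahout's LogLikelihood class implements Dunning's log-likelihood ratio (G^2) test; here is a self-contained sketch of that scoring, used to rank how strongly a word is associated with one cluster versus the rest of the corpus (the counts in main are made up for illustration).

```java
// Self-contained sketch of log-likelihood ratio scoring for finding the
// most discriminating words of a cluster.
public class Llr {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy over raw counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // k11: word count in the cluster, k12: other words in the cluster,
    // k21: word count outside the cluster, k22: other words outside.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0;  // guard against floating-point rounding
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // a word frequent in the cluster but rare elsewhere scores much
        // higher than a word spread evenly across the corpus
        double discriminating = logLikelihoodRatio(100, 900, 10, 9000);
        double background = logLikelihoodRatio(100, 900, 1000, 9000);
        System.out.println(discriminating > background);  // true
    }
}
```

Sorting a cluster's words by this score and keeping the top few is what produces the label words fed to the word-cloud visualization.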
>
> Cheers,
>
> Johannes
>
>
> On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <[email protected]>
> wrote:
>
> > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten
> > > <[email protected]> wrote:
> >
> > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > Yes.  Hashing vector encoders will preserve distances when used with
> > > > multiple probes.
> > > >
> > >
> > > So if a token occurs two times in a document the first token will be
> > > mapped to a given location and when the token is hashed the second
> > > time it will be mapped to a different location, right?
> > >
> >
> > No.  The same token will always hash to the same location(s).
> >
> >
> > > I am wondering if when many probes are used and a large enough vector
> > > this process mimics TF weighting, since documents that have a high TF
> > > of a given token will have the same positions marked in the vector. As
> > > Suneel said when we then use the Hamming distance the vectors that are
> > > close to each other should be in the same cluster.
> > >
> >
> > Hamming distance doesn't quite work because you want to have collisions
> > to a sum rather than an OR.  Also, if you apply weights to the words,
> > these weights will be added to all of the probe locations for the words.
> > This means we still need a plus/times/L2 dot product rather than a
> > plus/AND/L1 dot product like the Hamming distance uses.
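Ted's distinction can be shown with a toy example (hypothetical vectors): a column holding 3.0 (high term frequency, or a probe collision, or an IDF weight) is indistinguishable from one holding 1.0 under a presence/absence comparison, but not under the L2 dot product.

```java
// Tiny illustration: plus/times (L2) dot product vs. a plus/AND
// (Hamming-style) comparison over weighted hashed vectors.
public class DotVsHamming {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static int hamming(double[] a, double[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            if ((a[i] != 0) != (b[i] != 0)) d++;  // only presence/absence
        }
        return d;
    }

    public static void main(String[] args) {
        double[] x = {3.0, 0.0, 1.0};  // column 0: high tf or summed probes
        double[] y = {1.0, 0.0, 1.0};
        // Hamming sees x and y as identical (same nonzero pattern) ...
        System.out.println(hamming(x, y));     // 0
        // ... while the L2 dot product keeps the weight information
        System.out.println(dot(x, x) > dot(x, y));  // true
    }
}
```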
> >
> > >
> > > > Interpretation becomes somewhat difficult, but there is code
> > > > available to reverse engineer labels on hashed vectors.
> > >
> > >
> > > I saw that AdaptiveWordEncoder has a built-in dictionary so I can see
> > > which words it has seen, but I don't see how to go from a position or
> > > several positions in the vector to labels. Is there an example in the
> > > code I can look at?
> > >
> >
> > Yes.  The newsgroups example applies.
> >
> > The AdaptiveWordEncoder counts word occurrences that it sees and uses
> > the IDF based on the resulting counts.  This assumes that all instances
> > of the AWE will see roughly the same distribution of words to work.  It
> > is fine for many applications and not fine for many others.
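A rough sketch of that online-counting idea, assuming a simplified weighting formula of my own; this is not the real AdaptiveWordEncoder internals, only the mechanism Ted describes:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of adaptive IDF: the encoder counts each word as it encodes,
// and derives the weight from its own running counts. Every encoder
// instance must therefore see roughly the same word distribution for
// its weights to be comparable.
public class AdaptiveWeight {
    private final Map<String, Long> counts = new HashMap<>();
    private long total = 0;

    double weight(String word) {
        long c = counts.merge(word, 1L, Long::sum);
        total++;
        // rarer words (low running count) get a higher weight
        return Math.log((double) (total + 1) / c);
    }

    public static void main(String[] args) {
        AdaptiveWeight w = new AdaptiveWeight();
        w.weight("rare");
        w.weight("common"); w.weight("common"); w.weight("common");
        // after several sightings, "common" weighs less than a new word
        System.out.println(w.weight("common") < w.weight("fresh"));  // true
    }
}
```

The dependence on running counts is also why the training order matters: if one encoder instance sees a skewed slice of the data, its weights drift away from the others'.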
> >
> >
> > >
> > >
> > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > > dictionary of, say, the most common 50-200 thousand words and assume
> > > > all others have constant and equal frequency.
> > > >
> > >
> > > How would IDF weighting work in conjunction with hashing? First build
> > > up a dictionary of 50-200 thousand words and pass that into the vector
> > > encoders? The drawback of this is that you have another pass through
> > > the data and another 'input' to keep track of and configure. But maybe
> > > it has to be like that.
> >
> >
> > With hashing, you still have the option of applying a weight to the
> > hashed representation of each word.  The question is what weight.
> >
> > To build a small dictionary, you don't have to go through all of the
> > data.  Just enough to get reasonably accurate weights for most words.
> > All words not yet seen can be assumed to be rare and thus get the
> > nominal "rare-word" weight.
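A minimal sketch of such a weight dictionary with a nominal rare-word default; the words and weights here are illustrative, not from any real corpus:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a bounded IDF dictionary: keep weights only for the most
// common words, and fall back to one nominal "rare-word" weight for
// everything else, so unseen words never need an entry.
public class IdfDictionary {
    private final Map<String, Double> idf = new HashMap<>();
    private final double rareWordWeight;

    IdfDictionary(double rareWordWeight) {
        this.rareWordWeight = rareWordWeight;
    }

    void put(String word, double weight) {
        idf.put(word, weight);
    }

    double weight(String word) {
        // anything outside the dictionary is assumed rare
        return idf.getOrDefault(word, rareWordWeight);
    }

    public static void main(String[] args) {
        IdfDictionary dict = new IdfDictionary(8.0);  // nominal rare weight
        dict.put("the", 0.1);  // very common -> low IDF
        dict.put("hadoop", 4.2);
        System.out.println(dict.weight("the"));      // 0.1
        System.out.println(dict.weight("zyzzyva"));  // 8.0
    }
}
```

Capping the dictionary at the most common words keeps it small enough to ship with the model, which is part of what makes the bookkeeping Ted calls "a pain" manageable.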
> >
> > Keeping track of the dictionary of weights is, indeed, a pain.
> >
> >
> >
> > > The reason I like the hashed encoders is that vectorizing can be done
> > > in a streaming manner at the last possible moment. With the current
> > > tools you have to do: data -> data2seq -> seq2sparse -> kmeans.
> > >
> >
> > Indeed.  That is the great virtue.
> >
> >
> > >
> > > If this approach is doable I would like to code up a Java non-Hadoop
> > > example using the Reuters dataset which vectorizes each doc using the
> > > hashing encoders, configures KMeans with Hamming distance, and then
> > > write some code to get the labels.
> >
> > Use Euclidean distance, not Hamming.
> >
> > You can definitely use the AWE here if you randomize document ordering.
> >
>
