Hi Frank,

No, no collocation job. You just take a big enough sample of documents and
assign each one to its cluster with the learned ClusterClassifier. In
parallel you count the total word frequencies in a Guava Multiset and the
per-cluster word counts in one Multiset per cluster. The LogLikelihood class
contains a convenient method that takes two such Multisets; you use it once
per cluster.

There should be no need to start a MapReduce job for that; with some RAM
you can just stream the documents from HDFS.
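
Roughly like this (an untested sketch, just to be concrete: the class and
method names are made up, only Guava's Multiset and Mahout's LogLikelihood
are real; if I recall correctly the convenience method is
LogLikelihood.compareFrequencies(Multiset, Multiset, maxReturn, threshold),
the version below spells the same thing out with rootLogLikelihoodRatio so
the contingency counts are visible):

import com.google.common.collect.Multiset;
import org.apache.mahout.math.stats.LogLikelihood;

import java.util.HashMap;
import java.util.Map;

public class ClusterLabelSketch {

  // clusterWords: word counts of the documents assigned to one cluster
  // allWords:     word counts over the whole document sample
  static Map<String, Double> discriminatingWords(Multiset<String> clusterWords,
                                                 Multiset<String> allWords) {
    Map<String, Double> scores = new HashMap<String, Double>();
    long clusterTotal = clusterWords.size();
    long corpusTotal = allWords.size();
    for (Multiset.Entry<String> e : clusterWords.entrySet()) {
      long k11 = e.getCount();                          // word inside this cluster
      long k12 = allWords.count(e.getElement()) - k11;  // word in the rest of the corpus
      long k21 = clusterTotal - k11;                    // other words inside this cluster
      long k22 = corpusTotal - k11 - k12 - k21;         // other words elsewhere
      scores.put(e.getElement(),
          LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22));
    }
    return scores;
  }
}

Sort the scores descending per cluster and keep the top handful of words as
the label.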




On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten <[email protected]> wrote:

> Hi Johannes,
>
> Sounds good.
>
> The step for finding labels is still unclear to me. You use the
> LogLikelihood class on the original documents? How? Or do you mean the
> collocation job?
>
> Cheers,
>
> Frank
>
>
> On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte
> <[email protected]> wrote:
>
> > Hi Frank, we are using a very similar system in production: hashing
> > text-like data to a 50,000-dimensional vector with two probes, and
> > then applying tf-idf weighting.
> >
> > For IDF we don't keep a separate weight dictionary but just count the
> > distinct training examples ("documents") that have a non-null value per
> > column, so there is a full IDF vector that can be used.
> > Instead of Euclidean distance we use Cosine (performance reasons).
> >
> > The results are very good; building such a system is easy and maybe it's
> > worth a try.
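
To make the encoding step concrete, here is a rough, untested sketch (encode
and idfOf are made-up names and the idf lookup is stubbed out;
StaticWordValueEncoder, setProbes, addToVector and RandomAccessSparseVector
are the Mahout pieces, as far as I remember their signatures):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {

  static final int CARDINALITY = 50000;

  static Vector encode(Iterable<String> tokens) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);                  // two probe locations per token
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    for (String token : tokens) {
      // every occurrence adds the token's idf weight at its probe
      // locations, so term frequency builds up by summation
      encoder.addToVector(token, idfOf(token), v);
    }
    return v.normalize();                  // unit length, so cosine becomes a dot product
  }

  static double idfOf(String token) {
    return 1.0;  // placeholder; the real idf comes from the document-frequency counts
  }
}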
> >
> > For representing the cluster we have a separate job that assigns users
> > ("documents") to clusters and shows the most discriminating words for the
> > cluster via the LogLikelihood class. The results are then visualized
> > using http://wordcram.org/ for the whoah effect.
> >
> > Cheers,
> >
> > Johannes
> >
> >
> > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <[email protected]>
> > > wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > Yes.  Hashing vector encoders will preserve distances when used
> > > > > with multiple probes.
> > > > >
> > > >
> > > > So if a token occurs two times in a document the first token will be
> > > > mapped to a given location and when the token is hashed the second
> > > > time it will be mapped to a different location, right?
> > > >
> > >
> > > No.  The same token will always hash to the same location(s).
> > >
> > >
> > > > I am wondering whether, when many probes and a large enough vector
> > > > are used, this process mimics TF weighting, since documents that have
> > > > a high TF of a given token will have the same positions marked in the
> > > > vector. As Suneel said, when we then use the Hamming distance, the
> > > > vectors that are close to each other should be in the same cluster.
> > > >
> > >
> > > Hamming distance doesn't quite work because you want collisions to
> > > sum rather than OR.  Also, if you apply weights to the words, these
> > > weights will be added to all of the probe locations for the words.
> > > This means we still need a plus/times/L2 dot product rather than a
> > > plus/AND/L1 dot product like the Hamming distance uses.
> > >
> > > >
> > > > > Interpretation becomes somewhat difficult, but there is code
> > > > > available to reverse-engineer labels on hashed vectors.
> > > >
> > > >
> > > > I saw that AdaptiveWordEncoder has a built-in dictionary so I can see
> > > > which words it has seen, but I don't see how to go from a position or
> > > > several positions in the vector to labels. Is there an example in the
> > > > code I can look at?
> > > >
> > >
> > > Yes.  The newsgroups example applies.
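
(If I remember the newsgroups example right, the trick for getting labels
back is a trace dictionary: the encoders can record which vector positions
each word touched, and you invert that map. A rough, untested sketch; the
class name is made up and setTraceDictionary is the encoder hook as far as I
recall it:)

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TraceSketch {
  public static void main(String[] args) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);

    // the encoder records which positions each original form hashed to in here
    Map<String, Set<Integer>> trace = new TreeMap<String, Set<Integer>>();
    encoder.setTraceDictionary(trace);

    Vector v = new RandomAccessSparseVector(50000);
    for (String token : new String[] {"hashing", "vector", "encoders"}) {
      encoder.addToVector(token, v);
    }

    // invert the trace: position -> words that could have written it
    Map<Integer, String> labels = new HashMap<Integer, String>();
    for (Map.Entry<String, Set<Integer>> e : trace.entrySet()) {
      for (Integer position : e.getValue()) {
        String existing = labels.get(position);
        labels.put(position, existing == null ? e.getKey() : existing + "|" + e.getKey());
      }
    }
    System.out.println(labels);
  }
}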
> > >
> > > The AdaptiveWordEncoder counts the word occurrences that it sees and
> > > uses an IDF based on the resulting counts.  This assumes that all
> > > instances of the AWE see roughly the same distribution of words in
> > > order to work.  It is fine for lots of applications and not fine for
> > > lots of others.
> > >
> > >
> > > >
> > > >
> > > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > > > dictionary of, say, the most common 50-200 thousand words and
> > > > > assume all others have constant and equal frequency.
> > > > >
> > > >
> > > > How would IDF weighting work in conjunction with hashing? First build
> > > > up a dictionary of the 50-200 thousand most common words and pass
> > > > that into the vector encoders? The drawback of this is that you have
> > > > another pass through the data and another 'input' to keep track of
> > > > and configure. But maybe it has to be like that.
> > >
> > >
> > > With hashing, you still have the option of applying a weight to the
> > > hashed representation of each word.  The question is what weight.
> > >
> > > To build a small dictionary, you don't have to go through all of the
> > > data.  Just enough to get reasonably accurate weights for most words.
> > > All words not yet seen can be assumed to be rare and thus get the
> > > nominal "rare-word" weight.
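
In code that could look roughly like this (untested sketch; weightedEncoder
is a made-up name, and setDictionary / setMissingValueWeight are, if I
remember correctly, the knobs StaticWordValueEncoder exposes for exactly
this known-words-plus-rare-word-default scheme):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class DictionaryWeightSketch {

  static StaticWordValueEncoder weightedEncoder(Iterable<Iterable<String>> sampleDocs,
                                                int sampleSize) {
    // document frequencies from a modest sample, not the whole corpus
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (Iterable<String> doc : sampleDocs) {
      Set<String> seen = new HashSet<String>();
      for (String token : doc) {
        if (seen.add(token)) {
          Integer c = df.get(token);
          df.put(token, c == null ? 1 : c + 1);
        }
      }
    }
    Map<String, Double> idf = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      idf.put(e.getKey(), Math.log((double) sampleSize / e.getValue()));
    }

    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);
    encoder.setDictionary(idf);                           // weights for the known words
    encoder.setMissingValueWeight(Math.log(sampleSize));  // nominal "rare-word" weight
    return encoder;
  }
}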
> > >
> > > Keeping track of the dictionary of weights is, indeed, a pain.
> > >
> > >
> > >
> > > > The reason I like the hashed encoders is that vectorizing can be done
> > > > in a streaming manner at the last possible moment. With the current
> > > > tools you have to do: data -> data2seq -> seq2sparse -> kmeans.
> > > >
> > >
> > > Indeed.  That is the great virtue.
> > >
> > >
> > > >
> > > > If this approach is doable I would like to code up a Java non-Hadoop
> > > > example using the Reuters dataset which vectorizes each doc using the
> > > > hashing encoders, configures KMeans with Hamming distance, and then
> > > > write some code to get the labels.
> > > >
> > >
> > > Use Euclidean distance, not Hamming.
> > >
> > > You can definitely use the AWE here if you randomize document ordering.
> > >
> >
>
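
P.S. Frank, for the non-Hadoop Reuters experiment, something along these
lines might be a starting point (untested sketch: loadAndTokenizeReuters is
a stub you would have to fill in, the shuffle is there because of Ted's
remark about randomizing document ordering for the adaptive encoder, and
wiring the vectors into an in-memory k-means is left out):

import java.util.Collections;
import java.util.List;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

public class ReutersHashingSketch {

  static final int CARDINALITY = 50000;

  public static void main(String[] args) {
    List<List<String>> docs = loadAndTokenizeReuters();  // hypothetical helper
    Collections.shuffle(docs);                           // randomize ordering for the AWE

    AdaptiveWordValueEncoder encoder = new AdaptiveWordValueEncoder("text");
    encoder.setProbes(2);

    Vector[] vectors = new Vector[docs.size()];
    for (int i = 0; i < docs.size(); i++) {
      Vector v = new RandomAccessSparseVector(CARDINALITY);
      for (String token : docs.get(i)) {
        encoder.addToVector(token, v);  // the weight adapts as the encoder sees more docs
      }
      vectors[i] = v;
    }

    // these vectors then go into k-means with Euclidean distance, not Hamming
    EuclideanDistanceMeasure euclidean = new EuclideanDistanceMeasure();
    System.out.println("distance between the first two docs: "
        + euclidean.distance(vectors[0], vectors[1]));
  }

  static List<List<String>> loadAndTokenizeReuters() {
    throw new UnsupportedOperationException("left out of the sketch");
  }
}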
