Ah, interesting. I am going to try it out.

Thanks for your comments!


On Fri, Mar 21, 2014 at 9:29 PM, Johannes Schulte <
[email protected]> wrote:

> Hi Frank,
>
> no, no collocation job. You just take a big enough sample of documents and
> assign each one to its cluster with the learned ClusterClassifier. In
> parallel, you count the total words in a Guava multiset and the per-cluster
> word counts in a multiset. The LogLikelihood class contains a convenient
> method that takes two multisets, which you use for all clusters.
>
> There should be no need to start a MapReduce job for that; with some RAM
> you can just stream the documents from HDFS.
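The label-finding step Johannes describes boils down to Dunning's log-likelihood ratio test, comparing a word's count inside one cluster against its count in the rest of the corpus. A minimal plain-JDK sketch of that math (the same formula behind Mahout's LogLikelihood class; plain HashMap counts stand in for the Guava multisets):

```java
import java.util.HashMap;
import java.util.Map;

class ClusterLabels {

    // x * ln(x), with the 0 * ln(0) = 0 convention.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a list of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sum += c;
            sumXLogX += xLogX(c);
        }
        return xLogX(sum) - sumXLogX;
    }

    // Dunning's log-likelihood ratio for a 2x2 contingency table --
    // the same computation as Mahout's LogLikelihood.logLikelihoodRatio.
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + colEntropy < matEntropy) {
            return 0.0; // guard against tiny negative rounding errors
        }
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
    }

    // Score every word of one cluster against the whole corpus; the
    // highest-scoring words are the label candidates for that cluster.
    static Map<String, Double> score(Map<String, Long> clusterCounts,
                                     Map<String, Long> totalCounts) {
        long clusterSize = 0;
        for (long c : clusterCounts.values()) clusterSize += c;
        long totalSize = 0;
        for (long c : totalCounts.values()) totalSize += c;

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> e : clusterCounts.entrySet()) {
            long k11 = e.getValue();                                     // word inside the cluster
            long k12 = totalCounts.getOrDefault(e.getKey(), k11) - k11;  // word outside the cluster
            long k21 = clusterSize - k11;                                // other words inside
            long k22 = totalSize - clusterSize - k12;                    // other words outside
            scores.put(e.getKey(), llr(k11, k12, k21, k22));
        }
        return scores;
    }
}
```

An even split (the word distributed proportionally) scores near zero; a word concentrated in one cluster scores high, which is what makes it a good label.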
>
> On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten
> <[email protected]> wrote:
>
> > Hi Johannes,
> >
> > Sounds good.
> >
> > The step for finding labels is still unclear to me. You use the
> > Loglikelihood class on the original documents? How? Or do you mean the
> > collocation job?
> >
> > Cheers,
> >
> > Frank
> >
> >
> > On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte
> > <[email protected]> wrote:
> >
> > > Hi Frank, we are using a very similar system in production: hashing
> > > text-like data to a 50,000-dimensional vector with two probes, then
> > > applying tf-idf weighting.
> > >
> > > For IDF we don't keep a separate weight dictionary but just count the
> > > distinct training examples ("documents") that have a non-null value per
> > > column, so there is a full IDF vector that can be used. Instead of
> > > Euclidean distance we use cosine (for performance reasons).
> > >
> > > The results are very good; building such a system is easy and maybe
> > > it's worth a try.
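The encoding Johannes describes can be sketched in a few lines of plain Java. This is a toy stand-in for Mahout's hashed encoders, assuming a simple salted-hash probe scheme (not Mahout's actual hashing): each token adds its weight at two hashed positions, and vectors are compared with cosine similarity.

```java
class HashedEncoder {
    // Toy "hashing trick" text encoder with two probes: every token ADDS
    // its weight at two hashed positions, so colliding tokens sum
    // instead of OR-ing.
    static final int DIM = 50_000;
    static final int PROBES = 2;

    // Derive independent positions by salting the token with the probe index.
    static int probe(String token, int probeIndex) {
        return Math.floorMod((token + "#" + probeIndex).hashCode(), DIM);
    }

    static double[] encode(String[] tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            for (int p = 0; p < PROBES; p++) {
                v[probe(t, p)] += 1.0; // a tf-idf weight would go here instead of 1.0
            }
        }
        return v;
    }

    // Cosine similarity, the measure Johannes uses instead of Euclidean distance.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

With 50,000 positions and two probes, accidental collisions between distinct tokens are rare enough that distances between documents are well preserved.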
> > >
> > > For representing the cluster we have a separate job that assigns users
> > > ("documents") to clusters and shows the most discriminating words for
> > > the cluster via the LogLikelihood class. The results are then visualized
> > > using http://wordcram.org/ for the "whoa" effect.
> > >
> > > Cheers,
> > >
> > > Johannes
> > >
> > >
> > > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten
> > > > <[email protected]> wrote:
> > > >
> > > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Yes. Hashing vector encoders will preserve distances when used
> > > > > > with multiple probes.
> > > > > >
> > > > >
> > > > > So if a token occurs two times in a document, the first token will
> > > > > be mapped to a given location, and when the token is hashed the
> > > > > second time it will be mapped to a different location, right?
> > > > >
> > > >
> > > > No.  The same token will always hash to the same location(s).
> > > >
> > > >
> > > > > I am wondering if, when many probes and a large enough vector are
> > > > > used, this process mimics TF weighting, since documents that have a
> > > > > high TF of a given token will have the same positions marked in the
> > > > > vector. As Suneel said, when we then use the Hamming distance, the
> > > > > vectors that are close to each other should be in the same cluster.
> > > > >
> > > >
> > > > Hamming distance doesn't quite work because you want collisions to
> > > > sum rather than OR. Also, if you apply weights to the words, these
> > > > weights will be added to all of the probe locations for the words.
> > > > This means we still need a plus/times/L2 dot product rather than a
> > > > plus/AND/L1 dot product like the Hamming distance uses.
> > > >
> > > > >
> > > > > > Interpretation becomes somewhat difficult, but there is code
> > > > > > available to reverse-engineer labels on hashed vectors.
> > > > >
> > > > >
> > > > > I saw that AdaptiveWordEncoder has a built-in dictionary so I can
> > > > > see which words it has seen, but I don't see how to go from a
> > > > > position or several positions in the vector to labels. Is there an
> > > > > example in the code I can look at?
> > > > >
> > > >
> > > > Yes.  The newsgroups example applies.
> > > >
> > > > The AdaptiveWordEncoder counts word occurrences that it sees and uses
> > > > the IDF based on the resulting counts. This assumes that all instances
> > > > of the AWE will see roughly the same distribution of words. It is fine
> > > > for lots of applications and not fine for lots of others.
> > > >
> > > >
> > > > >
> > > > >
> > > > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > > > > dictionary of, say, the most common 50-200 thousand words and
> > > > > > assume all others have constant and equal frequency.
> > > > > >
> > > > >
> > > > > How would IDF weighting work in conjunction with hashing? First
> > > > > build up a dictionary of 50-200 thousand words and pass that into
> > > > > the vector encoders? The drawback of this is that you have another
> > > > > pass through the data and another 'input' to keep track of and
> > > > > configure. But maybe it has to be like that.
> > > >
> > > >
> > > > With hashing, you still have the option of applying a weight to the
> > > > hashed representation of each word. The question is what weight.
> > > >
> > > > To build a small dictionary, you don't have to go through all of the
> > > > data, just enough to get reasonably accurate weights for most words.
> > > > All words not yet seen can be assumed to be rare and thus get the
> > > > nominal "rare-word" weight.
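Ted's dictionary-with-fallback idea might be sketched like this (illustrative names, not a Mahout API): store IDF weights for the words actually counted, and hand every unseen word a constant rare-word weight.

```java
import java.util.HashMap;
import java.util.Map;

class IdfDictionary {
    // Sketch of the small-dictionary idea: known words get log(N / df);
    // unseen words are assumed rare and get a constant fallback weight,
    // here the weight of a word seen in a single document.
    private final Map<String, Double> idf = new HashMap<>();
    private final double rareWordWeight;

    IdfDictionary(Map<String, Long> documentFrequencies, long numDocs) {
        for (Map.Entry<String, Long> e : documentFrequencies.entrySet()) {
            idf.put(e.getKey(), Math.log((double) numDocs / e.getValue()));
        }
        this.rareWordWeight = Math.log((double) numDocs);
    }

    double weight(String token) {
        return idf.getOrDefault(token, rareWordWeight);
    }
}
```

Because the fallback is a single constant, the dictionary only needs to be big enough to cover the common words; everything in the long tail shares the rare-word weight, which is exactly the approximation Ted suggests.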
> > > >
> > > > Keeping track of the dictionary of weights is, indeed, a pain.
> > > >
> > > >
> > > >
> > > > > The reason I like the hashed encoders is that vectorizing can be
> > > > > done in a streaming manner at the last possible moment. With the
> > > > > current tools you have to do: data -> data2seq -> seq2sparse ->
> > > > > kmeans.
> > > > >
> > > >
> > > > Indeed.  That is the great virtue.
> > > >
> > > >
> > > > >
> > > > > If this approach is doable I would like to code up a Java
> non-Hadoop
> > > > > example using the Reuters dataset which vectorizes each doc using
> the
> > > > > hashing encoders, configures KMeans with Hamming distance and then
> > > write
> > > > > some code to get the labels.
> > > > >
> > > >
> > > > Use Euclidean distance, not Hamming.
> > > >
> > > > You can definitely use the AWE here if you randomize document
> > > > ordering.
> > > >
> > >
> >
>
