Well, there is also an online method of LDA in Spark ... Pat, is there any documentation on the method you described?
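For reference, the online method David mentions is presumably Spark MLlib's OnlineLDAOptimizer (available since Spark 1.4). A minimal, untested sketch of selecting it, assuming corpus is an RDD[(Long, Vector)] of (docId, term-count vector) pairs:

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

    // Online variational Bayes LDA (Hoffman et al. 2010): the optimizer
    // consumes the corpus in mini-batches rather than full EM passes,
    // which makes it practical for large corpora.
    val onlineModel = new LDA()
      .setK(10)                // number of topics
      .setMaxIterations(100)   // for the online optimizer: mini-batches submitted
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
      .run(corpus)             // corpus: RDD[(Long, Vector)] of term counts

The online optimizer returns a LocalLDAModel, whose topicDistributions method can infer topic mixtures for new documents against the fixed topics, which is essentially the fold-in approach David asks about below.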
On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel <[email protected]> wrote:

> The method I described calculates similarity on the fly but requires new
> docs to go through feature extraction before similarity can be queried.
> The time needed for feature extraction is short compared to training LDA.
>
> Another method that gets at semantic similarity uses adaptive skip-grams
> for text features: http://arxiv.org/abs/1502.07257 I haven't tried this,
> but a friend saw a presentation about using this method to create features
> for a search engine, which showed a favorable comparison with word2vec.
>
> If you want to use LDA, note that it is an unsupervised categorization
> method. To use it, the cluster descriptors (a vector of important terms)
> can be compared to the analyzed incoming document using a KNN/search
> engine. This will give you a list of the closest clusters, but it doesn't
> really give you documents, which is your goal, I think. LDA should be
> re-run periodically to generate new clusters. Do you want to know cluster
> inclusion, or do you want a list of similar docs?
>
> On Feb 23, 2016, at 1:01 PM, David Starina <[email protected]> wrote:
>
> Guys, one more question ... Are there any incremental methods to do this?
> I don't want to run the whole job again once a new document is added. In
> the case of LDA ... I guess the best way is to calculate the topics on the
> new document using the topics from the previous LDA run, and then every
> once in a while recalculate the topics with the new documents?
>
> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <[email protected]> wrote:
>
> > Something we are working on for purely content-based similarity is using
> > a KNN engine (search engine), but creating features from word2vec and an
> > NER (Named Entity Recognizer).
> >
> > Putting the generated features into fields of a doc can really help with
> > similarity, because w2v and NER create semantic features. You can also
> > try n-grams or skip-grams. These features are not very helpful for
> > search, but for similarity they work well.
> >
> > The query to the KNN engine is a document, each field mapped to the
> > corresponding field of the index. The result is the k nearest neighbors
> > to the query doc.
> >
> >> On Feb 14, 2016, at 11:05 AM, David Starina <[email protected]> wrote:
> >>
> >> Charles, thank you, I will check that out.
> >>
> >> Ted, I am looking for semantic similarity. Unfortunately, I do not have
> >> any data on the usage of the documents (if by usage you mean user
> >> behavior).
> >>
> >> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <[email protected]> wrote:
> >>
> >>> Did you want textual similarity?
> >>>
> >>> Or semantic similarity?
> >>>
> >>> The actual semantics of a message can be opaque from the content, but
> >>> clear from the usage.
> >>>
> >>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <[email protected]> wrote:
> >>>
> >>>> David,
> >>>> LDA or LSI can work quite nicely for similarity (YMMV, of course,
> >>>> depending on the characterization of your documents).
> >>>> You basically use the dot product of the square roots of the vectors
> >>>> for LDA -- if you do a search for Hellinger or Bhattacharyya distance,
> >>>> that will lead you to a good similarity or distance measure.
> >>>> As I recall, Spark does provide an LDA implementation. Gensim provides
> >>>> an API for doing LDA similarity out of the box. Vowpal Wabbit is also
> >>>> worth looking at, particularly for a large dataset.
> >>>> Hope this is useful.
> >>>> Cheers
> >>>>
> >>>> Sent from my iPhone
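The "dot product of the square roots of the vectors" Charles describes is the Bhattacharyya coefficient, from which the Hellinger distance follows directly. A minimal sketch, where p and q would be two documents' LDA topic-distribution vectors:

    // Bhattacharyya coefficient: the dot product of the element-wise
    // square roots of two discrete probability distributions.
    def bhattacharyya(p: Array[Double], q: Array[Double]): Double =
      p.zip(q).map { case (a, b) => math.sqrt(a * b) }.sum

    // Hellinger distance: sqrt(1 - BC). 0 means identical topic
    // distributions, 1 means completely disjoint ones.
    def hellinger(p: Array[Double], q: Array[Double]): Double =
      math.sqrt(math.max(0.0, 1.0 - bhattacharyya(p, q)))

Ranking by descending coefficient and by ascending Hellinger distance gives the same ordering, so either works as the similarity measure.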
> >>>>> On Feb 14, 2016, at 8:14 AM, David Starina <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I need to build a system to determine the N (e.g. 10) most similar
> >>>>> documents to a given document. I have some (theoretical) knowledge
> >>>>> of Mahout algorithms, but not enough to build the system. Can you
> >>>>> give me some suggestions?
> >>>>>
> >>>>> At first I was researching Latent Semantic Analysis for the task,
> >>>>> but since Mahout doesn't support it, I started researching some
> >>>>> other options. I got a hint that instead of LSA, you can use LDA
> >>>>> (Latent Dirichlet allocation) in Mahout to achieve similar and even
> >>>>> better results.
> >>>>>
> >>>>> However ... and this is where I got confused ... LDA is a clustering
> >>>>> algorithm. What I need is not to cluster the documents into N
> >>>>> clusters -- I need a matrix (similar to TF-IDF) from which I can
> >>>>> calculate some sort of distance between any two documents, to get
> >>>>> the N most similar documents for any given document.
> >>>>>
> >>>>> How do I achieve that? My idea (still mostly theoretical, since I
> >>>>> have some problems with running the LDA algorithm) was to extract
> >>>>> some number of topics with LDA, but then not to cluster the
> >>>>> documents with the help of these topics -- instead, to get the
> >>>>> matrix with documents as one dimension and topics as the other. I
> >>>>> was guessing I could then use this matrix as an input to a
> >>>>> row-similarity algorithm.
> >>>>>
> >>>>> Is this the correct concept? Or am I missing something?
> >>>>>
> >>>>> And, since LDA is not supported on Spark/Samsara, how could I
> >>>>> achieve similar results on Spark?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> David
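David's proposed concept -- the document/topic matrix fed to a row-similarity computation -- can be sketched directly on Spark MLlib's LDA output. Untested, and it assumes the default EM optimizer (whose DistributedLDAModel exposes the per-document topic rows) and the hellinger helper sketched above:

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Rows of the document/topic matrix: one topic-distribution
    // vector per document id.
    val model = new LDA().setK(20).run(corpus)
      .asInstanceOf[DistributedLDAModel]   // EM is the default optimizer
    val docTopics: RDD[(Long, Vector)] = model.topicDistributions

    // N most similar documents to a query document, by Hellinger
    // distance over the topic distributions.
    def topSimilar(queryId: Long, n: Int): Array[(Long, Double)] = {
      val query = docTopics.lookup(queryId).head.toArray
      docTopics
        .filter { case (id, _) => id != queryId }
        .map { case (id, v) => (id, hellinger(query, v.toArray)) }
        .takeOrdered(n)(Ordering.by(_._2))  // smallest distance = most similar
    }

This scans every row per query, which is fine for a prototype; at scale it is exactly where Pat's KNN/search-engine suggestion comes in.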

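Pat's word2vec suggestion upthread can also be prototyped without a search engine: train Spark MLlib's Word2Vec, represent each document as the average of its word vectors, and rank by cosine similarity. Note this averaging is a deliberate simplification of what Pat described (he indexes the generated features as fields in a KNN engine); a rough, untested sketch:

    import scala.util.Try
    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    // docs: RDD[(Long, Seq[String])] of (docId, tokenized text)
    val w2v = new Word2Vec().setVectorSize(100).fit(docs.map(_._2))

    // Mean of a document's in-vocabulary word vectors (transform throws
    // for out-of-vocabulary words, hence the Try). Driver-side sketch.
    def docVector(tokens: Seq[String]): Array[Double] = {
      val vecs = tokens.flatMap(t => Try(w2v.transform(t).toArray).toOption)
      if (vecs.isEmpty) new Array[Double](100)  // matches setVectorSize
      else vecs.transpose.map(_.sum / vecs.size).toArray
    }

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val norms = math.sqrt(a.map(x => x * x).sum) *
                  math.sqrt(b.map(x => x * x).sum)
      if (norms == 0) 0.0 else dot / norms
    }

Averaging throws away word order and entity structure, which is exactly the information Pat's NER and n-gram fields are meant to preserve, so treat this as a baseline rather than a substitute for his approach.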