Well, there is also an online method of LDA in Spark ... Pat, is there any documentation on the method you described?
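For reference, the online method David mentions is presumably Spark MLlib's OnlineLDAOptimizer (available since Spark 1.4). A minimal, untested sketch of selecting it, assuming corpus is an RDD[(Long, Vector)] of (docId, term-count vector) pairs:

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

    // Online variational Bayes LDA (Hoffman et al. 2010): the optimizer
    // consumes the corpus in mini-batches rather than full EM passes,
    // which makes it practical for large corpora.
    val onlineModel = new LDA()
      .setK(10)                // number of topics
      .setMaxIterations(100)   // for the online optimizer: mini-batches submitted
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
      .run(corpus)             // corpus: RDD[(Long, Vector)] of term counts

The online optimizer returns a LocalLDAModel, whose topicDistributions method can infer topic mixtures for new documents against the fixed topics, which is essentially the fold-in approach David asks about below.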
On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel <[email protected]> wrote:

> The method I described calculates similarity on the fly but requires new
> docs to go through feature extraction before similarity can be queried.
> The time needed for feature extraction is short compared to training LDA.
>
> Another method that gets at semantic similarity uses adaptive skip-grams
> for text features: http://arxiv.org/abs/1502.07257 I haven't tried this,
> but a friend saw a presentation about using this method to create features
> for a search engine, which showed a favorable comparison with word2vec.
>
> If you want to use LDA, note that it is an unsupervised categorization
> method. To use it, the cluster descriptors (a vector of important terms)
> can be compared to the analyzed incoming document using a KNN/search
> engine. This will give you a list of the closest clusters, but it doesn't
> really give you documents, which is your goal, I think. LDA should be
> re-run periodically to generate new clusters. Do you want to know cluster
> inclusion, or do you want a list of similar docs?
>
> On Feb 23, 2016, at 1:01 PM, David Starina <[email protected]> wrote:
>
> Guys, one more question ... Are there any incremental methods to do this?
> I don't want to run the whole job again once a new document is added. In
> the case of LDA ... I guess the best way is to calculate the topics on the
> new document using the topics from the previous LDA run, and then every
> once in a while recalculate the topics with the new documents?
>
> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <[email protected]> wrote:
>
> > Something we are working on for purely content-based similarity is using
> > a KNN engine (search engine), but creating features from word2vec and an
> > NER (Named Entity Recognizer).
> >
> > Putting the generated features into fields of a doc can really help with
> > similarity, because w2v and NER create semantic features. You can also
> > try n-grams or skip-grams. These features are not very helpful for
> > search, but for similarity they work well.
> >
> > The query to the KNN engine is a document, each field mapped to the
> > corresponding field of the index. The result is the k nearest neighbors
> > to the query doc.
> >
> >> On Feb 14, 2016, at 11:05 AM, David Starina <[email protected]> wrote:
> >>
> >> Charles, thank you, I will check that out.
> >>
> >> Ted, I am looking for semantic similarity. Unfortunately, I do not have
> >> any data on the usage of the documents (if by usage you mean user
> >> behavior).
> >>
> >> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <[email protected]> wrote:
> >>
> >>> Did you want textual similarity?
> >>>
> >>> Or semantic similarity?
> >>>
> >>> The actual semantics of a message can be opaque from the content, but
> >>> clear from the usage.
> >>>
> >>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <[email protected]> wrote:
> >>>
> >>>> David,
> >>>> LDA or LSI can work quite nicely for similarity (YMMV, of course,
> >>>> depending on the characterization of your documents).
> >>>> You basically use the dot product of the square roots of the vectors
> >>>> for LDA -- if you do a search for Hellinger or Bhattacharyya distance,
> >>>> that will lead you to a good similarity or distance measure.
> >>>> As I recall, Spark does provide an LDA implementation. Gensim provides
> >>>> an API for doing LDA similarity out of the box. Vowpal Wabbit is also
> >>>> worth looking at, particularly for a large dataset.
> >>>> Hope this is useful.
> >>>> Cheers
> >>>>
> >>>> Sent from my iPhone
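The "dot product of the square roots of the vectors" Charles describes is the Bhattacharyya coefficient, from which the Hellinger distance follows directly. A minimal sketch, where p and q would be two documents' LDA topic-distribution vectors:

    // Bhattacharyya coefficient: the dot product of the element-wise
    // square roots of two discrete probability distributions.
    def bhattacharyya(p: Array[Double], q: Array[Double]): Double =
      p.zip(q).map { case (a, b) => math.sqrt(a * b) }.sum

    // Hellinger distance: sqrt(1 - BC). 0 means identical topic
    // distributions, 1 means completely disjoint ones.
    def hellinger(p: Array[Double], q: Array[Double]): Double =
      math.sqrt(math.max(0.0, 1.0 - bhattacharyya(p, q)))

Ranking by descending coefficient and by ascending Hellinger distance gives the same ordering, so either works as the similarity measure.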
> >>>>> On Feb 14, 2016, at 8:14 AM, David Starina <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I need to build a system to determine the N (e.g. 10) most similar
> >>>>> documents to a given document. I have some (theoretical) knowledge
> >>>>> of Mahout algorithms, but not enough to build the system. Can you
> >>>>> give me some suggestions?
> >>>>>
> >>>>> At first I was researching Latent Semantic Analysis for the task,
> >>>>> but since Mahout doesn't support it, I started researching some
> >>>>> other options. I got a hint that instead of LSA, you can use LDA
> >>>>> (Latent Dirichlet allocation) in Mahout to achieve similar and even
> >>>>> better results.
> >>>>>
> >>>>> However ... and this is where I got confused ... LDA is a clustering
> >>>>> algorithm. What I need is not to cluster the documents into N
> >>>>> clusters -- I need a matrix (similar to TF-IDF) from which I can
> >>>>> calculate some sort of distance between any two documents, to get
> >>>>> the N most similar documents for any given document.
> >>>>>
> >>>>> How do I achieve that? My idea (still mostly theoretical, since I
> >>>>> have some problems with running the LDA algorithm) was to extract
> >>>>> some number of topics with LDA, but then not to cluster the
> >>>>> documents with the help of these topics -- instead, to get the
> >>>>> matrix with documents as one dimension and topics as the other. I
> >>>>> was guessing I could then use this matrix as an input to a
> >>>>> row-similarity algorithm.
> >>>>>
> >>>>> Is this the correct concept? Or am I missing something?
> >>>>>
> >>>>> And, since LDA is not supported on Spark/Samsara, how could I
> >>>>> achieve similar results on Spark?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> David
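David's proposed concept -- the document/topic matrix fed to a row-similarity computation -- can be sketched directly on Spark MLlib's LDA output. Untested, and it assumes the default EM optimizer (whose DistributedLDAModel exposes the per-document topic rows) and the hellinger helper sketched above:

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Rows of the document/topic matrix: one topic-distribution
    // vector per document id.
    val model = new LDA().setK(20).run(corpus)
      .asInstanceOf[DistributedLDAModel]   // EM is the default optimizer
    val docTopics: RDD[(Long, Vector)] = model.topicDistributions

    // N most similar documents to a query document, by Hellinger
    // distance over the topic distributions.
    def topSimilar(queryId: Long, n: Int): Array[(Long, Double)] = {
      val query = docTopics.lookup(queryId).head.toArray
      docTopics
        .filter { case (id, _) => id != queryId }
        .map { case (id, v) => (id, hellinger(query, v.toArray)) }
        .takeOrdered(n)(Ordering.by(_._2))  // smallest distance = most similar
    }

This scans every row per query, which is fine for a prototype; at scale it is exactly where Pat's KNN/search-engine suggestion comes in.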

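Pat's word2vec suggestion upthread can also be prototyped without a search engine: train Spark MLlib's Word2Vec, represent each document as the average of its word vectors, and rank by cosine similarity. Note this averaging is a deliberate simplification of what Pat described (he indexes the generated features as fields in a KNN engine); a rough, untested sketch:

    import scala.util.Try
    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    // docs: RDD[(Long, Seq[String])] of (docId, tokenized text)
    val w2v = new Word2Vec().setVectorSize(100).fit(docs.map(_._2))

    // Mean of a document's in-vocabulary word vectors (transform throws
    // for out-of-vocabulary words, hence the Try). Driver-side sketch.
    def docVector(tokens: Seq[String]): Array[Double] = {
      val vecs = tokens.flatMap(t => Try(w2v.transform(t).toArray).toOption)
      if (vecs.isEmpty) new Array[Double](100)  // matches setVectorSize
      else vecs.transpose.map(_.sum / vecs.size).toArray
    }

    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = a.zip(b).map { case (x, y) => x * y }.sum
      val norms = math.sqrt(a.map(x => x * x).sum) *
                  math.sqrt(b.map(x => x * x).sum)
      if (norms == 0) 0.0 else dot / norms
    }

Averaging throws away word order and entity structure, which is exactly the information Pat's NER and n-gram fields are meant to preserve, so treat this as a baseline rather than a substitute for his approach.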