Something we are working on for purely content-based similarity is using a KNN engine (search engine) but creating the features from word2vec and an NER (Named Entity Recognizer).
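A minimal sketch of the content-vector part of that idea, using only the standard library. The tiny vector table and documents below are made up for illustration; a real setup would load a full pretrained word2vec model and add NER-derived fields alongside:

```python
import math

# Hypothetical tiny word2vec table; a real setup loads pretrained vectors.
W2V = {
    "bank":  [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "river": [0.1, 0.9, 0.2],
    "water": [0.0, 0.8, 0.3],
}

def doc_vector(tokens):
    """Average the word2vec vectors of the tokens found in the table."""
    vecs = [W2V[t] for t in tokens if t in W2V]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn(query_tokens, docs, k=2):
    """Treat the query as a document: rank indexed docs by cosine similarity."""
    q = doc_vector(query_tokens)
    scored = [(name, cosine(q, doc_vector(toks))) for name, toks in docs.items()]
    return sorted(scored, key=lambda s: -s[1])[:k]

docs = {
    "finance": ["bank", "money"],
    "nature":  ["river", "water"],
}
print(knn(["money", "bank"], docs, k=1))
```

In a real index each field (raw text, w2v vector, NER entities) would be matched against the corresponding field of the query document, not just a single averaged vector as here.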
Putting the generated features into fields of a doc can really help with similarity, because w2v and NER create semantic features. You can also try n-grams or skip-grams. These features are not very helpful for search, but for similarity they work well. The query to the KNN engine is a document, with each field mapped to the corresponding field of the index. The result is the k nearest neighbors to the query doc.

> On Feb 14, 2016, at 11:05 AM, David Starina <[email protected]> wrote:
>
> Charles, thank you, I will check that out.
>
> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
> data on the usage of the documents (if by usage you mean user behavior).
>
> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <[email protected]> wrote:
>
>> Did you want textual similarity?
>>
>> Or semantic similarity?
>>
>> The actual semantics of a message can be opaque from the content, but clear
>> from the usage.
>>
>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <[email protected]> wrote:
>>
>>> David,
>>> LDA or LSI can work quite nicely for similarity (YMMV of course depending
>>> on the characterization of your documents).
>>> You basically use the dot product of the square roots of the vectors for
>>> LDA -- if you do a search for Hellinger or Bhattacharyya distance, that
>>> will lead you to a good similarity or distance measure.
>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>> looking at, particularly for a large dataset.
>>> Hope this is useful.
>>> Cheers
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 14, 2016, at 8:14 AM, David Starina <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>> documents to a given document. I have some (theoretical) knowledge of
>>>> Mahout algorithms, but not enough to build the system. Can you give me
>>>> some suggestions?
>>>>
>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>> since Mahout doesn't support it, I started researching some other
>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>> Dirichlet allocation) in Mahout to achieve similar and even better
>>>> results.
>>>>
>>>> However ... and this is where I got confused ... LDA is a clustering
>>>> algorithm. What I need is not to cluster the documents into N
>>>> clusters -- I need to get a matrix (similar to TF-IDF) from which I can
>>>> calculate some sort of a distance for any two documents, to get the N
>>>> most similar documents for any given document.
>>>>
>>>> How do I achieve that? My idea was (still mostly theoretical, since I
>>>> have some problems with running the LDA algorithm) to extract some
>>>> number of topics with LDA, but then not to cluster the documents with
>>>> the help of these topics, but rather to get the matrix with documents
>>>> as one dimension and topics as the other dimension. I was guessing I
>>>> could then use this matrix as an input to the row-similarity algorithm.
>>>>
>>>> Is this the correct concept? Or am I missing something?
>>>>
>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>> similar results on Spark?
>>>>
>>>> Thanks in advance,
>>>> David
>>>
>>
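To make the LDA route from the thread concrete: once you have the document-topic matrix, the Hellinger distance Charles mentions is easy to compute per row. A minimal sketch, assuming each document is already a topic-probability row (the matrix below is made up for illustration; Mahout, Spark MLlib, or Gensim would produce the real one):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions (0 = identical)."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def most_similar(doc_id, doc_topics, n=10):
    """Return the n nearest documents to doc_id by Hellinger distance."""
    query = doc_topics[doc_id]
    scored = [(other, hellinger(query, dist))
              for other, dist in doc_topics.items() if other != doc_id]
    return sorted(scored, key=lambda s: s[1])[:n]

# Hypothetical document-topic matrix: one probability row per document.
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
}
print(most_similar("doc_a", doc_topics, n=2))
```

This is exactly the "matrix with documents as one dimension and topics as the other" idea: row similarity over that matrix, just with Hellinger instead of cosine, since the rows are probability distributions.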
