On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote: > Dear Lucene Users, > > I'd like to use Lucene to find scientific papers in the index that are > similar to a given paper from the > index. This seems to be possible using the MoreLikeThis-feature or wrapping > the given document > in a query composed of several other queries (BooleanQuery). The similarity > is calculated > according to Lucene's Practical Scoring Function defined in the JavaDoc of > class Similarity. > > What I am trying to do is to calculate the "semantic document similarity". > One example similarity > function for that purpose is given on page two of the paper "Corpus-based and > Knowledge-based > Measures of Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead > of using the TF and > IDF values, it uses IDF values and the relatednesses between every unique > words in the documents > to compare. First, it sums up the relatednesses of each unique word in > document 1 to its most > related word in document 2 multiplied by its IDF value. The same procedure is > done for document1. > After that, the sums are averaged. >
Interesting. > My question is: Given I am able to store WordNet-Words extracted from the > documents in the > index and pre-calculate the word-word similarities, is it possibe / does it > make sense (e.g. from > the (computational) effort point of view) to adapt the Practical Scoring > Function to such a function > of semantic document similarity? And where (in which class) is the Practical > Scoring Function > implemented, i.e. where are the values of TF, IDF, Boost... put together? > This stuff is all done in the Scorer for a specific query (see TermQuery/TermScorer for an example). Just thinking out loud here, but I think you will need to write your own Query to do this. I'm not entirely certain on what that means for you, though. Seems like a FunctionQuery might help, too. Seems like, just possibly, Lucene is a bit of overkill here other than using it to get IDF values. Can't you just create a big matrix (maybe w/ Hadoop and HBase or something similar) of your precomputed similarities and then just lookups on the document? -------------------------- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org