On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote:

> Dear Lucene Users,
> 
> I'd like to use Lucene to find scientific papers in the index that are
> similar to a given paper from the index. This seems to be possible using
> the MoreLikeThis feature or by wrapping the given document in a query
> composed of several other queries (BooleanQuery). The similarity is
> calculated according to Lucene's Practical Scoring Function, defined in
> the JavaDoc of class Similarity.
> 
> What I am trying to do is to calculate the "semantic document similarity".
> One example similarity function for that purpose is given on page two of
> the paper "Corpus-based and Knowledge-based Measures of Text Semantic
> Similarity" by Rada Mihalcea (formula 1). Instead of using the TF and IDF
> values, it uses IDF values and the relatedness between every pair of
> unique words in the documents to compare. First, it sums up the
> relatedness of each unique word in document 1 to its most related word in
> document 2, multiplied by its IDF value. The same procedure is done for
> document 2. After that, the sums are averaged.
> 

Interesting.

> My question is: Given I am able to store WordNet words extracted from the
> documents in the index and pre-calculate the word-word similarities, is it
> possible / does it make sense (e.g. from the computational effort point of
> view) to adapt the Practical Scoring Function to such a function of
> semantic document similarity? And where (in which class) is the Practical
> Scoring Function implemented, i.e. where are the values of TF, IDF,
> Boost... put together?
> 

This stuff is all done in the Scorer for a specific query (see 
TermQuery/TermScorer for an example).  
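
As a rough illustration of what gets put together for a single term, here is
a small Python sketch modeled on the default formulas described in the
Similarity JavaDoc (tf as the square root of the term frequency, idf as
1 + ln(numDocs / (docFreq + 1))). It is not Lucene code: the function and
parameter names are made up for this example, and coord and the summation
over query terms are left out.

```python
import math


def tf(freq):
    # Default term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Default inverse document frequency factor.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_score(freq, doc_freq, num_docs,
               boost=1.0, query_norm=1.0, field_norm=1.0):
    # Query-side weight: idf enters twice (query weight and document weight).
    weight = idf(doc_freq, num_docs) ** 2 * boost * query_norm
    # Document-side part: tf times the stored field norm.
    return tf(freq) * weight * field_norm


if __name__ == "__main__":
    # A term occurring 4 times in a document, found in 9 of 10 documents.
    print(term_score(freq=4, doc_freq=9, num_docs=10))
```

A semantic similarity would have to replace the tf part with something that
looks at other terms, which is why a plain TermScorer does not fit.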

Just thinking out loud here, but I think you will need to write your own Query 
to do this. I'm not entirely certain what that means for you, though. A 
FunctionQuery might help, too. It also seems, just possibly, that Lucene is a 
bit of overkill here other than as a source of IDF values. Can't you just 
create a big matrix (maybe w/ Hadoop and HBase or something similar) of your 
precomputed similarities and then just do lookups for each document?
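
To make the lookup idea concrete, here is a minimal Python sketch of the
similarity described above, computed from a precomputed word-word table and
an IDF map held in memory. Everything in it is an assumption for
illustration: the names are hypothetical, the table is a plain dict rather
than HBase, identical words are given relatedness 1.0, unknown pairs 0.0,
and each directional sum is divided by the sum of the IDF values, which is
how I read formula 1 in the paper.

```python
def make_relatedness(table):
    """Wrap a dict keyed by frozenset({w1, w2}) as a lookup function."""
    def relatedness(w1, w2):
        if w1 == w2:
            return 1.0  # assumption: a word is fully related to itself
        return table.get(frozenset((w1, w2)), 0.0)  # unknown pair -> 0.0
    return relatedness


def directional_score(words_a, words_b, relatedness, idf):
    """IDF-weighted best-match relatedness of words_a against words_b."""
    weighted = 0.0
    total_idf = 0.0
    for w in words_a:
        # Most related word in the other document (0.0 if it is empty).
        best = max((relatedness(w, v) for v in words_b), default=0.0)
        weight = idf.get(w, 0.0)
        weighted += best * weight
        total_idf += weight
    return weighted / total_idf if total_idf else 0.0


def semantic_similarity(doc1, doc2, relatedness, idf):
    """Average of the two directional scores over the unique words."""
    words1, words2 = set(doc1), set(doc2)
    return 0.5 * (directional_score(words1, words2, relatedness, idf)
                  + directional_score(words2, words1, relatedness, idf))


if __name__ == "__main__":
    # Toy values, invented for this example only.
    table = {frozenset(("car", "automobile")): 0.9}
    idf = {"car": 2.0, "automobile": 2.0, "road": 1.0}
    rel = make_relatedness(table)
    print(semantic_similarity(["car", "road"], ["automobile", "road"], rel, idf))
```

The cost is one lookup per word pair, so |doc1| * |doc2| lookups per
document pair; that is the part that would need the big precomputed matrix.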

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search


