Dear Lucene Users,
I'd like to use Lucene to find scientific papers in the index that are
similar to a given paper from the index. This seems to be possible using
the MoreLikeThis feature, or by wrapping the given document in a query
composed of several other queries (a BooleanQuery). The similarity is
calculated according to Lucene's Practical Scoring Function, defined in
the JavaDoc of the Similarity class.
What I am trying to do is calculate a "semantic document similarity".
One example of a similarity function for that purpose is given on page
two of the paper "Corpus-based and Knowledge-based Measures of Text
Semantic Similarity" by Rada Mihalcea et al. (formula 1). Instead of
using TF and IDF values, it uses IDF values together with the
relatedness between the unique words of the two documents being
compared. First, for each unique word in document 1 it takes the
relatedness to that word's most related word in document 2, weights it
by the word's IDF value, and sums these up; the sum is normalized by the
sum of the IDF values. The same procedure is done in the other
direction, for the words of document 2. After that, the two normalized
sums are averaged.
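To make the procedure concrete, here is a minimal self-contained sketch
of that formula in plain Java. The IDF values, the relatedness table,
and the two small documents are toy data of my own, not taken from the
paper or from any index; a real implementation would read pre-computed
values from the Lucene index instead.

```java
import java.util.*;

// Sketch of the Mihalcea et al. similarity (formula 1), assuming
// word-word relatedness and IDF values have been pre-computed.
public class SemanticDocSimilarity {

    // Hypothetical pre-computed IDF values (toy data).
    static final Map<String, Double> IDF = Map.of(
        "cell", 1.2, "membrane", 2.0, "protein", 1.5, "enzyme", 1.8);

    // Hypothetical pre-computed word-word relatedness in [0, 1].
    static double relatedness(String w1, String w2) {
        if (w1.equals(w2)) return 1.0;
        Set<String> pair = Set.of(w1, w2);
        if (pair.equals(Set.of("protein", "enzyme"))) return 0.8;
        if (pair.equals(Set.of("cell", "membrane"))) return 0.6;
        return 0.1;
    }

    // One direction of formula 1: for each word, take the relatedness
    // to its best match in the other document, weight it by the word's
    // IDF, sum up, and normalize by the sum of the IDF values.
    static double directedSim(Set<String> from, Set<String> to) {
        double weighted = 0.0, idfSum = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String v : to) best = Math.max(best, relatedness(w, v));
            weighted += best * IDF.get(w);
            idfSum += IDF.get(w);
        }
        return weighted / idfSum;
    }

    // Average the two directed, normalized sums.
    static double sim(Set<String> doc1, Set<String> doc2) {
        return 0.5 * (directedSim(doc1, doc2) + directedSim(doc2, doc1));
    }

    public static void main(String[] args) {
        Set<String> doc1 = Set.of("cell", "protein");
        Set<String> doc2 = Set.of("membrane", "enzyme");
        System.out.println(sim(doc1, doc2));
    }
}
```

With this weighting the score stays in [0, 1], and a document compared
to itself scores exactly 1.0, since every word's best match is itself.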
My question is: Given that I am able to store WordNet words extracted
from the documents in the index and to pre-calculate the word-word
similarities, is it possible / does it make sense (e.g. from a
computational-effort point of view) to adapt the Practical Scoring
Function into such a semantic document similarity function? And where
(in which class) is the Practical Scoring Function implemented, i.e.
where are the values of TF, IDF, boost etc. put together?
Regards,
Mathias Silbermann
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org