Miles Barr wrote:
Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see:
"In order to show you the most relevant results, we have omitted some entries very similar to the 7 already displayed. If you like, you can repeat the search with the omitted results included."
at the bottom of the page.
Is there anything in Lucene or one of the contrib packages that compares two documents?
Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that are similar to it - the MoreLikeThis class uses the heuristic that 2 docs are similar if they share "interesting" words. "Interesting" words are words that are common in a source doc but not too common in the corpus. If you were to do this you'd do something like this:
[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a] generate a similarity query
[2b] requery the index for similar docs
[2c] then, maybe, for every doc from [2b] with a score above some threshold, if it's also high up in the results from [2] then "hide" the doc a la Google et al.
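To make the steps above concrete, here is a rough, self-contained sketch that does not use Lucene at all. It stands in for MoreLikeThis with a plain set overlap (Jaccard) over "interesting" words, and it only compares docs within the result list itself rather than requerying the index. The class name, method names, and the minTf / maxDf / threshold parameters are all made up for illustration; treat it as a sketch of the idea, not as how the sandbox code works.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene or the similarity package.
class NearDuplicateFilter {

    // Lower-cased word tokens; a crude stand-in for a real Analyzer.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // "Interesting" words: occur at least minTf times in the doc,
    // but in no more than maxDf docs of the corpus.
    static Set<String> interestingWords(String doc, List<String> corpus, int minTf, int maxDf) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : tokenize(doc)) tf.merge(w, 1, Integer::sum);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            if (e.getValue() < minTf) continue;
            int df = 0;
            for (String other : corpus) {
                if (tokenize(other).contains(e.getKey())) df++;
            }
            if (df <= maxDf) result.add(e.getKey());
        }
        return result;
    }

    // Jaccard overlap of two word sets, in [0, 1]; replaces the similarity query score.
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) common.size() / union.size();
    }

    // Walks the ranked results ([2]) and returns the indexes of the docs to show.
    // A lower-ranked doc is hidden when it overlaps a shown doc by >= threshold ([2c]).
    static List<Integer> visible(List<String> ranked, int minTf, int maxDf, double threshold) {
        List<Set<String>> words = new ArrayList<>();
        for (String doc : ranked) {
            // Assumption: the result list doubles as the "corpus" here.
            words.add(interestingWords(doc, ranked, minTf, maxDf));
        }
        boolean[] hidden = new boolean[ranked.size()];
        List<Integer> shown = new ArrayList<>();
        for (int i = 0; i < ranked.size(); i++) {
            if (hidden[i]) continue;
            shown.add(i);
            for (int j = i + 1; j < ranked.size(); j++) {
                if (overlap(words.get(i), words.get(j)) >= threshold) hidden[j] = true;
            }
        }
        return shown;
    }
}
```

Calling visible(results, 2, 2, 0.8) on a ranked list of document texts returns the positions to display; everything else would go behind the "omitted results" link. With a real index you would swap interestingWords and overlap for the generated similarity query and its hit scores.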
Could be tricky coding. Note that instead of 1 query you'll have "1+n" queries for the display of "n" search results. Another way is to only show 1 doc from any given domain.
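The one-doc-per-domain idea is much easier to code. A minimal sketch, assuming each hit carries its URL (say, from a stored field you read while looping through the results); the class and method names are hypothetical:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class OnePerDomain {

    // Keeps only the first (highest-ranked) URL for each host, preserving rank order.
    // Note: URI.create throws IllegalArgumentException on a malformed URL.
    static List<String> filter(List<String> rankedUrls) {
        Set<String> seenHosts = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            if (host == null) host = url; // no parseable host: treat the URL as its own group
            if (seenHosts.add(host.toLowerCase())) kept.add(url);
        }
        return kept;
    }
}
```

This needs no extra queries, but it compares host names only, so it says nothing about whether two pages on different hosts are near-duplicates.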
Similarity links:
Source control:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
My weblog entry about the code being checked in:
http://searchmorph.com/weblog/index.php?id=44
Javadoc of it that I host:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
-- Dave