Otis Gospodnetic wrote:
The problem with 2c is that scores are currently relative, and not absolute. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one.
Good point.
However, if the original MoreLikeThis query allows the source doc to be returned, its score might be used to normalize the scores of the other hits.
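A rough sketch of that normalization idea, assuming the similarity query returns the source doc along with everything else. The class name, the map of doc id to raw score, and the method are all made up for illustration; nothing here is Lucene API. Each raw score is divided by the source doc's own score, so the source doc comes out at 1.0 and the others land on a comparable scale.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: rescales similarity scores
// relative to the score the source document got against its own query.
final class SelfScoreNormalizer {

    // rawScores: doc id -> raw score from the similarity query (assumed to
    // include the source doc itself). Returns doc id -> score / selfScore.
    static Map<Integer, Float> normalize(Map<Integer, Float> rawScores, int sourceDocId) {
        Float self = rawScores.get(sourceDocId);
        if (self == null || self <= 0f) {
            // Without the source doc in the hits there is nothing to normalize against.
            throw new IllegalArgumentException("source doc missing from results");
        }
        Map<Integer, Float> normalized = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> e : rawScores.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / self);
        }
        return normalized;
    }
}
```

With scores normalized this way, a fixed threshold in step 2c at least means the same thing from one source doc to the next, though whether it is a good measure of similarity is still an open question.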
Otis
--- David Spencer <[EMAIL PROTECTED]> wrote:
Miles Barr wrote:
Has anyone tried to remove similar documents from their search results?
It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see:
"In order to show you the most relevant results, we have omitted some entries very similar to the 7 already displayed. If you like, you can repeat the search with the omitted results included."
at the bottom of the page.
Is there anything in Lucene or one of the contrib packages that compares two documents?
Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that are similar to it. The MoreLikeThis class uses the heuristic that 2 docs are similar if they share "interesting" words. "Interesting" words are words that are common in a source doc but not too common in the corpus. If you were to do this you'd do something like this:
[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a] generate a similarity query
[2b] requery the index for similar docs
[2c] then, maybe, for every doc from [2b] with a score above some threshold, if it's also high up in the results from [2] then "hide" the doc à la Google et al.
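The steps above could be sketched roughly as follows. This is plain Java with no Lucene calls: the similarTo callback is a hypothetical stand-in for steps 2a and 2b (build a similarity query for a doc, run it, return doc id to score), and the class and parameter names are invented for the example. As a simplification, "high up in the results" is taken to mean "anywhere in the result list being displayed".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;

// Hypothetical sketch of steps [1]-[2c]; not the sandbox code itself.
final class NearDuplicateFilter {

    // ranked:    doc ids from the normal query [1], best first
    // similarTo: assumed callback for [2a]+[2b], returns doc id -> similarity score
    // threshold: similarity score above which a doc counts as "too similar"
    static List<Integer> filter(List<Integer> ranked,
                                IntFunction<Map<Integer, Float>> similarTo,
                                float threshold) {
        Set<Integer> inResults = new HashSet<>(ranked);
        Set<Integer> hidden = new HashSet<>();
        List<Integer> shown = new ArrayList<>();
        for (int docId : ranked) {
            if (hidden.contains(docId)) {
                continue; // already judged too similar to a higher-ranked doc
            }
            shown.add(docId);
            // [2c] hide docs that are similar enough and also in the result list
            for (Map.Entry<Integer, Float> e : similarTo.apply(docId).entrySet()) {
                int other = e.getKey();
                if (other != docId && e.getValue() > threshold && inResults.contains(other)) {
                    hidden.add(other);
                }
            }
        }
        return shown;
    }
}
```

Docs that have already been hidden are skipped without running their own similarity query, which trims the query count a little when there are many near-duplicates.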
Could be tricky coding. Another way is to only show 1 doc from any given domain. Note that instead of 1 query you'll have "1+n" queries for the display of "n" search results.
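The one-doc-per-domain alternative needs no extra queries at all. A minimal sketch, assuming each hit carries its URL (say, in a stored field) and that "domain" is approximated by the URL's host name; the class name is invented for the example:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: keep only the best-ranked hit per host.
final class OnePerDomain {

    // rankedUrls: URLs of the hits, best first. Returns the URLs to display.
    // URI.create throws IllegalArgumentException on a malformed URL.
    static List<String> filter(List<String> rankedUrls) {
        Set<String> seenHosts = new HashSet<>();
        List<String> shown = new ArrayList<>();
        for (String url : rankedUrls) {
            String host = URI.create(url).getHost();
            // Fall back to the whole URL when no host can be parsed out of it.
            String key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
            if (seenHosts.add(key)) { // add() returns false for a host already seen
                shown.add(url);
            }
        }
        return shown;
    }
}
```

Note that this treats www.example.com and example.com as different domains; folding those together would need extra rules.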
Similarity links:
Source control:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
My weblog entry about the code being checked in:
http://searchmorph.com/weblog/index.php?id=44
Javadoc of it that I host:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
-- Dave
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]