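The problem with [2c] is that scores are currently relative, not absolute: a score only ranks a document against the other hits for the same query, so a fixed threshold does not mean the same thing from one query to the next. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one.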
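For what it's worth, here is a rough sketch of steps [1] to [2c] from the quoted message below. It is plain Python and index-agnostic, not Lucene code: `hide_near_duplicates`, `find_similar`, `ratio` and `window` are all names I made up for illustration. Since raw scores are relative, the sketch assumes the source document matches its own similarity query and uses that self-score as the reference point. That is only one possible workaround, not something the similarity package does for you.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (doc id, raw score), ordered best first


def hide_near_duplicates(
    hits: List[Hit],
    find_similar: Callable[[str], List[Hit]],
    ratio: float = 0.8,
    window: int = 10,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (visible doc ids, {hidden doc id: doc id it duplicates}).

    hits         -- results of the normal query, step [1]
    find_similar -- hypothetical callback standing in for steps [2a] + [2b]:
                    build a similarity query for a doc and run it
    ratio        -- assumed cutoff, as a fraction of the source doc's own score
    window       -- how far down the results from [1] counts as "high up"
    """
    top_ids = {doc_id for doc_id, _ in hits[:window]}
    visible: List[str] = []
    seen = set()                  # ids already shown
    hidden: Dict[str, str] = {}

    for doc_id, _ in hits:        # step [2]: loop through the results
        if doc_id in hidden:
            continue              # already folded into an earlier hit
        visible.append(doc_id)
        seen.add(doc_id)

        similar = dict(find_similar(doc_id))      # one extra query per shown doc
        self_score = similar.get(doc_id, 0.0)
        if self_score <= 0:
            continue              # no reference point, so hide nothing

        for other_id, score in similar.items():   # step [2c]
            if other_id in seen or other_id in hidden:
                continue
            if other_id in top_ids and score / self_score >= ratio:
                hidden[other_id] = doc_id

    return visible, hidden


if __name__ == "__main__":
    results = [("a", 0.9), ("b", 0.8), ("c", 0.5)]
    similar_hits = {
        "a": [("a", 4.0), ("b", 3.6), ("c", 0.4)],
        "b": [("b", 4.0), ("a", 3.6)],
        "c": [("c", 2.0), ("a", 0.2)],
    }
    print(hide_near_duplicates(results, lambda d: similar_hits[d]))
```

The callback is invoked once for every document that ends up displayed, which is the "1+n" query cost mentioned in the quoted message.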
Otis

--- David Spencer <[EMAIL PROTECTED]> wrote:

> Miles Barr wrote:
>
> > Has anyone tried to remove similar documents from their search results?
> > It looks like Google does some on the fly filtering of the results,
> > hiding pages which it thinks are too similar, i.e. when you see:
> >
> > "In order to show you the most relevant results, we have omitted some
> > entries very similar to the 7 already displayed.
> > If you like, you can repeat the search with the omitted results
> > included."
> >
> > at the bottom of the page.
> >
> > Is there anything in Lucene or one of the contrib packages that compares
> > two documents?
>
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that
> are similar to it - the MoreLikeThis class uses the heuristic that 2
> docs are similar if they share "interesting" words. "Interesting" words
> are words that are common in a source doc but not too common in the
> corpus. If you were to do this you'd do something like this:
>
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a] generate a similarity query
> [2b] requery the index for similar docs
> [2c] then, maybe, for every doc from [2b] with a score above some
> threshold, if it's also high up in the results from [2] then "hide" the
> doc a la Google et al.
>
> Could be tricky coding. Another way is to only show 1 doc from any given
> domain. Note that instead of 1 query you'll have "1+n" queries for the
> display of "n" search results.
>
> Similarity links:
>
> Source control:
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
>
> My weblog entry about the code being checked in:
> http://searchmorph.com/weblog/index.php?id=44
>
> Javadoc of it that I host:
> http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
>
> -- Dave
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]