The problem with [2c] is that scores are currently relative, not
absolute, so a fixed threshold does not mean the same thing from one
query to the next.  I am hoping Chuck's patch makes it into the source,
as making scores absolute would be helpful in situations like this one.
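
To illustrate why a fixed cutoff is hard to pick with relative scores,
here is a small sketch.  It is a simplified model, not Lucene's actual
code: it assumes each result list is scaled by its own top score, and
the class and method names (RelativeScores, relativeToTop) are made up
for the example.

```java
final class RelativeScores {

    /**
     * Divides every score by the top score of the same result list,
     * so the best hit always ends up at 1.0 (assumed model of
     * "relative" scoring, for illustration only).
     */
    static float[] relativeToTop(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            // Guard against an empty or all-zero list.
            out[i] = max > 0f ? rawScores[i] / max : 0f;
        }
        return out;
    }
}
```

With raw scores {4.0, 2.0} the second hit comes out as 0.5; with
{2.5, 2.0} the same raw score comes out as 0.8.  A cutoff of 0.7 would
hide the doc in one case and not in the other, even though its raw
score is identical.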

Otis


--- David Spencer <[EMAIL PROTECTED]> wrote:
> Miles Barr wrote:
> 
> > Has anyone tried to remove similar documents from their search results?
> > It looks like Google does some on-the-fly filtering of the results,
> > hiding pages which it thinks are too similar, i.e. when you see:
> > 
> > "In order to show you the most relevant results, we have omitted some
> > entries very similar to the 7 already displayed.
> > If you like, you can repeat the search with the omitted results
> > included."
> > 
> > at the bottom of the page.
> > 
> > Is there anything in Lucene or one of the contrib packages that compares
> > two documents?
> 
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that
> are similar to it. The MoreLikeThis class uses the heuristic that 2
> docs are similar if they share "interesting" words. "Interesting" words
> are words that are common in a source doc but not too common in the
> corpus. If you were to do this you'd do something like this:
> 
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a]  generate a similarity query
> [2b]  requery the index for similar docs
> [2c]  then, maybe, for every doc from [2b] with a score above some
>       threshold, if it's also high up in the results from [2] then
>       "hide" the doc a la Google et al.
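
The steps above can be sketched roughly as follows.  This is only an
outline under stated assumptions, not code from the similarity package:
it makes no Lucene calls, and steps [2a] and [2b] are represented by a
hypothetical function (similarTo) that returns similar doc ids with
their scores.  The names SimilarResultFilter, filter, threshold and
window are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class SimilarResultFilter {

    /**
     * @param ranked    doc ids from the normal query [1], best first
     * @param similarTo hypothetical stand-in for [2a] + [2b]: maps a doc id
     *                  to the ids of similar docs and their scores
     * @param threshold minimum similarity score for a doc to be hidden [2c]
     * @param window    how many top results count as "high up"
     * @return the doc ids to display, in the original order
     */
    static List<String> filter(List<String> ranked,
                               Function<String, Map<String, Float>> similarTo,
                               float threshold, int window) {
        int limit = Math.min(window, ranked.size());
        Set<String> top = new HashSet<>(ranked.subList(0, limit));
        Set<String> hidden = new HashSet<>();
        List<String> shown = new ArrayList<>();

        for (String doc : ranked) {               // [2] loop over the results
            if (hidden.contains(doc)) {
                continue;                         // already judged a near-duplicate
            }
            shown.add(doc);
            // [2a] + [2b]: one extra query per displayed doc
            for (Map.Entry<String, Float> e : similarTo.apply(doc).entrySet()) {
                String other = e.getKey();
                // [2c]: hide docs that are similar enough and also rank high
                if (!other.equals(doc)
                        && e.getValue() >= threshold
                        && top.contains(other)) {
                    hidden.add(other);
                }
            }
        }
        return shown;
    }
}
```

A doc that has already been displayed is never removed again, so the
higher-ranked member of a pair of similar docs is the one that stays
visible.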
> 
> Could be tricky coding. Another way is to only show 1 doc from any given
> domain. Note that instead of 1 query you'll have "1+n" queries for the
> display of "n" search results.
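
The one-doc-per-domain idea can be sketched like this.  It assumes each
hit carries its URL as a plain string and treats the host name as the
"domain", so www.example.com and example.com count as different; the
names OnePerDomain, collapse and hostOf are made up for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class OnePerDomain {

    /** Keeps the first (highest-ranked) URL for each host, preserving order. */
    static List<String> collapse(List<String> rankedUrls) {
        Set<String> seenHosts = new HashSet<>();
        List<String> shown = new ArrayList<>();
        for (String url : rankedUrls) {
            String host = hostOf(url);
            // URLs whose host cannot be determined are always shown.
            if (host == null || seenHosts.add(host)) {
                shown.add(url);
            }
        }
        return shown;
    }

    /** Returns the lower-cased host of a URL, or null if it has none. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // not a syntactically valid URI
        }
    }
}
```

Unlike the similarity approach, this needs no extra queries, only one
pass over the results already fetched.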
> 
> 
> 
> 
> Similarity links:
> 
> Source control:
> 
> 
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
> 
> My weblog entry about the code being checked in:
> 
>       http://searchmorph.com/weblog/index.php?id=44
> 
> Javadoc of it that I host:
> 
> 
>       http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
> 
> 
> -- Dave
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

