a "fair" similarity

Daniel Naber Mon, 14 Aug 2006 17:21:05 -0700

Hi,

As some of you may have noticed, Lucene prefers shorter documents over 
longer ones, i.e. shorter documents get a higher ranking even if the 
ratio "matched terms / total terms in document" is the same.


For example, take these two artificial documents:

doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

When searching for "x", doc1 gets a higher ranking, even though "x" 
makes up 1/10 of the terms in both documents.

Using this similarity implementation seems to "fix" that:

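import org.apache.lucene.search.DefaultSimilarity;

class MySim extends DefaultSimilarity {

  // Length normalization without the default square root: 1 / numTerms.
  public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / numTerms);
  }

  // Term frequency factor without the default square root: the raw frequency.
  public float tf(float freq) {
    return freq;
  }

}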

It's basically just the default implementation with Math.sqrt() removed. Is 
this the correct approach? Should I expect any problems? So far I have only 
tested it with the two documents above.
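
As a quick check of the arithmetic, here is a minimal standalone sketch of the two overridden factors applied to doc1 and doc2. It has no Lucene dependency, and the class name FairSimilaritySketch and the helper termWeight are made up for illustration. It only multiplies tf by lengthNorm; it ignores idf, coord, query normalization, boosts, and the way norms are stored in the index, so it is not a reproduction of the real score.

```java
public class FairSimilaritySketch {

    // Same formula as MySim.lengthNorm above: 1 / numTerms, no square root.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / numTerms);
    }

    // Same formula as MySim.tf above: the raw term frequency.
    static float tf(float freq) {
        return freq;
    }

    // Hypothetical helper: the product of the two factors, i.e. freq / numTerms.
    static float termWeight(float freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // doc1: "x" occurs once among 10 terms.
        System.out.println("doc1: " + termWeight(1, 10));
        // doc2: "x" occurs twice among 20 terms.
        System.out.println("doc2: " + termWeight(2, 20));
    }
}
```

With the square roots removed, the product is simply freq / numTerms, so two documents with the same ratio of matched terms to total terms get the same value from these two factors.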

The use case is that I want to boost fields, e.g. "body:foo^2 title:blah". 
This could lead to strange results if title is already preferred just 
because it's shorter.

Regards
 Daniel

-- 
http://www.danielnaber.de
