> > I would like to implement the Okapi BM25 weighting function 
> > using my own Similarity implementation. Unfortunately BM25 
> > requires the document length in the score calculation, which 
> > is not provided by the Scorer.
> 
> How do you want to measure document length?  If the number of 
> tokens is an acceptable measure, then the norm contains 
> sqrt(numTokens) by default.  You can modify your 
> Similarity.lengthNorm() implementation to not perform the 
> sqrt, or square the norm.

I assume the number of tokens will be a good estimate.
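If I follow your suggestion, a minimal sketch of the change might look like this (assuming the current Similarity/DefaultSimilarity API; the class name BM25Similarity is just my own placeholder):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: drop the sqrt from the default length normalization so the
    // stored norm carries 1/numTokens instead of 1/sqrt(numTokens).
    public class BM25Similarity extends DefaultSimilarity {
      public float lengthNorm(String fieldName, int numTokens) {
        // DefaultSimilarity returns 1/sqrt(numTokens); without the sqrt
        // the norm reflects the document length directly.
        return 1.0f / numTokens;
      }
    }

(If I read the code correctly, the norm is encoded into a single byte on disk, so whatever lengthNorm returns is stored with limited precision.)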

I've attached an image with the algorithm (my ASCII art isn't that good); a LaTeX transcription follows the legend below.
Legend of the figure:
- k1, k3 and b are constants
- tf is the within document term frequency
- df is the document frequency
- N is the collection size
- r is the number of relevant documents containing a particular term (assumed to be 0 without relevance information)
- R is the number of documents known to be relevant to a specific topic (assumed to be 0 without relevance information)
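In case the attachment gets stripped by the list, here is the per-term weight transcribed into LaTeX. I am assuming the standard Robertson et al. formulation; dl, avdl and qtf (document length, average document length, within-query term frequency) are not in the legend above:

    \[
      w(t) \;=\;
        \log \frac{(r + 0.5)\,(N - df - R + r + 0.5)}
                  {(df - r + 0.5)\,(R - r + 0.5)}
        \cdot \frac{(k_1 + 1)\, tf}{K + tf}
        \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
      \qquad
      K = k_1 \Bigl( (1 - b) + b \,\frac{dl}{avdl} \Bigr)
    \]

With r = R = 0 the first factor reduces to \(\log \frac{N - df + 0.5}{df + 0.5}\), an idf-like weight.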

As far as I understand, Lucene multiplies the squared term weight by the result of 
Similarity.lengthNorm(), but BM25 needs the document length inside the 
document term weight itself (as far as I know it is not possible to extract 
the influence of the length normalization as a constant multiplier).
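To spell out what I mean, the document-side factor from the formula above is

    \[
      \frac{(k_1 + 1)\, tf}{k_1 \bigl( (1 - b) + b \,\frac{dl}{avdl} \bigr) + tf}
    \]

and dl appears inside the denominator, added to tf, so (unless b = 0) the length normalization cannot be pulled out as a per-document constant multiplier the way Lucene applies the norm.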

Am I missing something here?

Dolf


