> > I would like to implement the Okapi BM25 weighting function
> > using my own Similarity implementation. Unfortunately BM25
> > requires the document length in the score calculation, which
> > is not provided by the Scorer.
>
> How do you want to measure document length? If the number of
> tokens is an acceptable measure, then the norm contains
> sqrt(numTokens) by default. You can modify your
> Similarity.lengthNorm() implementation to not perform the
> sqrt, or square the norm.
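A minimal sketch of that suggestion, assuming the Lucene 1.4-era
Similarity API: override lengthNorm() to store 1/numTokens instead of
the default 1/sqrt(numTokens), so that an approximate document length
can be recovered at search time. Class and method names here are
illustrative, not part of Lucene itself:

    import org.apache.lucene.search.DefaultSimilarity;

    public class LengthPreservingSimilarity extends DefaultSimilarity {
        // DefaultSimilarity returns 1/sqrt(numTokens); dropping the
        // sqrt keeps the full length information (up to the lossy
        // one-byte norm encoding).
        public float lengthNorm(String fieldName, int numTokens) {
            return 1.0f / numTokens;
        }
    }

At search time, 1f / Similarity.decodeNorm(norms[doc]) would then give
back an approximate numTokens; note that norms are stored in a single
byte, so the recovered length is only approximate.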
I assume the number of tokens will be a good estimate.

I've included an image with the algorithm (my ASCII art isn't that
good). Legend of the figure (a sketch of the formula it defines
follows below):

- k1, k3 and b are constants
- tf is the within-document term frequency
- df is the document frequency
- N is the collection size
- r is the number of relevant documents containing a particular term
  (assumed to be 0 without relevance information)
- R is the number of items known to be relevant to a specific topic
  (assumed to be 0 without relevance information)

As far as I understand, Lucene multiplies the squared weight by the
result of Similarity.lengthNorm(), but BM25 requires the document
length for the calculation of the document term weight, and as far as
I know it is not possible to extract the influence of the
normalization as a constant multiplier. Am I missing something here?

Dolf
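P.S. In case the image doesn't come through: assuming the standard
Robertson et al. formulation that the legend describes, the per-term
weight can be sketched in Java roughly as below. The symbols qtf
(within-query term frequency), dl (document length) and avdl (average
document length), as well as the constant values, are additions to the
legend, chosen here purely for illustration:

    public class BM25Weight {
        // Typical constant values, assumed here for illustration:
        private static final float K1 = 1.2f;
        private static final float K3 = 8.0f;
        private static final float B  = 0.75f;

        /**
         * tf   - within-document term frequency
         * qtf  - within-query term frequency
         * df   - document frequency of the term
         * n    - collection size (N in the legend)
         * dl   - document length in tokens
         * avdl - average document length in the collection
         */
        public static float weight(float tf, float qtf, int df, long n,
                                   float dl, float avdl) {
            // Robertson/Sparck Jones relevance weight; with r = R = 0
            // it reduces to this idf-like log term:
            float idf = (float) Math.log((n - df + 0.5) / (df + 0.5));
            // Document-side tf saturation: dl enters through K, which
            // is why the length cannot be pulled out of the weight as
            // a constant multiplier:
            float k = K1 * ((1 - B) + B * dl / avdl);
            float docPart = ((K1 + 1) * tf) / (k + tf);
            // Query-side tf saturation:
            float queryPart = ((K3 + 1) * qtf) / (K3 + qtf);
            return idf * docPart * queryPart;
        }
    }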