[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704438#comment-16704438 ]
ASF subversion and git services commented on LUCENE-8563:
---------------------------------------------------------
Commit cf016f8987e804bcd858a2a414eacdf1b3c54cf5 in lucene-solr's branch
refs/heads/master from javanna
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf016f8 ]
LUCENE-8563: Remove k1+1 constant factor from BM25 formula numerator.
Signed-off-by: Adrien Grand <[email protected]>
> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted, and I found that the paper "The Probabilistic
> Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
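To make the ordering argument above concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: the class name, the method names, and the sample tf/norm/IDF values are all made up for illustration, boost is taken to be 1, and k1 = 1.2 is assumed. It scores a two-term query against four documents with and without the (k1+1) factor and compares the resulting rankings.

```java
import java.util.Arrays;
import java.util.Comparator;

class Bm25ConstantFactor {
    // Sum over query terms of idf * [(k1+1) *] tf / (tf + norm), following the
    // formula quoted in the issue. boost is assumed to be 1 and is left out.
    static double score(double[] idf, double[] tf, double norm, double k1, boolean includeFactor) {
        double factor = includeFactor ? (k1 + 1) : 1.0;
        double sum = 0;
        for (int t = 0; t < idf.length; t++) {
            sum += idf[t] * factor * tf[t] / (tf[t] + norm);
        }
        return sum;
    }

    // Document indices sorted by descending score.
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double k1 = 1.2;                                   // assumed value
        double[] idf = {3.0, 1.0};                         // hypothetical per-term IDFs
        double[][] tf = {{1, 0}, {4, 2}, {2, 5}, {0, 7}};  // hypothetical term frequencies per doc
        double[] norm = {1.5, 0.8, 2.0, 1.2};              // hypothetical length norms per doc

        double[] withFactor = new double[tf.length];
        double[] withoutFactor = new double[tf.length];
        for (int d = 0; d < tf.length; d++) {
            withFactor[d] = score(idf, tf[d], norm[d], k1, true);
            withoutFactor[d] = score(idf, tf[d], norm[d], k1, false);
        }
        // Scores differ by the constant k1+1, so both rankings print identically.
        System.out.println(Arrays.toString(ranking(withFactor)));
        System.out.println(Arrays.toString(ranking(withoutFactor)));
    }
}
```

Every per-term contribution is scaled by the same constant, so the summed score of each document is scaled by that constant too, and the relative order cannot change. Only the absolute score values differ.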
> A side-effect that I'm interested in is that integrating other score
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to
> reason about. For instance, a weight of 3 in FeatureField#newSaturationQuery
> would have an impact similar to that of a term whose IDF is 3 (and thus
> docFreq ~= 5%) rather than a term whose IDF is 3/(k1 + 1).
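The arithmetic behind the "IDF is 3, docFreq ~= 5%" remark can be checked with a short sketch. It assumes the IDF form ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and k1 = 1.2. The class and method names are hypothetical and nothing here calls Lucene. The inverse mapping from IDF to document-frequency fraction is an approximation for large indexes, derived from that same formula.

```java
class Bm25IdfScale {
    // IDF in the form assumed above: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Approximate inverse for large N: e^idf = (N + 1) / (n + 0.5), so n / N is about e^-idf.
    static double approxDocFreqFraction(double idf) {
        return Math.exp(-idf);
    }

    // IDF of the term whose score ceiling equals a given feature weight.
    // tf / (tf + norm) approaches 1 as tf grows, so a term's ceiling is idf * (k1 + 1)
    // with the constant factor and just idf without it.
    static double equivalentIdf(double weight, double k1, boolean includeFactor) {
        return includeFactor ? weight / (k1 + 1) : weight;
    }

    public static void main(String[] args) {
        double k1 = 1.2;      // assumed value
        double weight = 3.0;  // the feature weight used as the example in the issue

        // A term present in 5% of a (hypothetical) one-million-document index.
        System.out.printf("idf at 5%% docFreq: %.3f%n", idf(50_000, 1_000_000));

        double without = equivalentIdf(weight, k1, false);
        double with = equivalentIdf(weight, k1, true);
        System.out.printf("without (k1+1): idf %.3f, docFreq fraction ~ %.3f%n",
                without, approxDocFreqFraction(without));
        System.out.printf("with (k1+1):    idf %.3f, docFreq fraction ~ %.3f%n",
                with, approxDocFreqFraction(with));
    }
}
```

Under these assumptions, with the factor removed a feature weight of 3 lines up with a term found in about 5% of documents. With the factor kept, the same weight lines up with a much more common term, found in roughly a quarter of documents, which is the less intuitive comparison the description refers to.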