[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686877#comment-16686877 ]
Michael Gibney commented on LUCENE-8563:
----------------------------------------
"assuming a single similarity" -- is this something that we want to assume? If
every field's similarity uses the same k1 param, then sure, relative ordering
among fields is maintained. But if we're using these scores outside of a
single-similarity context, and intend to preserve the ability to adjust the
k1 param, it's worth noting that this change fundamentally alters the effect of
the k1 param on absolute scores (and thus also on relative scores across
similarities).
Namely, removing k1 from the numerator places a hard cap of boost * IDF on the
score, regardless of TF or k1 setting. The concept of saturation is preserved,
but with no numerator k1, increasing k1 can only depress scores (relative to
the hard cap, by varying amounts according to TF). The model with k1 in the
numerator strikes me as being more flexible: as k1 increases, it depresses
scores for lower TF _and increases_ scores for higher TF, around a crossover
point determined by length norms and the value of b.
I'm sure this change would be appropriate for some scenarios, but it's a
fundamental change that could in some cases have significant downstream
consequences, with no easy way (as far as I can tell) to preserve the existing
behavior.
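To make the difference concrete, here is a minimal sketch of the two
saturation forms. It is not Lucene's BM25Similarity code: the class and method
names are hypothetical, boost is left out, and it assumes the textbook form of
the norm, norm = k1 * (1 - b + b * dl / avgdl). Under that assumption the
crossover for the (k1+1) form sits where tf equals the length factor
(1 - b + b * dl / avgdl).

```java
/** Hypothetical sketch (not Lucene code) comparing the two BM25 saturation forms. */
final class Bm25K1Sketch {

    /** Length factor 1 - b + b * dl / avgdl (assumed textbook normalization). */
    public static double lengthFactor(double b, double docLen, double avgDocLen) {
        return 1.0 - b + b * docLen / avgDocLen;
    }

    /** Form with (k1 + 1) in the numerator, as quoted in the issue description. */
    public static double withK1(double idf, double tf, double k1, double lengthFactor) {
        return idf * (k1 + 1.0) * tf / (tf + k1 * lengthFactor);
    }

    /** Proposed form without (k1 + 1); always stays below idf. */
    public static double withoutK1(double idf, double tf, double k1, double lengthFactor) {
        return idf * tf / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double idf = 1.0;
        double b = 0.75;
        // Document of average length, so the length factor is 1.
        double len = lengthFactor(b, 100, 100);
        double[] k1Values = {0.5, 1.2, 2.0};
        double[] tfValues = {1, 2, 5, 50};
        for (double tf : tfValues) {
            for (double k1 : k1Values) {
                System.out.printf("tf=%.0f k1=%.1f with(k1+1)=%.4f without=%.4f%n",
                        tf, k1, withK1(idf, tf, k1, len), withoutK1(idf, tf, k1, len));
            }
        }
    }
}
```

With the (k1+1) form, raising k1 raises the score when tf is above the length
factor and lowers it when tf is below; without it, raising k1 only ever lowers
the score, and the score never reaches IDF.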
> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted, and I found that the paper "The Probabilistic
> Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%)
> rather than a term whose IDF is 3/(k1 + 1).
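
The arithmetic behind the "IDF is 3 (and thus docFreq ~= 5%)" remark in the
description can be checked with a short sketch. The class and method names are
hypothetical; the IDF formula is the one BM25Similarity uses,
ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), and the inverse mapping
is an approximation that ignores the 0.5 terms. k1 = 1.2 is the BM25Similarity
default.

```java
/** Hypothetical sketch relating BM25 IDF values to document-frequency fractions. */
final class Bm25IdfSketch {

    /** BM25 IDF: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Approximate docFreq / docCount for a given IDF, ignoring the 0.5 terms. */
    public static double approxDocFreqFraction(double idf) {
        return Math.exp(-idf);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // BM25Similarity default
        // A term present in 5% of one million documents.
        System.out.printf("idf at 5%% docFreq: %.4f%n", idf(50_000, 1_000_000));
        // Fraction of documents matching a term whose IDF is 3.
        System.out.printf("docFreq fraction for idf=3: %.4f%n", approxDocFreqFraction(3.0));
        // With (k1 + 1) in the numerator, a weight of 3 matches a smaller IDF.
        double scaled = 3.0 / (k1 + 1.0);
        System.out.printf("idf=3/(k1+1)=%.4f -> docFreq fraction %.4f%n",
                scaled, approxDocFreqFraction(scaled));
    }
}
```

Under these assumptions, an IDF of 3/(k1 + 1) with the default k1 corresponds
to a far more common term than an IDF of 3, which is why the weights are
harder to compare while (k1+1) stays in the numerator.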