[jira] [Comment Edited] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

Elizabeth Haubert (JIRA) Mon, 12 Nov 2018 08:01:51 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683976#comment-16683976
 ]


Elizabeth Haubert edited comment on LUCENE-8563 at 11/12/18 4:00 PM:
---------------------------------------------------------------------

The boost*IDF factor is not particularly important here; this is about the handling of the 
TF component relative to the norms. 

Pull that out as 
{code:java}
(tf + k1*tf) / (tf + k1*length_norms)
{code}

Removing it only from the numerator produces 
{code:java}
tf / (tf + k1*length_norms)
{code}

At a minimum, that will need a new empirical default for k1. 
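
As a rough numeric check of the two forms above, the sketch below evaluates both for a few term frequencies. It assumes that length_norms stands for the usual BM25 factor 1 - b + b * docLen / avgDocLen; the class name, method names and sample values are illustrative and are not taken from BM25Similarity.

```java
class TfComponentSketch {
    // Assumed meaning of length_norms: 1 - b + b * docLen / avgDocLen.
    static double lengthNorm(double b, double docLen, double avgDocLen) {
        return 1 - b + b * docLen / avgDocLen;
    }

    // Form with (k1 + 1): (tf + k1*tf) / (tf + k1*length_norms)
    static double withK1Plus1(double tf, double k1, double norm) {
        return (tf + k1 * tf) / (tf + k1 * norm);
    }

    // Form without (k1 + 1): tf / (tf + k1*length_norms)
    static double withoutK1Plus1(double tf, double k1, double norm) {
        return tf / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;   // sample value
        double b = 0.75;   // sample value
        double norm = lengthNorm(b, 50, 100); // a document half the average length
        for (double tf : new double[] {1, 2, 5, 10}) {
            System.out.printf("tf=%.0f with=%.4f without=%.4f%n",
                tf, withK1Plus1(tf, k1, norm), withoutK1Plus1(tf, k1, norm));
        }
    }
}
```

For a fixed k1 the two values differ by the factor (k1 + 1), so the absolute scale of this component depends on which form is used and on the chosen k1.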

The k1 in the numerator is the knob for adjusting the ratio of tf to norms. 
When document length does not follow the standard models, it can be 
helpful to damp down b. This is not the standard use case, but it is not unusual 
either. At the extreme, when b=0, this component reduces to 
{code:java}
(tf * (k1 + 1)) / (tf + k1)
{code}

Removing only the (k1 + 1) factor from the numerator produces 
{code:java}
tf / (tf + k1)
{code}
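
To make the difference between these two b=0 forms concrete, here is a small sketch that evaluates both. The class and method names are illustrative, and k1 = 1.2 is used only as a sample value.

```java
class SaturationSketch {
    // b = 0 form with the (k1 + 1) factor: tf * (k1 + 1) / (tf + k1)
    static double saturationWithK1Plus1(double tf, double k1) {
        return tf * (k1 + 1) / (tf + k1);
    }

    // b = 0 form without the factor: tf / (tf + k1)
    static double saturation(double tf, double k1) {
        return tf / (tf + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // sample value
        for (double tf : new double[] {1, 10, 1000}) {
            System.out.printf("tf=%.0f with=%.4f without=%.4f%n",
                tf, saturationWithK1Plus1(tf, k1), saturation(tf, k1));
        }
    }
}
```

With the (k1 + 1) factor the component equals 1 at tf = 1 and approaches k1 + 1 as tf grows; without it, the component equals 1 / (1 + k1) at tf = 1 and approaches 1.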

There will be cases where this affects relative scoring and ranking, and I 
don't understand the statement that it doesn't modify ordering.
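
One situation where the factor can change ordering is when the TF component is summed with a contribution that is not scaled by (k1 + 1), such as the feature-based scores mentioned in the issue description. The sketch below is a hypothetical illustration with made-up numbers, not Lucene code: it uses the b=0 form and an assumed additive feature score, with boost and IDF taken as 1.

```java
class OrderingSketch {
    // b = 0 TF component, optionally multiplied by (k1 + 1).
    static double tfPart(double tf, double k1, boolean includeK1Plus1) {
        double s = tf / (tf + k1);
        return includeK1Plus1 ? s * (k1 + 1) : s;
    }

    // Hypothetical combined score: TF component plus an additive feature score.
    static double score(double tf, double feature, double k1, boolean includeK1Plus1) {
        return tfPart(tf, k1, includeK1Plus1) + feature;
    }

    public static void main(String[] args) {
        double k1 = 1.2; // sample value
        // Document A: tf = 1, feature score 0.25. Document B: tf = 2, no feature score.
        for (boolean include : new boolean[] {true, false}) {
            double a = score(1, 0.25, k1, include);
            double b = score(2, 0.0, k1, include);
            System.out.printf("includeK1Plus1=%b A=%.4f B=%.4f first=%s%n",
                include, a, b, a > b ? "A" : "B");
        }
    }
}
```

In this made-up case B ranks first with the factor and A ranks first without it. When every clause is scaled by the same (k1 + 1), the relative order of the summed scores is unchanged.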

If there is a need to remove it in the normal case, then perhaps the numerator 
and denominator should be split into two distinct constants.
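
A minimal sketch of what two distinct constants might look like follows. The parameter names k1Num and k1Den are hypothetical and do not exist in BM25Similarity; lengthNorm is assumed to be 1 - b + b * docLen / avgDocLen.

```java
class SplitConstantsSketch {
    // Hypothetical TF component with separate constants for numerator and denominator:
    // tf * (k1Num + 1) / (tf + k1Den * lengthNorm)
    static double tfComponent(double tf, double k1Num, double k1Den, double lengthNorm) {
        return tf * (k1Num + 1) / (tf + k1Den * lengthNorm);
    }

    public static void main(String[] args) {
        double tf = 2;
        double lengthNorm = 1.0; // b = 0, or a document of average length
        // k1Num == k1Den gives the form with (k1 + 1) in the numerator.
        System.out.println(tfComponent(tf, 1.2, 1.2, lengthNorm));
        // k1Num == 0 gives the form with (k1 + 1) removed from the numerator.
        System.out.println(tfComponent(tf, 0.0, 1.2, lengthNorm));
    }
}
```

Under this assumption the numerator constant only rescales the component, while the denominator constant keeps controlling how tf is weighed against the length norm.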








> Remove k1+1 from the numerator of  BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the paper "The Probabilistic 
> Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).


