Robert Muir created LUCENE-8020:
-----------------------------------

             Summary: Don't force sim to score bogus terms (e.g. docfreq=0)
                 Key: LUCENE-8020
                 URL: https://issues.apache.org/jira/browse/LUCENE-8020
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Robert Muir


Today all sim formulas have to be "hacked" to deal with the fact that they may 
be passed stats such as docFreq=0, totalTermFreq=0. This happens easily with 
spans and there is even a dedicated test for it. All formulas have hacks such 
as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:

Instead of:
{code}
expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
{code}
they must do tricks such as:
{code}
expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
{code}
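
To make the failure mode concrete, here is a minimal standalone sketch of the two variants above. It does not use Lucene classes: the names ExpectedFreqDemo, expected and expectedSmoothed are hypothetical, and the statistics are assumed to be plain long values with a double document length.

```java
final class ExpectedFreqDemo {
    // Unmodified formula: expected frequency of the term in a document of length docLen.
    static double expected(long totalTermFreq, long numberOfFieldTokens, double docLen) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    // The same formula with the +1 workaround applied to both statistics.
    static double expectedSmoothed(long totalTermFreq, long numberOfFieldTokens, double docLen) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    public static void main(String[] args) {
        // Bogus term in a non-empty field: the result is 0.0, so a later log() yields -Infinity.
        double zeroExpected = expected(0, 1000, 10.0);
        System.out.println(zeroExpected + " -> log = " + Math.log(zeroExpected));

        // Bogus term in an empty field: 0.0 / 0.0 is NaN.
        System.out.println(expected(0, 0, 10.0));

        // The workaround stays finite, but it no longer computes the original formula.
        System.out.println(expectedSmoothed(0, 0, 10.0));
    }
}
```

With real statistics (totalTermFreq > 0, and therefore numberOfFieldTokens > 0) the unmodified version is already well defined, so the +1 exists only to absorb the bogus inputs.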

There is no good reason for this; it is just sloppiness in the 
Query/Weight/Scorer api. I think formulas should work unmodified: we shouldn't 
pass terms that don't exist or bogus statistics.
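
As a rough illustration of the direction proposed here, the check could live on the caller's side, so that the formula only ever sees terms that occur. This is a sketch under assumptions, not the Lucene API: TermGuardDemo, TermStats and scoreIfPresent are hypothetical names, and the actual fix may look quite different.

```java
import java.util.OptionalDouble;

final class TermGuardDemo {
    /** Hypothetical holder for per-term statistics; not a Lucene class. */
    static final class TermStats {
        final long docFreq;
        final long totalTermFreq;

        TermStats(long docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Caller-side guard: a term that does not occur is never handed to the formula.
    static OptionalDouble scoreIfPresent(TermStats term, long numberOfFieldTokens, double docLen) {
        if (term.docFreq <= 0 || term.totalTermFreq <= 0) {
            return OptionalDouble.empty(); // nothing to score
        }
        // Unmodified formula; a term that occurs implies numberOfFieldTokens > 0.
        return OptionalDouble.of(term.totalTermFreq * docLen / numberOfFieldTokens);
    }

    public static void main(String[] args) {
        System.out.println(scoreIfPresent(new TermStats(0, 0), 0, 10.0).isPresent());
        System.out.println(scoreIfPresent(new TermStats(2, 5), 100, 10.0).getAsDouble());
    }
}
```

The point of the design is that the sim no longer needs defensive arithmetic: the invariant "every scored term exists" is enforced once, before any formula runs.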

It adds a lot of complexity to the scoring api and makes it difficult to have 
meaningful/useful explanations, to debug problems, etc. It also makes it really 
hard to add a new sim.



