Robert Muir created LUCENE-8020:
-----------------------------------
Summary: Don't force sim to score bogus terms (e.g. docfreq=0)
Key: LUCENE-8020
URL: https://issues.apache.org/jira/browse/LUCENE-8020
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir
Today every sim formula has to be "hacked" to cope with being passed stats
such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there
is even a dedicated test for it. All formulas carry hacks like the one in
https://issues.apache.org/jira/browse/LUCENE-6818:
Instead of:
{code}
expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
{code}
they must do tricks such as:
{code}
expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
{code}
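To make the difference concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes: ExpectedFreqDemo and its two methods are hypothetical names, and the parameters simply stand in for the values returned by stats.getTotalTermFreq(), docLen and stats.getNumberOfFieldTokens() in the snippets above. It only shows how the two expressions behave for a real term, for a bogus term, and for a field with no tokens.

```java
public class ExpectedFreqDemo {

    // The formula as one would like to write it (hypothetical stand-in).
    static double expectedUnmodified(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        // long * double promotes to double, so a zero divisor gives NaN or
        // Infinity instead of throwing ArithmeticException.
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    // The "+1" trick that sims use today to survive bogus statistics.
    static double expectedSmoothed(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    public static void main(String[] args) {
        double docLen = 100.0;

        // A term that really occurs: 50 occurrences in a field of 10,000 tokens.
        System.out.println(expectedUnmodified(50, docLen, 10_000)); // 0.5
        System.out.println(expectedSmoothed(50, docLen, 10_000));   // slightly larger than 0.5

        // A bogus term (totalTermFreq=0): the unmodified value is 0.0, which
        // breaks any later step that divides by it or takes its logarithm.
        System.out.println(expectedUnmodified(0, docLen, 10_000));  // 0.0

        // No tokens at all: 0.0 / 0 is NaN, while the trick returns docLen.
        System.out.println(expectedUnmodified(0, docLen, 0));       // NaN
        System.out.println(expectedSmoothed(0, docLen, 0));         // 100.0
    }
}
```

Note that the trick does not only protect the degenerate cases: it also shifts the value for terms that do exist, so the formula being scored is no longer quite the one that was intended.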
There is no good reason for these tricks; they are just the result of
sloppiness in the Query/Weight/Scorer API. I think formulas should work
unmodified: we shouldn't pass terms that don't exist or bogus statistics.
The current behavior adds a lot of complexity to the scoring API and makes it
difficult to have meaningful/useful explanations, to debug problems, etc. It
also makes it really hard to add a new sim.
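
One way to picture the proposal is a guard on the caller's side, so that the sim never sees a term that does not exist. The following is only a sketch under that assumption and not a patch: SkipBogusTermsDemo, TermStats and expectedIfTermExists are hypothetical names, and TermStats is a plain holder rather than Lucene's stats class.

```java
import java.util.OptionalDouble;

public class SkipBogusTermsDemo {

    /** Hypothetical holder for the statistics discussed here; not a Lucene class. */
    static final class TermStats {
        final long docFreq;
        final long totalTermFreq;
        final long numberOfFieldTokens;

        TermStats(long docFreq, long totalTermFreq, long numberOfFieldTokens) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
        }
    }

    // The formula written without any "+1" trick.
    static double expected(TermStats stats, double docLen) {
        return stats.totalTermFreq * docLen / stats.numberOfFieldTokens;
    }

    // Caller-side guard: a term with docFreq=0 or totalTermFreq=0 is never
    // handed to the formula. For a term that does occur, the field holds at
    // least totalTermFreq > 0 tokens, so the division is safe.
    static OptionalDouble expectedIfTermExists(TermStats stats, double docLen) {
        if (stats.docFreq <= 0 || stats.totalTermFreq <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(expected(stats, docLen));
    }

    public static void main(String[] args) {
        TermStats real = new TermStats(5, 50, 10_000);
        TermStats bogus = new TermStats(0, 0, 10_000);

        System.out.println(expectedIfTermExists(real, 100.0));              // OptionalDouble[0.5]
        System.out.println(expectedIfTermExists(bogus, 100.0).isPresent()); // false
    }
}
```

With the check in one place, each formula can assume docFreq >= 1 and totalTermFreq >= 1, and an explanation can show the real statistics instead of smoothed ones.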
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)