[
https://issues.apache.org/jira/browse/LUCENE-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555263#comment-15555263
]
Michael McCandless commented on LUCENE-7480:
--------------------------------------------
I am not familiar with {{LMDirichletSimilarity}} in particular, but there are
two phases in general for a similarity.
Phase 1 is done up front by checking the term statistics across the entire
index, in {{LMSimilarity.fillBasicStats}}.
Phase 2 is done per-segment, which is the code you are pointing to in
{{BooleanWeight}}: when {{subScorer}} is {{null}} that means the requested term
(or sub-query) never appears at all in the current segment. But this is not
supposed to alter how scoring works, since Phase 1 should have computed stats
for all terms in the query.
Maybe you can make a test case showing that the score is incorrect in Lucene's
implementation vs the original formula?
> Wrong Formula in LMDirichletSimilarity
> --------------------------------------
>
> Key: LUCENE-7480
> URL: https://issues.apache.org/jira/browse/LUCENE-7480
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Shayan Tabrizi
>
> It seems that LMDirichletSimilarity only calculates "score" method if the
> term occurs in the document. Otherwise, in line 389 of BooleanWeight (Lucene
> 6.2.0) subScorer becomes null, and thus the clause is not added to the
> optional list in order to be scored.
> However, in the original formula of LM
> (http://www.stat.uchicago.edu/~lafferty/pdf/smooth-tois.pdf, formula 6), we
> have "n log a_d" (n is the number of query terms). Therefore, even for the
> query terms not present in the document a "log a_d" must be added to the
> final score.
> But the implementation of LMDirichletSimilarity adds "log a_d" to the score
> in the "score" method, and therefore it is only added to the final score for
> the query terms present in the document.
> This can worsen the retrieval results compared to the correct formula. I
> tried to correct this for myself but because of the plenty of "final" methods
> and classes, I was not successful. Please, check the problem and solve it if
> approved, and also please tell me how I can correct it before a new release
> is published.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]