[
https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858600#comment-16858600
]
Jim Ferenczi commented on LUCENE-8840:
--------------------------------------
{quote}
I am curious to understand how including doc frequencies can be better than the
overall score. IMO, including BM25 scores gives us some additional advantages,
such as defending against cases where the overall non matching token count in a
document is significantly high. Did you see any scenarios that had relevance
troubles due to inclusion of entire BM25 scores?
{quote}
The idea of the SynonymQuery is to score the terms as if they were indexed as a
single term. I think this fits nicely with the fuzzy query. For instance,
imagine a fuzzy query with the terms "bad" and "baz". With the current solution,
if a document contains both terms it will rank significantly higher than
documents that contain only one of them. This can change depending on the inner
doc frequencies but this doesn't seem right IMO. By contrast, the synonym
query would give a document containing "baz" with a frequency of 4 the same
score as a document containing "bad" and "baz" 2 times each. This feels more
natural to me because we shouldn't favor documents that contain multiple
variations of the same fuzzy term.
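
To make the "bad"/"baz" example concrete, here is a minimal Python sketch. It
is not Lucene code: it assumes a simplified BM25 term-frequency factor with no
length normalization and the same idf for both terms. The function names and
the values k1 = 1.2 and idf = 1.0 are illustrative assumptions.

```python
K1 = 1.2  # assumed BM25 saturation parameter (illustrative value)


def tf_factor(freq, k1=K1):
    """Simplified BM25 term-frequency factor, ignoring length normalization."""
    return freq * (k1 + 1) / (freq + k1)


def disjunction_score(freqs, idf=1.0):
    """Disjunction-style scoring: each matching term adds its own score."""
    return sum(idf * tf_factor(f) for f in freqs.values() if f > 0)


def synonym_score(freqs, idf=1.0):
    """Synonym-style scoring: sum the frequencies first, then score once."""
    total = sum(freqs.values())
    return idf * tf_factor(total) if total > 0 else 0.0


doc_a = {"baz": 4}            # one variation, 4 occurrences
doc_b = {"bad": 2, "baz": 2}  # two variations, 2 occurrences each

# Disjunction: doc_b scores higher because it adds two saturated scores.
print(disjunction_score(doc_a), disjunction_score(doc_b))
# Synonym-style: both documents have a combined frequency of 4.
print(synonym_score(doc_a), synonym_score(doc_b))
```

Under these assumptions the disjunction favors the document with two
variations, while the synonym-style score is identical for both documents.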
{quote}
On a different note, I am also wondering if we should devise relevance tests
which allow us to measure the relevance impact of a change. Something added to
luceneutil should be nice. Thoughts?
{quote}
That would be great, but this doesn't look like low-hanging fruit. Maybe open
a separate issue to discuss?
{quote}
IMO if we want to restrict the contribution of each term to the blended query's
final score, then we could think of a blended scorer step which utilizes
something along the lines of BM25's term frequency saturation when merging
scores from different blended terms. WDYT?
{quote}
I am not sure I fully understand, but the SynonymQuery kind of does that. It
sums the inner doc frequencies of all matching terms and scores them once, so
that the contribution of each term to the final score is bounded.
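
A rough Python sketch of why summing frequencies bounds the score, again under
assumptions: a simplified BM25 term-frequency factor without length
normalization, idf fixed at 1, and k1 = 1.2. The helper names are hypothetical
and this is not how Lucene implements it.

```python
K1 = 1.2  # assumed BM25 saturation parameter (illustrative value)
UPPER_BOUND = K1 + 1  # limit of tf_factor as the frequency grows


def tf_factor(freq, k1=K1):
    """Simplified BM25 term-frequency factor; saturates below k1 + 1."""
    return freq * (k1 + 1) / (freq + k1)


def summed_scores(term_freqs):
    """Score every matching term separately and add the scores."""
    return sum(tf_factor(f) for f in term_freqs)


def summed_freqs(term_freqs):
    """Add the frequencies of all matching terms, then score once."""
    return tf_factor(sum(term_freqs))


# Each matching fuzzy variation occurs 3 times in the document.
for n_terms in (1, 2, 5, 10):
    freqs = [3] * n_terms
    print(n_terms, round(summed_scores(freqs), 3), round(summed_freqs(freqs), 3))
```

With summed scores the total keeps growing with the number of matching
variations, while with summed frequencies it stays below k1 + 1 in this
simplified model.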
> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> ---------------------------------------------------------
>
> Key: LUCENE-8840
> URL: https://issues.apache.org/jira/browse/LUCENE-8840
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Major
> Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that
> match the fuzzy terms. This query blends the frequencies used for scoring
> across the terms and creates a disjunction of all the blended terms. This
> means that each fuzzy term that matches in a document will add its BM25 score
> contribution. We already have a query that can blend the statistics of
> multiple terms in a single scorer that sums the doc frequencies rather than
> the entire BM25 score: the SynonymQuery. Since
> https://issues.apache.org/jira/browse/LUCENE-8652 this query also handles
> boosts between 0 and 1, so it should be easy to change the default rewrite
> method for Fuzzy queries to use it instead of the BlendedTermQuery. This
> would bound the contribution of each term to the final score which seems a
> better alternative in terms of relevancy than the current solution.