odelmarcelle opened a new issue, #13152:
URL: https://github.com/apache/lucene/issues/13152

   ### Description
   
   I've been using Lucene (through OpenSearch) for querying and scoring 
human-written documents (not logs). I often use sloppy phrase queries to handle 
language variations that express a similar meaning.
   
   For example, one of my queries contained `"frauduleus faillissement"~3` 
(Dutch for "fraudulent bankruptcy"). Thanks to the sloppy match, the following 
two sentences are both correctly matched by that query:
   1. Dit faillissement is frauduleus (This bankruptcy is fraudulent)
   2. Dit is een frauduleus faillissement (This is a fraudulent bankruptcy)
   
   While the two sentences are very similar in meaning, they are scored very 
differently by the `SloppyPhraseScorer`. According to the lines linked below, 
the two sentences get a frequency of `0.25` and `1` respectively, leading to 
a large difference in the resulting relevance score. 
   
https://github.com/apache/lucene/blob/3ce9ba9fd51a9b4e7228d81e19acbdb8b18f4e12/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java#L166-L169
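
   To make the two numbers concrete, here is a minimal sketch of the 
arithmetic. It assumes the weight has the form `1 / (1 + matchLength)`, which 
is consistent with the `0.25` and `1` values above, and it assumes that no 
stop words are removed, so every token keeps its position. The class 
`SloppyWeightSketch` and the helper `matchLength` are hypothetical names used 
only for illustration; they are not Lucene's API.

```java
public class SloppyWeightSketch {

    // Assumed form of the sloppy weight: 1 / (1 + matchLength).
    static float sloppyWeight(int matchLength) {
        return 1f / (1f + matchLength);
    }

    // Hypothetical helper: shift each term's document position by its
    // offset in the query, then take the width of the resulting window.
    // An exact phrase match yields a width of 0.
    static int matchLength(int[] docPositions, int[] queryOffsets) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int shifted = docPositions[i] - queryOffsets[i];
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // Query "frauduleus faillissement": frauduleus at offset 0,
        // faillissement at offset 1.
        int[] queryOffsets = {0, 1};

        // Sentence 1: Dit(0) faillissement(1) is(2) frauduleus(3)
        int len1 = matchLength(new int[] {3, 1}, queryOffsets);

        // Sentence 2: Dit(0) is(1) een(2) frauduleus(3) faillissement(4)
        int len2 = matchLength(new int[] {3, 4}, queryOffsets);

        System.out.println("sentence 1: " + sloppyWeight(len1));
        System.out.println("sentence 2: " + sloppyWeight(len2));
    }
}
```

   Under these assumptions, sentence 1 has a match length of 3 and sentence 2 
a match length of 0, so swapping the two terms around a single extra word is 
enough to divide the phrase frequency by four.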
 
   
   While I understand the intent of `sloppyWeight()` to penalize sloppy matches 
that are different from the exact match, I feel that the penalty is way too 
strong. As a user, I deliberately make the choice to look for sloppy phrase 
matches and I wouldn't expect that such a strong penalty would apply. For that 
particular example, I would obtain a more accurate scoring by using multiple 
exact phrase queries.
   
   Browsing the history of `SloppyPhraseScorer`, I see that this scoring 
approach has been in place for years and I did not find any issue discussing 
that implementation. Even though my use case might be niche, I believe a 
revision of that scoring method could greatly benefit applications on 
human-written texts.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

