odelmarcelle opened a new issue, #13152: URL: https://github.com/apache/lucene/issues/13152
### Description

I've been using Lucene (through OpenSearch) for querying and scoring human-written documents (as opposed to logs). I often use sloppy phrase queries to handle language variations that express a similar meaning. For example, one of my queries contained `"frauduleus faillissement"~3` (Dutch for "fraudulent bankruptcy").

Thanks to the sloppy match, the following two sentences are both correctly matched by that query:

1. Dit faillissement is frauduleus (This bankruptcy is fraudulent)
2. Dit is een frauduleus faillissement (This is a fraudulent bankruptcy)

While the two sentences are very similar in meaning, they are scored very differently by the `SloppyPhraseScorer`. According to the following lines, the sentences get a frequency of `0.25` and `1` respectively, leading to a large difference in the resulting relevance score.

https://github.com/apache/lucene/blob/3ce9ba9fd51a9b4e7228d81e19acbdb8b18f4e12/lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java#L166-L169

While I understand that the intent of `sloppyWeight()` is to penalize sloppy matches that differ from the exact match, I feel that the penalty is far too strong. As a user, I deliberately choose to look for sloppy phrase matches, and I wouldn't expect such a strong penalty to apply. For this particular example, I would obtain more accurate scoring by using multiple exact phrase queries.

Browsing the history of `SloppyPhraseScorer`, I see that this scoring approach has been in place for years, and I did not find any issue discussing that implementation. Even though my use case might be niche, I believe a revision of that scoring method could greatly benefit applications on human-written texts.
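To make the `0.25` versus `1` figures concrete, here is a minimal sketch of how they could arise. It assumes the weight is `1 / (1 + matchLength)`, where `matchLength` is the spread between term positions after subtracting each term's offset in the query, and it assumes no stop-word removal. The class `SloppyWeightSketch` and its `matchLength` helper are hypothetical names for illustration only; the real `SloppyPhraseMatcher` also handles repeated terms and multi-term phrases, which this sketch ignores.

```java
public class SloppyWeightSketch {

    // Assumed formula (my reading of the linked lines): 1 / (1 + matchLength).
    static float sloppyWeight(int matchLength) {
        return 1f / (1f + matchLength);
    }

    // Hypothetical helper: spread of "phrase positions", where each phrase
    // position is the term's document position minus its offset in the query.
    static int matchLength(int[] docPositions, int[] queryOffsets) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int phrasePos = docPositions[i] - queryOffsets[i];
            min = Math.min(min, phrasePos);
            max = Math.max(max, phrasePos);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // Query "frauduleus faillissement": frauduleus at offset 0, faillissement at offset 1.
        int[] queryOffsets = {0, 1};

        // "Dit faillissement is frauduleus": frauduleus at position 3, faillissement at 1.
        int reversed = matchLength(new int[] {3, 1}, queryOffsets);

        // "Dit is een frauduleus faillissement": frauduleus at position 3, faillissement at 4.
        int exact = matchLength(new int[] {3, 4}, queryOffsets);

        System.out.println(reversed + " -> " + sloppyWeight(reversed)); // 3 -> 0.25
        System.out.println(exact + " -> " + sloppyWeight(exact));       // 0 -> 1.0
    }
}
```

Under these assumptions, the reversed word order in sentence 1 costs a match length of 3, so its frequency contribution is a quarter of that of the in-order match in sentence 2.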
