[
https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892285#action_12892285
]
Mark Harwood commented on LUCENE-2557:
--------------------------------------
I think we're agreed that the effects of IDF are troublesome when ranking
variant term matches, but I question whether the default solution should be to
remove IDF from the equation completely.
Doing that reminds me of the time my mother thought the shadow in a photograph
was annoying and cut it out with a pair of scissors leaving a big hole in its
place.
What we're proposing here instead is the equivalent of some "photoshopping" to
retain some of the original information but suitably blurred to provide a more
natural balance to the overall picture.
Some degree of IDF can be usefully retained from a FuzzyQuery in order to
achieve balance with all the other (potentially non-fuzzy) optional clauses
that may exist in a BooleanQuery.
The proposal is that the most natural blending of IDF scores within a
FuzzyQuery is to use only the IDF of the input term (which defines the user's
original intent) and use this to score a match on any suggested variant. If
the input term does not exist in the index, the average IDF of all variants is
used as the next best alternative for scoring each variant.
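As a rough illustration of that rule, here is a minimal standalone sketch. It
is not the attached patch and does not use Lucene classes: FuzzyIdfBlend,
idf() and blendedIdf() are hypothetical names, the map of variant terms to
document frequencies stands in for whatever the fuzzy expansion produces, and
the idf formula (1 + ln(numDocs / (docFreq + 1))) is assumed to match the
classic default similarity.

```java
import java.util.Map;

/** Hypothetical helper (not part of Lucene) sketching the proposed IDF blending. */
final class FuzzyIdfBlend {

    /** Assumed classic idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /**
     * Returns the single IDF value applied to every variant of a fuzzy expansion.
     *
     * @param inputTerm       the term the user typed
     * @param variantDocFreqs expanded terms mapped to their document frequencies
     * @param numDocs         number of documents in the index
     */
    static double blendedIdf(String inputTerm,
                             Map<String, Integer> variantDocFreqs,
                             int numDocs) {
        // Preferred case: the input term exists, so its IDF scores all variants.
        Integer inputDocFreq = variantDocFreqs.get(inputTerm);
        if (inputDocFreq != null && inputDocFreq > 0) {
            return idf(inputDocFreq, numDocs);
        }

        // Fallback: the input term is absent, so average the variants' IDFs.
        double sum = 0.0;
        int count = 0;
        for (int docFreq : variantDocFreqs.values()) {
            if (docFreq > 0) {
                sum += idf(docFreq, numDocs);
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

With "smith" in 1000 of 10000 documents and "smiith" in just one, both
variants would be scored with the IDF of "smith", so the rare misspelling no
longer gets a boost from its own, much higher IDF.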
This approach has exactly the same ranking effect as the existing "remove IDF"
policy within a single FuzzyQuery but has the added advantage of sitting better
with the other optional clauses that may exist in a containing query.
The question over core vs contrib comes down to what is considered the more
natural/expected behaviour.
> FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
> ------------------------------------------------------------------------------
>
> Key: LUCENE-2557
> URL: https://issues.apache.org/jira/browse/LUCENE-2557
> Project: Lucene - Java
> Issue Type: Bug
> Components: Query/Scoring
> Affects Versions: 3.0.2
> Reporter: Jingkei Ly
> Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch
>
>
> The FuzzyQuery often causes misspellings to be ranked higher than the exact
> match, which seems to be an undesirable property generally.
> For example, in an index of surnames, if I search using a FuzzyQuery for
> "smith", the misspellings such as "smiith", or "smiht" would appear near the
> top of the search results ahead of documents that match "smith".