[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Mark Harwood (JIRA) Mon, 26 Jul 2010 05:25:22 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892285#action_12892285
 ]


Mark Harwood commented on LUCENE-2557:
--------------------------------------

I think we're agreed that the effects of IDF are troublesome when ranking 
variant term matches, but I question whether the default solution should be 
to remove IDF from the equation completely.

Doing that reminds me of the time my mother thought the shadow in a photograph 
was annoying and cut it out with a pair of scissors leaving a big hole in its 
place.
What we're proposing here instead is the equivalent of some "photoshopping" to 
retain some of the original information but suitably blurred to provide a more 
natural balance to the overall picture.

Some degree of IDF can be usefully retained from a FuzzyQuery in order to 
achieve balance with all the other (potentially non-fuzzy) optional clauses 
that may exist in a BooleanQuery. 
The proposal is that the most natural blending of IDF scores within a 
FuzzyQuery is to use only the IDF of the input term (which defines the user's 
original intent) and use this to score a match on any suggested variant. If 
the input term does not exist in the index, the average IDF of all variants is 
used as the next best alternative for scoring each variant.

This approach has exactly the same ranking effect as the existing "remove IDF" 
policy within a single FuzzyQuery but has the added advantage of sitting better 
with the other optional clauses that may exist in a containing query.
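
As a rough illustration of the blending rule described above (not the code in 
the attached patch), the choice of a single shared IDF could be sketched as 
follows. The class FuzzyIdfSketch, the method sharedIdf and the map of variant 
document frequencies are hypothetical names invented for this example; the 
idf() formula is assumed to be the classic 1 + ln(numDocs / (docFreq + 1)) 
used by DefaultSimilarity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: pick one IDF value to score every variant of a fuzzy term.
final class FuzzyIdfSketch {

    // Assumed IDF formula (classic DefaultSimilarity): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // variantDocFreqs maps each expanded term (possibly including the input
    // term itself) to its document frequency in the index.
    static double sharedIdf(String inputTerm, Map<String, Integer> variantDocFreqs, int numDocs) {
        Integer inputDf = variantDocFreqs.get(inputTerm);
        if (inputDf != null && inputDf > 0) {
            // The input term exists in the index: its IDF stands in for every variant.
            return idf(inputDf, numDocs);
        }
        // Otherwise fall back to the average IDF of all variants that do exist.
        double sum = 0.0;
        int count = 0;
        for (int df : variantDocFreqs.values()) {
            if (df > 0) {
                sum += idf(df, numDocs);
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for the "smith" example in the issue.
        Map<String, Integer> dfs = new HashMap<String, Integer>();
        dfs.put("smith", 999);
        dfs.put("smiith", 1);
        dfs.put("smiht", 1);
        System.out.println("shared IDF for 'smith': " + sharedIdf("smith", dfs, 1000));
        System.out.println("shared IDF for 'smyth': " + sharedIdf("smyth", dfs, 1000));
    }
}
```

Because every variant receives the same IDF, rare misspellings gain no 
advantage over the exact term inside the FuzzyQuery, while the clause as a 
whole still carries a rarity weight that can be compared with the other 
clauses of a containing BooleanQuery.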

The question over core vs contrib comes down to what is considered the more 
natural/expected behaviour. 


> FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2557
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Query/Scoring
>    Affects Versions: 3.0.2
>            Reporter: Jingkei Ly
>         Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch
>
>
> The FuzzyQuery often causes misspellings to be ranked higher than the exact 
> match, which seems to be an undesirable property generally. 
> For example, in an index of surnames, if I search using a FuzzyQuery for 
> "smith", the misspellings such as "smiith", or "smiht" would appear near the 
> top of the search results ahead of documents that match "smith".


