[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Mark Harwood (JIRA) Mon, 26 Jul 2010 06:58:28 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892311#action_12892311
 ]


Mark Harwood commented on LUCENE-2557:
--------------------------------------

bq. I dont understand why we need to average any idfs? this seems really costly 

IDF lookups and averaging should only be calculated for the top "n" terms 
that finally make it into the query, "top" in this case being decided by some 
edit distance threshold or synonymity measure. All the doc frequency info 
required for IDF is available in RAM on the TermEnum, which is iterated across 
anyway, and so shouldn't incur any extra disk seeks. So given a query that 
expands to 1,000 terms, the cost of computing the average IDF for that set of 
terms is surely lost in the cost of 1,000 disk seeks on the TermDocs as part 
of query evaluation? I need to review the code to remind myself of how it is 
processed, but it feels like it should be cheap.
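
As a rough illustration of why the averaging should be cheap, here is a 
minimal sketch that works only on doc frequencies already held in memory. The 
class and method names are hypothetical and not part of Lucene, the doc 
frequencies are made up, and the idf formula is assumed to have the classic 
1 + ln(numDocs / (docFreq + 1)) shape used by DefaultSimilarity.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class AverageIdfSketch {

    // Assumed idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Averages the idf of the terms that made it into the expanded query.
    // Only in-memory doc frequencies are touched; no postings are read.
    static double averageIdf(int[] docFreqs, int numDocs) {
        if (docFreqs.length == 0) {
            throw new IllegalArgumentException("no expanded terms to average");
        }
        double sum = 0.0;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / docFreqs.length;
    }

    public static void main(String[] args) {
        int numDocs = 1000000;
        // Made-up doc frequencies for "smith", "smiith" and "smiht".
        int[] docFreqs = {12000, 3, 1};
        System.out.println("average idf = " + averageIdf(docFreqs, numDocs));
    }
}
```

The loop is one logarithm per expanded term, so for 1,000 terms it is a 
thousand floating point operations set against a thousand TermDocs seeks.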

bq. average docfreq across all 50 terms even, maybe the top-5 or so is 
sufficient.

That could work. The IDF score simply has to be a value that is used as a 
constant for all the expanded terms in a fuzzy query and, as an added bonus, 
represents a value that can be usefully contrasted with other query clauses.  
The averaging policy is just a fall-back position for the rarer situations 
when a user's original input term has no associated IDF value we can use. If 
this policy is a performance concern then we could reduce the number of terms 
as you suggest, or just ignore IDF entirely in this case, but I'm not sure the 
averaging costs represent any kind of real performance concern given the IO 
costs of accessing TermDocs.
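
The policy described above could be sketched roughly as follows. This is an 
assumption-laden illustration rather than the code in the attached patch: the 
class, method and parameter names are hypothetical, the idf formula is assumed 
to be the classic 1 + ln(numDocs / (docFreq + 1)), and returning 1.0 as a 
neutral factor is just one possible way to "ignore IDF entirely".

```java
/** Hypothetical policy sketch for illustration only; not part of Lucene. */
final class FuzzyIdfPolicySketch {

    // Assumed idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Picks the single idf value shared by every expanded term.
     *
     * @param originalDocFreq   doc freq of the user's input term, 0 if absent
     * @param expandedDocFreqs  doc freqs of expanded terms, best match first
     * @param maxTermsToAverage cap on terms averaged (for example 5); 0 skips idf
     * @param numDocs           number of documents in the index
     */
    static double sharedIdf(int originalDocFreq, int[] expandedDocFreqs,
                            int maxTermsToAverage, int numDocs) {
        // Normal case: the input term exists, so its own idf is used.
        if (originalDocFreq > 0) {
            return idf(originalDocFreq, numDocs);
        }
        // Fall-back: average the idf of the first few expanded terms.
        int n = Math.min(maxTermsToAverage, expandedDocFreqs.length);
        if (n <= 0) {
            return 1.0; // assumed neutral factor, i.e. idf is ignored
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += idf(expandedDocFreqs[i], numDocs);
        }
        return sum / n;
    }
}
```

Because every expanded term receives the same value, a rare misspelling can no 
longer outscore the exact match on IDF alone, and the cap keeps the fall-back 
cost bounded regardless of how many terms the query expands to.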

> FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2557
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Query/Scoring
>    Affects Versions: 3.0.2
>            Reporter: Jingkei Ly
>         Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch
>
>
> The FuzzyQuery often causes misspellings to be ranked higher than the exact 
> match, which seems to be an undesirable property generally. 
> For example, in an index of surnames, if I search using a FuzzyQuery for 
> "smith", the misspellings such as "smiith", or "smiht" would appear near the 
> top of the search results ahead of documents that match "smith".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


