[ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892311#action_12892311 ]
Mark Harwood commented on LUCENE-2557:
--------------------------------------
bq. I don't understand why we need to average any idfs? This seems really costly.
IDF lookups and averaging should only be calculated for the top "n" terms
that finally make it into the query, where "top" is decided by some edit
distance threshold or synonymity measure. All the doc frequency info required
for IDF is available in RAM on the TermEnum, which is iterated across anyway,
so it shouldn't incur any extra disk seeks. So, given a query that expands to
1,000 terms, the cost of computing the average IDF for that set of terms is
surely lost in the cost of the 1,000 disk seeks on TermDocs made as part of
query evaluation? I need to review the code to remind myself of how it is
processed, but it feels like it should be cheap.
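
To make the cost argument concrete, here is a minimal sketch in plain Java
(not Lucene code) of keeping only the top "n" expanded terms while recording
each term's doc frequency in the same pass. TopTermsSketch, Candidate and topN
are hypothetical names, and the similarity score stands in for whatever edit
distance or synonymity measure is actually used. The only point is that the
enumeration is assumed to hand over the doc frequency already, so no second
lookup is needed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene code.
final class TopTermsSketch {

    /** What one step of a term enumeration is assumed to yield. */
    static final class Candidate {
        final String text;
        final float similarity; // assumed: higher means closer to the input term
        final int docFreq;      // assumed: read from the enumeration itself

        Candidate(String text, float similarity, int docFreq) {
            this.text = text;
            this.similarity = similarity;
            this.docFreq = docFreq;
        }
    }

    /** Keeps the n most similar candidates at or above minSimilarity, best first. */
    static List<Candidate> topN(Iterable<Candidate> enumerated, int n, float minSimilarity) {
        List<Candidate> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<Candidate> bySimilarity =
                (a, b) -> Float.compare(a.similarity, b.similarity);
        // Min-heap on similarity: the weakest kept candidate sits at the head.
        PriorityQueue<Candidate> heap = new PriorityQueue<>(bySimilarity);
        for (Candidate c : enumerated) {
            if (c.similarity < minSimilarity) {
                continue; // below the threshold, never enters the query
            }
            heap.add(c);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest so at most n remain
            }
        }
        result.addAll(heap);
        result.sort(bySimilarity.reversed());
        return result;
    }
}
```

Under these assumptions the extra work per enumerated term is one heap
operation, and the doc frequencies of the surviving terms are already in hand
when the IDF is computed.
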
bq. average docfreq across all 50 terms even, maybe the top-5 or so is
sufficient.
That could work. The IDF score simply has to be a value that is used as a
constant for all the expanded terms in a fuzzy query and, as an added bonus,
represents a value that can be usefully contrasted with other query clauses.
The averaging policy is just a fall-back position for the rarer situations
when a user's original input term has no associated IDF value we can use. If
this policy is a performance concern, then we could reduce the number of terms
as you suggest, or just ignore IDF entirely in this case, but I'm not sure the
averaging costs represent any real performance concern given the IO costs of
accessing TermDocs.
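
As a rough illustration of the policy described above, the sketch below picks
one IDF value to share across all expanded terms: the input term's own IDF
when it has a doc frequency, otherwise the average IDF of the first few
expanded terms. It is plain Java rather than the attached patch.
SharedIdfSketch, sharedIdf and their parameters are hypothetical names, the
idf formula is the classic log(numDocs / (docFreq + 1)) + 1 form, and
returning 1.0 when there is nothing to average is my assumption for what
"ignore IDF entirely" would mean.

```java
// Hypothetical sketch, not the code from LUCENE-2557.patch.
final class SharedIdfSketch {

    /** Classic idf form: log(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * Returns the single idf value to use for every expanded term.
     *
     * @param inputTermDocFreq doc frequency of the user's original term, 0 if absent
     * @param expandedDocFreqs doc frequencies of the expanded terms, best match first
     * @param topN             how many expanded terms to average over in the fall-back
     * @param numDocs          number of documents in the index
     */
    static double sharedIdf(int inputTermDocFreq, int[] expandedDocFreqs,
                            int topN, int numDocs) {
        if (inputTermDocFreq > 0) {
            // Normal case: the input term exists, so its idf is the constant.
            return idf(inputTermDocFreq, numDocs);
        }
        int n = Math.min(Math.max(topN, 0), expandedDocFreqs.length);
        if (n == 0) {
            return 1.0; // assumption: neutral value, i.e. idf is ignored
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += idf(expandedDocFreqs[i], numDocs);
        }
        return sum / n; // fall-back: average idf of the top n expanded terms
    }
}
```

Lowering topN (for example to 5) bounds the averaging work regardless of how
many terms the query expands to, which is the reduction suggested in the
quoted comment.
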
> FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
> ------------------------------------------------------------------------------
>
> Key: LUCENE-2557
> URL: https://issues.apache.org/jira/browse/LUCENE-2557
> Project: Lucene - Java
> Issue Type: Bug
> Components: Query/Scoring
> Affects Versions: 3.0.2
> Reporter: Jingkei Ly
> Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch
>
>
> The FuzzyQuery often causes misspellings to be ranked higher than the exact
> match, which seems to be an undesirable property generally.
> For example, in an index of surnames, if I search using a FuzzyQuery for
> "smith", misspellings such as "smiith" or "smiht" would appear near the
> top of the search results ahead of documents that match "smith".