[ https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2507:
--------------------------------
Attachment: LUCENE-2507.patch
We have sped up this seeking a lot recently, and I improved this patch some:
* Avoid calling docFreq on the suggestions, by using the TermsEnum's docFreq.
* Mike had the idea that we should actually try lower edit distances first. The
general use case here is a small number of suggestions (e.g. 1), so
we try edit distance=1 first, and only if this doesn't give enough
suggestions do we then try higher distances.
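
To make the lower-distance-first idea concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class name IncrementalDistanceSuggester, the in-memory terms/docFreqs arrays (standing in for the index's term dictionary), and the brute-force distance check (standing in for the automaton-based seeking) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: arrays stand in for the term dictionary, and a brute-force
// Levenshtein check stands in for the automaton-based term enumeration.
class IncrementalDistanceSuggester {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Collects up to numSug suggestions, trying edit distance 1 first and moving
     * to a higher distance only if there are not enough suggestions yet.
     * Exact matches (distance 0) are not returned as suggestions.
     */
    static List<String> suggest(String word, String[] terms, int[] docFreqs,
                                int numSug, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (int edits = 1; edits <= maxEdits && result.size() < numSug; edits++) {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < terms.length; i++) {
                if (levenshtein(word, terms[i]) == edits) {
                    hits.add(i);
                }
            }
            // Within one distance, prefer terms with a higher document frequency.
            hits.sort((x, y) -> Integer.compare(docFreqs[y], docFreqs[x]));
            for (int i : hits) {
                if (result.size() < numSug) {
                    result.add(terms[i]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "lucent", "lucerne", "luke", "search"};
        int[] docFreqs = {10, 3, 7, 5, 20};
        System.out.println(suggest("lucen", terms, docFreqs, 1, 2));
    }
}
```

With numSug=1 the loop stops after the distance-1 pass whenever that pass finds a match, so the higher distances are never examined in the common case.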
I think this is a good approach here, because we are computing Levenshtein
distance directly, so we don't have the problem the n-gram based spellchecker
has (its javadoc is quoted below for reference):
{noformat}
* <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
* is not the same as the edit distance strategy used to calculate the best
* matching spell-checked word from the hits that Lucene found, one usually has
* to retrieve a couple of numSug's in order to get the true best match.
*
* <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
* Thus, you should set this value to <b>at least</b> 5 for a good suggestion.
{noformat}
Since we are actually computing Levenshtein distance, you can safely use lower
values for numSug, such as numSug=1.
> automaton spellchecker
> ----------------------
>
> Key: LUCENE-2507
> URL: https://issues.apache.org/jira/browse/LUCENE-2507
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries
> this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an
> algorithm such as Levenshtein.
> Alternatively, we could just do a Levenshtein query directly against the
> index; then we wouldn't need a separate index to rebuild.