[jira] Updated: (LUCENE-2507) automaton spellchecker

Robert Muir (JIRA) Tue, 28 Sep 2010 10:45:58 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-2507:
--------------------------------

    Attachment: LUCENE-2507.patch

We have sped up this seeking a lot recently, and I improved this patch some:
* Avoid calling docFreq on the suggestions, by using the TermsEnum's docFreq.
* Mike had the idea that we should actually try lower edit distances first.
  The general use case here is a small number of suggestions (e.g. 1), so
  we actually try edit distance=1 first... only if this doesn't give enough
  suggestions do we then try higher distances.

I think this is a good approach here, because we are getting Levenshtein
distances directly, so we don't have the problem the n-gram based
spellchecker has (its Javadoc is quoted below for reference):

{noformat}
   * <p>As the Lucene similarity that is used to fetch the most relevant
   * n-grammed terms is not the same as the edit distance strategy used to
   * calculate the best matching spell-checked word from the hits that Lucene
   * found, one usually has to retrieve a couple of numSug's in order to get
   * the true best match.
   *
   * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
   * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.
{noformat}

Since we are actually doing Levenshtein, you can safely use lower values for
numSug, such as numSug=1.
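
To make the "lower edit distances first" idea concrete, here is a rough,
self-contained sketch. It is not the code in the patch: the class and method
names (LowestDistanceFirst, suggest) are made up for illustration, and a
brute-force scan over an in-memory map of term -> docFreq stands in for the
automaton-based term enumeration. The map entry carrying the frequency is
only meant to mirror reading docFreq from the enumeration instead of looking
it up again per suggestion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the LUCENE-2507 implementation.
final class LowestDistanceFirst {

    /** Plain dynamic-programming Levenshtein distance with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns up to numSug terms. Distance 1 is tried first; higher distances
     * are only examined if the lower ones did not yield enough suggestions.
     * Within one distance, terms are ordered by descending docFreq.
     */
    static List<String> suggest(Map<String, Integer> termsWithDocFreq,
                                String word, int numSug, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (int d = 1; d <= maxDistance && result.size() < numSug; d++) {
            List<Map.Entry<String, Integer>> atDistance = new ArrayList<>();
            for (Map.Entry<String, Integer> e : termsWithDocFreq.entrySet()) {
                if (levenshtein(word, e.getKey()) == d) {
                    atDistance.add(e);
                }
            }
            // Higher docFreq first; break ties by term for a stable order.
            atDistance.sort((x, y) -> {
                int c = Integer.compare(y.getValue(), x.getValue());
                return c != 0 ? c : x.getKey().compareTo(y.getKey());
            });
            for (Map.Entry<String, Integer> e : atDistance) {
                if (result.size() < numSug) {
                    result.add(e.getKey());
                }
            }
        }
        return result;
    }
}
```

With numSug=1 the loop stops after the distance=1 pass whenever that pass
finds a match, so the higher distances are never examined in the common case.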


> automaton spellchecker
> ----------------------
>
>                 Key: LUCENE-2507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2507
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries
> this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an
> algorithm such as Levenshtein.
> Alternatively, we could just do a levenshtein query directly against the
> index, then we wouldn't need a separate index to rebuild.
