Just a thought: edit distance is meant for overcoming spelling errors in
the form of assimilation or mistyping. In your case there is a limited
number of cases that need special care, and you can actually define most
of them pretty well; hence edit distance is, by definition, much more
than you actually need.
Since all that is necessary here is to identify known spelling
differences (or errors), perhaps you could use some of what I call
"tolerated lookup" using automata on a Lucene index. Such a lookup is
what I use in HebMorph[1] to find words in a dictionary of Hebrew words,
where spelling varies in much the same way as with your street names. I
use it on a radix tree, but the idea could be adapted to what you're
looking for fairly easily.
The idea is quite simple: first you do an exact lookup on your
dictionary (a TermQuery in your case). If you get no results, you do a
tolerant lookup, where a "tolerator function" is consulted before moving
on to the next leaf. Those functions decide whether or not to allow the
move based on a set of rules (position, characters before and after,
etc.).
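To make that concrete, here is a minimal Java sketch of the two-phase
lookup. The names TolerantTrie and Tolerator are made up for
illustration, and it only handles substitutions, to keep it short; the
real crawler is in the links below.

import java.util.*;

class TolerantTrie {
    Map<Character, TolerantTrie> children = new HashMap<>();
    String value; // non-null marks a complete dictionary entry

    interface Tolerator {
        // decide whether the crawler may descend this edge anyway,
        // based on the position in the query and the surrounding chars
        boolean allow(String query, int pos, char edgeChar);
    }

    // phase 1: exact lookup; phase 2: tolerant lookup only on a miss
    String lookup(String query, List<Tolerator> tolerators) {
        String exact = find(query, 0, Collections.<Tolerator>emptyList());
        return exact != null ? exact : find(query, 0, tolerators);
    }

    private String find(String q, int pos, List<Tolerator> tols) {
        if (pos == q.length()) return value;
        for (Map.Entry<Character, TolerantTrie> e : children.entrySet()) {
            boolean ok = e.getKey() == q.charAt(pos);
            if (!ok) {
                for (Tolerator t : tols) {
                    if (t.allow(q, pos, e.getKey())) { ok = true; break; }
                }
            }
            if (ok) {
                String hit = e.getValue().find(q, pos + 1, tols);
                if (hit != null) return hit;
            }
        }
        return null;
    }
}

A tolerator for street names might, for example, allow matching 'c'
where the query has 'k', but only at certain positions in the word.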
You can see some examples for the Hebrew language here (.NET, but it
should still be readable for Java people):
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/DataStructures/DictRadix.cs
-- radix implementation with tolerant lookup (TolerantLookupCrawler)
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/LookupTolerators.cs
-- tolerator functions
Tolerant lookup on my radix is very fast, and it should be the same on a
Lucene index.
Itamar.
[1] http://www.code972.com/blog/hebmorph/
On 26/7/2010 8:56 PM, Robert Muir wrote:
Nah, it's an analyzer, so you can just use a TermQuery (fast: exact
match). At query and index time it just maps stuff to a key... typically
you would just put this in a separate field.
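Roughly like this, sketched against the current PhoneticFilter in
modules/analyzers/phonetic. The field content and exact
constructor/Analyzer APIs vary by Lucene version, so treat this as
illustrative only:

import java.io.StringReader;
import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.phonetic.PhoneticFilter;

public class PhoneticFieldDemo {
    public static TokenStream phoneticStream(String text) {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader(text));
        // inject=false replaces each token with its phonetic key, so
        // the separate phonetic field only ever holds keys
        return new PhoneticFilter(tok, new DoubleMetaphone(), false);
    }
}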
You can combine this with your edit distance query in a BooleanQuery;
for example, the edit distance can handle your le[o]minster just fine.
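For example (a sketch only: "street" and "street_phonetic" are made-up
field names, and this uses the newer BooleanQuery.Builder API, whereas
older versions construct BooleanQuery directly):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CombinedQueryDemo {
    public static Query streetQuery(String phoneticKey, String rawText) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        // fast exact match on the phonetic key in the separate field
        bq.add(new TermQuery(new Term("street_phonetic", phoneticKey)),
               BooleanClause.Occur.SHOULD);
        // edit distance catches typos phonetics won't, e.g. le[o]minster
        bq.add(new FuzzyQuery(new Term("street", rawText)),
               BooleanClause.Occur.SHOULD);
        return bq.build();
    }
}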
I think this would be much better for you. I wouldn't abuse Levenshtein
for phonetics stuff; it's not designed for that.
On Mon, Jul 26, 2010 at 1:44 PM, [email protected] wrote:
Clearly you haven’t been in the Northeast much. Try “Worcester”
vs. “wuster”, or “Leominster” vs. “leminster”. It’s also likely
to be a challenge to come up with the right phonetics for any
given proper location name. It’s even worse in Britain, or
countries where the phonetic rules may be a hodgepodge of
different colonial influences.
That having been said, if there exists a “PhoneticQuery” object
that does all this using the automaton logic under the covers, I
think it would be worth a serious look.
Karl
*From:* ext Robert Muir [mailto:[email protected]]
*Sent:* Monday, July 26, 2010 1:24 PM
*To:* [email protected]
*Subject:* Re: LevenshteinFilter proposal
On Mon, Jul 26, 2010 at 1:13 PM, [email protected] wrote:
What I want to capture is situations where people misspell things
in roughly a phonetic way. For example, “Tchaikovsky Avenue”
might be misspelled as “Chicovsky Avenue”. Modules that do
phonetic mapping are possible, but you’d have to somehow generate a
phonetic database of (say) street names, worldwide. Good luck getting
hold of that kind of data anywhere. ;-) In the absence of such data, an
LD (Levenshtein distance) match will have to do, but it will almost
certainly need to be greater than 2.
I added this to TestPhoneticFilter and it passes:

assertAlgorithm(new DoubleMetaphone(), false,
    "Tchaikovsky Chicovsky", new String[] { "XKFS", "XKFS" });
So if you want to give me all your street names, I can sell you a
phonetic database, or you can use the filters in
modules/analyzers/phonetic, which have a bunch of different configurable
algorithms :)
--
Robert Muir
[email protected]
--
Robert Muir
[email protected]