Just a thought: edit distance is meant for overcoming spelling errors in
the form of assimilation or mistyping. In your case there is a limited
number of cases that need special care, and you can actually define most
of them pretty well; hence edit distance is, by definition, much more
than you actually need.
Since all that is necessary here is to identify known spelling
differences (or errors), perhaps you could use some of what I call
"tolerated lookup" using automata on a Lucene index. Such a lookup is
what I use in HebMorph[1] to find words in a dictionary of Hebrew words,
where spelling varies in much the same way as with your street names. I
use it on a radix tree, but the idea could be adapted to what you're
looking for fairly easily.
The idea is quite simple: first you do an exact lookup on your
dictionary (a TermQuery in your case). If you get no results, you do a
tolerant lookup, where a "tolerator function" is consulted before moving
on to the next leaf. Those functions decide whether or not to allow the
move based on a set of rules (position, characters before and after,
etc.).
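To make that concrete, here is a minimal Java sketch of the two-phase
lookup. The names TolerantTrie and Tolerator are made up for
illustration, and it only handles substitutions, to keep it short; the
real crawler is in the links below.

import java.util.*;

class TolerantTrie {
    Map<Character, TolerantTrie> children = new HashMap<>();
    String value; // non-null marks a complete dictionary entry

    interface Tolerator {
        // decide whether the crawler may descend this edge anyway,
        // based on the position in the query and the surrounding chars
        boolean allow(String query, int pos, char edgeChar);
    }

    // phase 1: exact lookup; phase 2: tolerant lookup only on a miss
    String lookup(String query, List<Tolerator> tolerators) {
        String exact = find(query, 0, Collections.<Tolerator>emptyList());
        return exact != null ? exact : find(query, 0, tolerators);
    }

    private String find(String q, int pos, List<Tolerator> tols) {
        if (pos == q.length()) return value;
        for (Map.Entry<Character, TolerantTrie> e : children.entrySet()) {
            boolean ok = e.getKey() == q.charAt(pos);
            if (!ok) {
                for (Tolerator t : tols) {
                    if (t.allow(q, pos, e.getKey())) { ok = true; break; }
                }
            }
            if (ok) {
                String hit = e.getValue().find(q, pos + 1, tols);
                if (hit != null) return hit;
            }
        }
        return null;
    }
}

A tolerator for street names might, for example, allow matching 'c'
where the query has 'k', but only at certain positions in the word.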
You can see some examples for the Hebrew language here (.NET, but it
should still be readable for Java people):
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/DataStructures/DictRadix.cs
-- radix implementation with tolerant lookup (TolerantLookupCrawler)
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/LookupTolerators.cs
-- tolerator functions
Tolerant lookup on my radix is very fast, and it should be the same on a
Lucene index.
Itamar.
[1] http://www.code972.com/blog/hebmorph/
On 26/7/2010 8:56 PM, Robert Muir wrote:
Nah, it's an analyzer, so you can just use a TermQuery (fast: exact
match). At query and index time it just maps stuff to a key... typically
you would just put this in a separate field.
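Roughly like this, sketched against the current PhoneticFilter in
modules/analyzers/phonetic. The field content and exact
constructor/Analyzer APIs vary by Lucene version, so treat this as
illustrative only:

import java.io.StringReader;
import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.phonetic.PhoneticFilter;

public class PhoneticFieldDemo {
    public static TokenStream phoneticStream(String text) {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader(text));
        // inject=false replaces each token with its phonetic key, so
        // the separate phonetic field only ever holds keys
        return new PhoneticFilter(tok, new DoubleMetaphone(), false);
    }
}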
You can combine this with your edit distance query in a BooleanQuery;
for example, the edit distance can handle your le[o]minster just fine.
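For example (a sketch only: "street" and "street_phonetic" are made-up
field names, and this uses the newer BooleanQuery.Builder API, whereas
older versions construct BooleanQuery directly):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CombinedQueryDemo {
    public static Query streetQuery(String phoneticKey, String rawText) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        // fast exact match on the phonetic key in the separate field
        bq.add(new TermQuery(new Term("street_phonetic", phoneticKey)),
               BooleanClause.Occur.SHOULD);
        // edit distance catches typos phonetics won't, e.g. le[o]minster
        bq.add(new FuzzyQuery(new Term("street", rawText)),
               BooleanClause.Occur.SHOULD);
        return bq.build();
    }
}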
I think this would be much better for you. I wouldn't abuse Levenshtein
for phonetics stuff; it's not designed for that.
On Mon, Jul 26, 2010 at 1:44 PM, [email protected] wrote:
Clearly you haven’t been in the Northeast much. Try “Worcester”
vs. “wuster”, or “Leominster” vs. “leminster”. It’s also likely
to be a challenge to come up with the right phonetics for any
given proper location name. It’s even worse in Britain, or
countries where the phonetic rules may be a hodgepodge of
different colonial influences.
That having been said, if there exists a “PhoneticQuery” object
that does all this using the automaton logic under the covers, I
think it would be worth a serious look.
Karl
*From:* ext Robert Muir [mailto:[email protected]]
*Sent:* Monday, July 26, 2010 1:24 PM
*To:* [email protected]
*Subject:* Re: LevenshteinFilter proposal
On Mon, Jul 26, 2010 at 1:13 PM, [email protected] wrote:
What I want to capture is situations where people misspell things
in roughly a phonetic way. For example, “Tchaikovsky Avenue”
might be misspelled as “Chicovsky Avenue”. Modules that do
phonetic mapping are possible, but you’d have to somehow generate a
phonetic database of (say) street names, worldwide. Good luck getting
hold of that kind of data anywhere. ;-) In the absence of such data, an
LD (Levenshtein distance) match will have to do, but it will almost
certainly need to be greater than 2.
I added this to TestPhoneticFilter and it passes:

assertAlgorithm(new DoubleMetaphone(), false,
    "Tchaikovsky Chicovsky", new String[] { "XKFS", "XKFS" });
So if you want to give me all your street names, I can sell you a
phonetic database, or you can use the filters in
modules/analyzers/phonetic, which have a bunch of different configurable
algorithms :)
--
Robert Muir
[email protected]
--
Robert Muir
[email protected]