RE: LevenshteinFilter proposal

karl.wright Mon, 26 Jul 2010 08:31:02 -0700

It occurs to me that the proper static initializer code might well be able to 
generate distances of 3, 4 or whatever, without bloating the jar.


Nevertheless, the real question of import to me right now is: what 
“minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The 
maximum “minimumDistance” the query accepts is 1.0.

Karl


From: ext Robert Muir [mailto:[email protected]]
Sent: Friday, July 23, 2010 11:22 AM
To: [email protected]
Subject: Re: LevenshteinFilter proposal

Well, there are two main things involved:
1. number of terms seeked to in the term enum
2. expense of the comparison itself.

one challenge is the construction of a DFA LevK(x) that recognizes all words 
within edit distance <= k of x is an expensive operation.
This is because of the nature of the computation (traditionally its dynamic 
programming with nasty runtime).

We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to 
offline the nasty part so its linear time: O(n) where n is the length of the 
query term.
But these offline-computated tables get pretty large quick, and so we only 
include 1,2 to keep the lucene jar file down.

additionally for n > 2 it starts to be a ton of termenum seeks... at some of 
these high distances its faster to just next() thru the terms and compare.

we might be able to speed it up more by including say a table for n=3 (at the 
expense of increased jar size), but not using this n=3 table for #1, only to 
speed up #2. but its diminishing returns here...

On Fri, Jul 23, 2010 at 11:09 AM, 
<[email protected]<mailto:[email protected]>> wrote:
Glad I asked.

I would think that the automaton would be superior even for larger edit 
distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I 
don’t understand something. ;-)

Karl


From: ext Robert Muir [mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, July 23, 2010 11:05 AM

To: [email protected]<mailto:[email protected]>
Subject: Re: LevenshteinFilter proposal

this is actually done in trunk.

In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses 
automaton.

for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable 
'maxExpansions' to FuzzyQuery... the default is quite bad (1024).


On Fri, Jul 23, 2010 at 10:59 AM, 
<[email protected]<mailto:[email protected]>> wrote:
Thanks!

FuzzyQuery will do for my purposes, for the interim.  But I suspect that 
FuzzyQuery could be made a lot more efficient if it were rebuilt on top of 
Automaton, no?  I understand that this would be a trunk project.

Karl


From: ext Uwe Schindler [mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, July 23, 2010 10:45 AM

To: [email protected]<mailto:[email protected]>
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use 
MultiTermQueryWrapperFilter(new FuzzyQuery(…))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: [email protected]<mailto:[email protected]>

From: [email protected]<mailto:[email protected]> 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, July 23, 2010 4:25 PM
To: [email protected]<mailto:[email protected]>
Subject: LevenshteinFilter proposal

Hi Folks,

I’m very interested in using (or developing!) a Levenshtein Filter within the 
family of Solr Filter objects. I don’t see such a class today anywhere. I see 
how the AutomatonQuery object would permit such a thing to be built, but to 
date I don’t know of anyone who has built one. Do you?  If not, I’m willing to 
give it a whirl.  Also, AutomatonQuery doesn’t seem to come up when I look for 
it in the javadocs for Lucene – can you point me in the correct direction?
Thanks!
Karl





--
Robert Muir
[email protected]<mailto:[email protected]>



--
Robert Muir
[email protected]<mailto:[email protected]>

RE: LevenshteinFilter proposal

Reply via email to