It occurs to me that the proper static initializer code might well be able to generate distances of 3, 4 or whatever, without bloating the jar.
Nevertheless, the real question of import to me right now is: what “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? The maximum “minimumDistance” the query accepts is 1.0. Karl From: ext Robert Muir [mailto:[email protected]] Sent: Friday, July 23, 2010 11:22 AM To: [email protected] Subject: Re: LevenshteinFilter proposal Well, there are two main things involved: 1. number of terms seeked to in the term enum 2. expense of the comparison itself. one challenge is the construction of a DFA LevK(x) that recognizes all words within edit distance <= k of x is an expensive operation. This is because of the nature of the computation (traditionally its dynamic programming with nasty runtime). We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to offline the nasty part so its linear time: O(n) where n is the length of the query term. But these offline-computated tables get pretty large quick, and so we only include 1,2 to keep the lucene jar file down. additionally for n > 2 it starts to be a ton of termenum seeks... at some of these high distances its faster to just next() thru the terms and compare. we might be able to speed it up more by including say a table for n=3 (at the expense of increased jar size), but not using this n=3 table for #1, only to speed up #2. but its diminishing returns here... On Fri, Jul 23, 2010 at 11:09 AM, <[email protected]<mailto:[email protected]>> wrote: Glad I asked. I would think that the automaton would be superior even for larger edit distances than 1 or 2 than the equivalent “crappy” algorithm. But maybe I don’t understand something. ;-) Karl From: ext Robert Muir [mailto:[email protected]<mailto:[email protected]>] Sent: Friday, July 23, 2010 11:05 AM To: [email protected]<mailto:[email protected]> Subject: Re: LevenshteinFilter proposal this is actually done in trunk. In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses automaton. for higher distances it uses the crappy "brute force" method. but, higher distances still get accelerated if you use a reasonable 'maxExpansions' to FuzzyQuery... the default is quite bad (1024). On Fri, Jul 23, 2010 at 10:59 AM, <[email protected]<mailto:[email protected]>> wrote: Thanks! FuzzyQuery will do for my purposes, for the interim. But I suspect that FuzzyQuery could be made a lot more efficient if it were rebuilt on top of Automaton, no? I understand that this would be a trunk project. Karl From: ext Uwe Schindler [mailto:[email protected]<mailto:[email protected]>] Sent: Friday, July 23, 2010 10:45 AM To: [email protected]<mailto:[email protected]> Subject: RE: LevenshteinFilter proposal Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(…)) ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de<http://www.thetaphi.de/> eMail: [email protected]<mailto:[email protected]> From: [email protected]<mailto:[email protected]> [mailto:[email protected]<mailto:[email protected]>] Sent: Friday, July 23, 2010 4:25 PM To: [email protected]<mailto:[email protected]> Subject: LevenshteinFilter proposal Hi Folks, I’m very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don’t see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don’t know of anyone who has built one. Do you? If not, I’m willing to give it a whirl. Also, AutomatonQuery doesn’t seem to come up when I look for it in the javadocs for Lucene – can you point me in the correct direction? Thanks! Karl -- Robert Muir [email protected]<mailto:[email protected]> -- Robert Muir [email protected]<mailto:[email protected]>
