On Mon, Jul 26, 2010 at 12:51 PM, <[email protected]> wrote: > Is the automaton creation code checked in anywhere? >
yes, its generated from lucene/build.xml (createLevAutomata task) > Also, if it’s not possible to generate the automaton table on the fly, > perhaps it could be included in an optional jar (whose existence would be > queried by reflection). The reason I’m interested in LD’s of three, > especially, is because I can well imagine phonetic misspellings that would > have distances on that order. > I really wouldnt recommend this (though i am not sure what exactly you are trying to do), instead i would recommend analyzers/phonetic to deal with the phonetic stuff. > > > Karl > > > > > > *From:* ext Robert Muir [mailto:[email protected]] > *Sent:* Monday, July 26, 2010 11:49 AM > > *To:* [email protected] > *Subject:* Re: LevenshteinFilter proposal > > > > I wanted to show you what i mean, just so you know: > > > > here is what 'ant createLevAutomata' does now: > > createLevAutomata: > > [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB] > > [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB] > > > > BUILD SUCCESSFUL > > Total time: 4 seconds > > > > I modified it with the patch below to generate 3 and 4, just for kicks, and > ran it again... it took it 7 minutes to generate lev3, its still running on > lev4 (might take hours or days, i dont actually know) > > > > createLevAutomata: > > [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB] > > [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB] > > [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB] > > ... > > > > Index: build.xml > > =================================================================== > > --- build.xml (revision 959288) > > +++ build.xml (working copy) > > @@ -533,6 +533,8 @@ > > <target name="createLevAutomata" > depends="check-moman,clone-moman,pull-moman"> > > <createLevAutomaton n="1"/> > > <createLevAutomaton n="2"/> > > + <createLevAutomaton n="3"/> > > + <createLevAutomaton n="4"/> > > </target> > > > > <target name="check-moman"> > > > > On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless < > [email protected]> wrote: > > On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <[email protected]> wrote: > > On Mon, Jul 26, 2010 at 11:30 AM, <[email protected]> wrote: > >> > >> It occurs to me that the proper static initializer code might well be > able > >> to generate distances of 3, 4 or whatever, without bloating the jar. > > > > I disagree: generating these distances is a non-trivial algorithm and > takes > > minutes or hours to compute... > > Furthermore, our own implementation for this algo is in Python :) > > > >> Nevertheless, the real question of import to me right now is: what > >> “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? > The > >> maximum “minimumDistance” the query accepts is 1.0. > > > > right, fuzzyquery uses a strange formula that doesnt tie directly to edit > > distance (unless you factor in length). > > you can do this easier like this: > > LevenshteinAutomata builder = new LevenshteinAutomata("foobar"); > > Automaton ed2 = builder.toAutomaton(2); > > // the term doesnt really matter, just for the field name and toString, > etc. > > AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"), > > ed2); > > > >> > >> > >> > >> Karl > >> > >> > >> > >> > >> > >> From: ext Robert Muir [mailto:[email protected]] > >> Sent: Friday, July 23, 2010 11:22 AM > >> > >> To: [email protected] > >> Subject: Re: LevenshteinFilter proposal > >> > >> > >> > >> Well, there are two main things involved: > >> > >> 1. number of terms seeked to in the term enum > >> > >> 2. expense of the comparison itself. > >> > >> > >> > >> one challenge is the construction of a DFA LevK(x) that recognizes all > >> words within edit distance <= k of x is an expensive operation. > >> > >> This is because of the nature of the computation (traditionally its > >> dynamic programming with nasty runtime). > >> > >> > >> > >> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 > to > >> offline the nasty part so its linear time: O(n) where n is the length of > the > >> query term. > >> > >> But these offline-computated tables get pretty large quick, and so we > only > >> include 1,2 to keep the lucene jar file down. > >> > >> > >> > >> additionally for n > 2 it starts to be a ton of termenum seeks... at > some > >> of these high distances its faster to just next() thru the terms and > >> compare. > >> > >> > >> > >> we might be able to speed it up more by including say a table for n=3 > (at > >> the expense of increased jar size), but not using this n=3 table for #1, > >> only to speed up #2. but its diminishing returns here... > >> > >> > >> > >> On Fri, Jul 23, 2010 at 11:09 AM, <[email protected]> wrote: > >> > >> Glad I asked. > >> > >> > >> > >> I would think that the automaton would be superior even for larger edit > >> distances than 1 or 2 than the equivalent “crappy” algorithm. But maybe > I > >> don’t understand something. ;-) > >> > >> > >> > >> Karl > >> > >> > >> > >> > >> > >> From: ext Robert Muir [mailto:[email protected]] > >> Sent: Friday, July 23, 2010 11:05 AM > >> > >> To: [email protected] > >> > >> Subject: Re: LevenshteinFilter proposal > >> > >> > >> > >> this is actually done in trunk. > >> > >> > >> > >> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses > >> automaton. > >> > >> > >> > >> for higher distances it uses the crappy "brute force" method. > >> > >> but, higher distances still get accelerated if you use a reasonable > >> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024). > >> > >> > >> > >> > >> > >> On Fri, Jul 23, 2010 at 10:59 AM, <[email protected]> wrote: > >> > >> Thanks! > >> > >> > >> > >> FuzzyQuery will do for my purposes, for the interim. But I suspect that > >> FuzzyQuery could be made a lot more efficient if it were rebuilt on top > of > >> Automaton, no? I understand that this would be a trunk project. > >> > >> > >> > >> Karl > >> > >> > >> > >> > >> > >> From: ext Uwe Schindler [mailto:[email protected]] > >> Sent: Friday, July 23, 2010 10:45 AM > >> > >> To: [email protected] > >> > >> Subject: RE: LevenshteinFilter proposal > >> > >> > >> > >> Automaton is only in Lucene/Solr Trunk. To get a filter out of > FuzzyQuery, > >> use MultiTermQueryWrapperFilter(new FuzzyQuery(…)) > >> > >> > >> > >> ----- > >> > >> Uwe Schindler > >> > >> H.-H.-Meier-Allee 63, D-28213 Bremen > >> > >> http://www.thetaphi.de > >> > >> eMail: [email protected] > >> > >> > >> > >> From: [email protected] [mailto:[email protected]] > >> Sent: Friday, July 23, 2010 4:25 PM > >> To: [email protected] > >> Subject: LevenshteinFilter proposal > >> > >> > >> > >> Hi Folks, > >> > >> I’m very interested in using (or developing!) a Levenshtein Filter > within > >> the family of Solr Filter objects. I don’t see such a class today > anywhere. > >> I see how the AutomatonQuery object would permit such a thing to be > built, > >> but to date I don’t know of anyone who has built one. Do you? If not, > I’m > >> willing to give it a whirl. Also, AutomatonQuery doesn’t seem to come > up > >> when I look for it in the javadocs for Lucene – can you point me in the > >> correct direction? > >> > >> Thanks! > >> Karl > >> > >> > >> > >> > >> > >> > >> -- > >> Robert Muir > >> [email protected] > >> > >> > >> -- > >> Robert Muir > >> [email protected] > > > > > > -- > > Robert Muir > > [email protected] > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > > > > -- > Robert Muir > [email protected] > -- Robert Muir [email protected]
