Re: LevenshteinFilter proposal

Robert Muir Mon, 26 Jul 2010 10:07:58 -0700

On Mon, Jul 26, 2010 at 12:51 PM, <[email protected]> wrote:

>  Is the automaton creation code checked in anywhere?
>


yes, its generated from lucene/build.xml (createLevAutomata task)

> Also, if it’s not possible to generate the automaton table on the fly,
> perhaps it could be included in an optional jar (whose existence would be
> queried by reflection).  The reason I’m interested in LD’s of three,
> especially, is because I can well imagine phonetic misspellings that would
> have distances on that order.
>

I really wouldnt recommend this (though i am not sure what exactly you are
trying to do), instead i would recommend analyzers/phonetic to deal with the
phonetic stuff.

>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:[email protected]]
> *Sent:* Monday, July 26, 2010 11:49 AM
>
> *To:* [email protected]
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> I wanted to show you what i mean, just so you know:
>
>
>
> here is what 'ant createLevAutomata' does now:
>
> createLevAutomata:
>
>      [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
>
>      [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
>
>
>
> BUILD SUCCESSFUL
>
> Total time: 4 seconds
>
>
>
> I modified it with the patch below to generate 3 and 4, just for kicks, and
> ran it again... it took it 7 minutes to generate lev3, its still running on
> lev4 (might take hours or days, i dont actually know)
>
>
>
> createLevAutomata:
>
>      [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
>
>      [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
>
>      [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
>
> ...
>
>
>
> Index: build.xml
>
> ===================================================================
>
> --- build.xml   (revision 959288)
>
> +++ build.xml   (working copy)
>
> @@ -533,6 +533,8 @@
>
>    <target name="createLevAutomata"
> depends="check-moman,clone-moman,pull-moman">
>
>         <createLevAutomaton n="1"/>
>
>         <createLevAutomaton n="2"/>
>
> +       <createLevAutomaton n="3"/>
>
> +       <createLevAutomaton n="4"/>
>
>    </target>
>
>
>
>    <target name="check-moman">
>
>
>
> On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <
> [email protected]> wrote:
>
> On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <[email protected]> wrote:
> > On Mon, Jul 26, 2010 at 11:30 AM, <[email protected]> wrote:
> >>
> >> It occurs to me that the proper static initializer code might well be
> able
> >> to generate distances of 3, 4 or whatever, without bloating the jar.
> >
> > I disagree: generating these distances is a non-trivial algorithm and
> takes
> > minutes or hours to compute...
>
> Furthermore, our own implementation for this algo is in Python :)
>
>
> >> Nevertheless, the real question of import to me right now is: what
> >> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?
> The
> >> maximum “minimumDistance” the query accepts is 1.0.
> >
> > right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> > distance (unless you factor in length).
> > you can do this easier like this:
> > LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> > Automaton ed2 = builder.toAutomaton(2);
> > // the term doesnt really matter, just for the field name and toString,
> etc.
> > AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> > ed2);
> >
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 11:22 AM
> >>
> >> To: [email protected]
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Well, there are two main things involved:
> >>
> >> 1. number of terms seeked to in the term enum
> >>
> >> 2. expense of the comparison itself.
> >>
> >>
> >>
> >> one challenge is the construction of a DFA LevK(x) that recognizes all
> >> words within edit distance <= k of x is an expensive operation.
> >>
> >> This is because of the nature of the computation (traditionally its
> >> dynamic programming with nasty runtime).
> >>
> >>
> >>
> >> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940
>  to
> >> offline the nasty part so its linear time: O(n) where n is the length of
> the
> >> query term.
> >>
> >> But these offline-computated tables get pretty large quick, and so we
> only
> >> include 1,2 to keep the lucene jar file down.
> >>
> >>
> >>
> >> additionally for n > 2 it starts to be a ton of termenum seeks... at
> some
> >> of these high distances its faster to just next() thru the terms and
> >> compare.
> >>
> >>
> >>
> >> we might be able to speed it up more by including say a table for n=3
> (at
> >> the expense of increased jar size), but not using this n=3 table for #1,
> >> only to speed up #2. but its diminishing returns here...
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 11:09 AM, <[email protected]> wrote:
> >>
> >> Glad I asked.
> >>
> >>
> >>
> >> I would think that the automaton would be superior even for larger edit
> >> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe
> I
> >> don’t understand something. ;-)
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 11:05 AM
> >>
> >> To: [email protected]
> >>
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> this is actually done in trunk.
> >>
> >>
> >>
> >> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> >> automaton.
> >>
> >>
> >>
> >> for higher distances it uses the crappy "brute force" method.
> >>
> >> but, higher distances still get accelerated if you use a reasonable
> >> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 10:59 AM, <[email protected]> wrote:
> >>
> >> Thanks!
> >>
> >>
> >>
> >> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> >> FuzzyQuery could be made a lot more efficient if it were rebuilt on top
> of
> >> Automaton, no?  I understand that this would be a trunk project.
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Uwe Schindler [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 10:45 AM
> >>
> >> To: [email protected]
> >>
> >> Subject: RE: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Automaton is only in Lucene/Solr Trunk. To get a filter out of
> FuzzyQuery,
> >> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
> >>
> >>
> >>
> >> -----
> >>
> >> Uwe Schindler
> >>
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>
> >> http://www.thetaphi.de
> >>
> >> eMail: [email protected]
> >>
> >>
> >>
> >> From: [email protected] [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 4:25 PM
> >> To: [email protected]
> >> Subject: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Hi Folks,
> >>
> >> I’m very interested in using (or developing!) a Levenshtein Filter
> within
> >> the family of Solr Filter objects. I don’t see such a class today
> anywhere.
> >> I see how the AutomatonQuery object would permit such a thing to be
> built,
> >> but to date I don’t know of anyone who has built one. Do you?  If not,
> I’m
> >> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come
> up
> >> when I look for it in the javadocs for Lucene – can you point me in the
> >> correct direction?
> >>
> >> Thanks!
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> [email protected]
> >>
> >>
> >> --
> >> Robert Muir
> >> [email protected]
> >
> >
> > --
> > Robert Muir
> > [email protected]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
>
>
> --
> Robert Muir
> [email protected]
>



-- 
Robert Muir
[email protected]

Re: LevenshteinFilter proposal

Reply via email to