RE: LevenshteinFilter proposal

karl.wright Mon, 26 Jul 2010 09:52:01 -0700

Is the automaton creation code checked in anywhere?
Also, if it’s not possible to generate the automaton table on the fly, perhaps 
it could be included in an optional jar (whose existence would be queried by 
reflection).  The reason I’m interested in LD’s of three, especially, is 
because I can well imagine phonetic misspellings that would have distances on 
that order.


Karl


From: ext Robert Muir [mailto:[email protected]]
Sent: Monday, July 26, 2010 11:49 AM
To: [email protected]
Subject: Re: LevenshteinFilter proposal

I wanted to show you what i mean, just so you know:

here is what 'ant createLevAutomata' does now:
createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]

BUILD SUCCESSFUL
Total time: 4 seconds

I modified it with the patch below to generate 3 and 4, just for kicks, and ran 
it again... it took it 7 minutes to generate lev3, its still running on lev4 
(might take hours or days, i dont actually know)

createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
     [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...

Index: build.xml
===================================================================
--- build.xml   (revision 959288)
+++ build.xml   (working copy)
@@ -533,6 +533,8 @@
   <target name="createLevAutomata" 
depends="check-moman,clone-moman,pull-moman">
        <createLevAutomaton n="1"/>
        <createLevAutomaton n="2"/>
+       <createLevAutomaton n="3"/>
+       <createLevAutomaton n="4"/>
   </target>

   <target name="check-moman">

On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless 
<[email protected]<mailto:[email protected]>> wrote:
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir 
<[email protected]<mailto:[email protected]>> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, 
> <[email protected]<mailto:[email protected]>> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: [email protected]<mailto:[email protected]>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, 
>> <[email protected]<mailto:[email protected]>> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: [email protected]<mailto:[email protected]>
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, 
>> <[email protected]<mailto:[email protected]>> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: [email protected]<mailto:[email protected]>
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: [email protected]<mailto:[email protected]>
>>
>>
>>
>> From: [email protected]<mailto:[email protected]> 
>> [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: [email protected]<mailto:[email protected]>
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> [email protected]<mailto:[email protected]>
>>
>>
>> --
>> Robert Muir
>> [email protected]<mailto:[email protected]>
>
>
> --
> Robert Muir
> [email protected]<mailto:[email protected]>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: 
[email protected]<mailto:[email protected]>
For additional commands, e-mail: 
[email protected]<mailto:[email protected]>



--
Robert Muir
[email protected]<mailto:[email protected]>

RE: LevenshteinFilter proposal

Reply via email to