Re: Building FST-like automaton queries

Alan Woodward Tue, 28 Feb 2012 08:31:00 -0800

Wow, that was quick!  Thanks!

I don't think we'll have too many terms per query term - as I said earlier, 
we're restricting the expansions to those with an edit distance of 1.  But this 
looks cool anyway.


On 28 Feb 2012, at 16:01, Dawid Weiss wrote:

> The issue has a patch -- feel free to try it out.
> 
> Dawid
> 
> On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> I filed an issue for that.
>> https://issues.apache.org/jira/browse/LUCENE-3832
>> 
>> I'll try to port it myself actually. It shouldn't be a big problem.
>> 
>> Dawid
>> 
>> On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>>> the insert/delete/transposition changes...
>>> 
>>> Is the number of edits smallish?  Ie you're not concerned about
>>> combinatoric explosion of step 1?
>>> 
>>> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use
>>> BasicAutomata.makeString(String) on each of your expanded terms, then
>>> BasicOperations.union on all of those automata to make a single
>>> automaton accepting all your expanded terms, then likely call
>>> .determinize() on the resulting automaton (maybe also .minimize() but
>>> I think that may not help).  Then pass that automaton to AQ.
>>> 
>>> We don't yet have a way to drive a query from an FST, but that would
>>> be an interesting addition.  EG you could then support weights as
>>> well, to decide how the terms are scored (if certain OCR errors are
>>> more likely than others).
>>> 
>>> Mike McCandless
>>> 
>>> http://blog.mikemccandless.com
>>> 
>>> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
>>> <alan.woodw...@romseysoftware.co.uk> wrote:
>>>> Hello,
>>>> 
>>>> I'm trying to create a Lucene Query that will take a term and expand it to 
>>>> include common OCR errors (for example, 'cl' is often misread as 'd', so a 
>>>> search for 'clog' should also hit 'dog').  My plan is to do this by 
>>>> generating all the possible variants of a term, using an existing list of 
>>>> errors, and then somehow mapping this into an AutomatonQuery.  I've been 
>>>> looking around the o.a.l.util.automaton and o.a.l.util.fst packages on 
>>>> trunk, and I *think* that this is possible, but I'm so far failing to work 
>>>> out how to put the various bits together.
>>>> 
>>>> I'm thinking it should work like this:
>>>> 1) expand query term to sorted list of possible matches
>>>> 2) create an FST over those matches
>>>> 3) plug this FST into an AutomatonQuery subclass.
>>>> 
>>>> 1) is easy.  It's 2) and 3) I'm having trouble with.
>>>> 
>>>> All help gratefully received!
>>>> 
>>>> Thanks,
>>>> 
>>>> Alan Woodward
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Building FST-like automaton queries

Reply via email to