Wow, that was quick! Thanks! I don't think we'll have too many expansions per query term - as I said earlier, we're restricting them to those within an edit distance of 1. But this looks cool anyway.
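
For anyone following along, here is a minimal sketch of what the step 1 expansion from the quoted message below might look like when each variant is limited to a single substitution. It is only an illustration under stated assumptions: the class name OcrExpansion, the expand helper and the confusion map are hypothetical and do not come from the thread or from Lucene, and "edit distance of 1" is read here as "exactly one entry from the error list applied at one position".

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class OcrExpansion {

    // Hypothetical helper: returns the original term plus every variant
    // obtained by applying exactly one confusion (e.g. "cl" -> "d") at
    // exactly one position. The TreeSet keeps the result sorted and unique.
    static SortedSet<String> expand(String term, Map<String, List<String>> confusions) {
        SortedSet<String> variants = new TreeSet<>();
        variants.add(term);
        for (Map.Entry<String, List<String>> entry : confusions.entrySet()) {
            String from = entry.getKey();
            if (from.isEmpty()) {
                continue; // an empty pattern would match at every position
            }
            int at = term.indexOf(from);
            while (at >= 0) {
                for (String to : entry.getValue()) {
                    // Replace only this one occurrence, leaving the rest untouched.
                    variants.add(term.substring(0, at) + to + term.substring(at + from.length()));
                }
                at = term.indexOf(from, at + 1);
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Illustrative confusion list; a real one would come from OCR error data.
        Map<String, List<String>> confusions = Map.of("cl", List.of("d"), "rn", List.of("m"));
        System.out.println(expand("clog", confusions)); // prints [clog, dog]
    }
}
```

Because only one substitution is applied per variant, the number of expansions grows with the number of matching positions rather than combinatorially. Each string in the result would then be handed to the automaton calls Mike names in the quoted message below.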

On 28 Feb 2012, at 16:01, Dawid Weiss wrote:

> The issue has a patch -- feel free to try it out.
>
> Dawid
>
> On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
>> I filed an issue for that.
>> https://issues.apache.org/jira/browse/LUCENE-3832
>>
>> I'll try to port it myself actually. It shouldn't be a big problem.
>>
>> Dawid
>>
>> On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>>> the insert/delete/transposition changes...
>>>
>>> Is the number of edits smallish? Ie you're not concerned about
>>> combinatoric explosion of step 1?
>>>
>>> For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
>>> BasicAutomata.makeString(String) on each of your expanded terms, then
>>> BasicOperations.union on all of those automata to make a single
>>> automaton accepting all your expanded terms, then likely call
>>> .determinize() on the resulting automaton (maybe also .minimize() but
>>> I think that may not help). Then pass that automaton to AQ.
>>>
>>> We don't yet have a way to drive a query from an FST, but that would
>>> be an interesting addition. EG you could then support weights as
>>> well, to decide how the terms are scored (if certain OCR errors are
>>> more likely than others).
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Tue, Feb 28, 2012 at 7:33 AM, Alan Woodward
>>> <alan.woodw...@romseysoftware.co.uk> wrote:
>>>> Hello,
>>>>
>>>> I'm trying to create a Lucene Query that will take a term and expand it to
>>>> include common OCR errors (for example, 'cl' is often misread as 'd', so a
>>>> search for 'clog' should also hit 'dog'). My plan is to do this by
>>>> generating all the possible variants of a term, using an existing list of
>>>> errors, and then somehow mapping this into an AutomatonQuery. I've been
>>>> looking around the o.a.l.util.automaton and o.a.l.util.fst packages on
>>>> trunk, and I *think* that this is possible, but I'm so far failing to work
>>>> out how to put the various bits together.
>>>>
>>>> I'm thinking it should work like this:
>>>> 1) expand query term to sorted list of possible matches
>>>> 2) create an FST over those matches
>>>> 3) plug this FST into an AutomatonQuery subclass.
>>>>
>>>> 1) is easy. It's 2) and 3) I'm having trouble with.
>>>>
>>>> All help gratefully received!
>>>>
>>>> Thanks,
>>>>
>>>> Alan Woodward
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org