Oof, sounds too tricky for me to justify pursuing right now. While union'ing 10k Levenshtein automata was tractable, it seems determinizing the result is not (worst-case exponential state blowup - oops :)), let alone working out a suitably useful conversion to an FST.
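(For the curious: the unioned automaton is still usable without determinizing it, by simulating the NFA directly and tagging each state with its kb key, which also preserves which unedited form matched. Below is a toy sketch in plain Python, not Lucene; `fuzzy_matches`, `MAX_EDITS`, and the `(i, e)` state encoding are made-up illustrations of the idea, not Lucene APIs.)

```python
# Toy simulation of a union of Levenshtein NFAs (edit distance <= 2).
# The NFA state for one kb key `word` is (i, e): i characters of `word`
# consumed so far, e edits spent so far.

MAX_EDITS = 2

def _closure(word, states):
    # Epsilon moves: a deletion skips one character of the kb key
    # at the cost of one edit.
    stack, out = list(states), set(states)
    while stack:
        i, e = stack.pop()
        if i < len(word) and e < MAX_EDITS and (i + 1, e + 1) not in out:
            out.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return out

def _step(word, states, c):
    # Advance all live states of one key's NFA on input character c.
    new = set()
    for i, e in states:
        if i < len(word) and word[i] == c:
            new.add((i + 1, e))              # exact character match
        if e < MAX_EDITS:
            new.add((i, e + 1))              # insertion of c
            if i < len(word):
                new.add((i + 1, e + 1))      # substitution
    return _closure(word, new)

def fuzzy_matches(keys, query):
    """Which unedited kb keys accept `query` within MAX_EDITS edits?
    The 'union' is simulated as one live state set per key."""
    live = {k: _closure(k, {(0, 0)}) for k in keys}
    for c in query:
        live = {k: _step(k, s, c) for k, s in live.items() if s}
    return sorted(k for k, s in live.items() if any(i == len(k) for i, _ in s))
```

The per-key tag is exactly the information a determinized, key-agnostic union loses, which is what Mike's "reconstruct the lost information" step below is about.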
Thank you very much for the input!

Kind regards,
- Luke

On Tue, May 24, 2016 at 6:19 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> In theory, if you could construct an FST from your union'd Levenshtein
> automata in the right format for SynonymFilter, then you could run it using
> that FST, and it would give you the start/end tokens for each match, and
> you'd have the output for each match be the original (unedited) terms.
>
> But I think information was already lost when you did the initial union,
> i.e. which original term to output on which arc(s) in the automaton.
>
> Also, you'd have to at least determinize (and maybe you'd want to minimize)
> the union'd automata, since an FST cannot represent a non-deterministic
> machine.
>
> Possibly you could determinize, then reconstruct the lost information by
> walking the automaton for each original word, intersecting the Levenshtein
> automaton for that word, recording the first arcs you hit that have one
> unique original word as output, placing outputs on those arcs, and then
> doing a "rote" conversion to the syn filter's FST format. This part sounds
> tricky :)
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lne...@gmail.com> wrote:
>
> > On Tuesday, May 24, 2016, Michael McCandless <luc...@mikemccandless.com> wrote:
> >
> > > This sounds ambitious ;)
> >
> > That's why I was hoping for some advice from y'all :)
> >
> > > Note that Lucene has Levenshtein automata (not FSTs).
> >
> > Right, but I thought it might not be too big a leap.
> >
> > > Can't you just run an AutomatonQuery on your index once you have
> > > unioned all Levenshtein automata into a single one?
> > >
> > > Or is the reason that you want to convert to an FST so you can
> > > associate which unedited form(s) a given document matched?
> > Correct, I want the unedited form(s) as well as the match character
> > offsets of each match in each document.
> >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lne...@gmail.com> wrote:
> > >
> > > > Hello, all -
> > > >
> > > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy
> > > > (OSA edit distance up to 2) search for many (10k+) strings
> > > > (knowledge base: kb) in many large strings (docs).
> > > >
> > > > The approach I was thinking of: create a Levenshtein FST with all
> > > > paths associated with the unedited form for each kb key, union all
> > > > into a single FST, then search docs for matches in that FST in the
> > > > style of SynonymFilter.
> > > >
> > > > * I created 10k Levenshtein automata from kb keys and unioned them,
> > > > so that seems tractable (took 1 minute, ~250MB RAM)
> > > > * SynonymFilter code worked fine to associate output and record
> > > > match token length.
> > > > * Saw how FuzzySuggester created Levenshtein automata from a
> > > > query/lookup key and intersected that with a kb-like FST.
> > > >
> > > > I don't see how to create Levenshtein FSTs (vs. automatons)
> > > > associating outputs with the unedited form, and union them together.
> > > >
> > > > Is this a bad idea? Maybe there's a better idea?
> > > >
> > > > Thanks in advance,
> > > > - Luke
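(Editor's footnote: the goal stated in the original post quoted above, the unedited kb form plus character offsets of each match at OSA edit distance up to 2, can be sketched as a brute-force per-token scan. This is a naive baseline in plain Python, not the automaton/FST approach under discussion; `osa_distance` and `fuzzy_scan` are hypothetical names.)

```python
import re

def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent
    transpositions, computed with the standard dynamic program."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def fuzzy_scan(doc, kb_keys, max_edits=2):
    """Return (unedited kb key, matched token, start offset, end offset)
    for every doc token within max_edits OSA distance of some kb key."""
    matches = []
    for m in re.finditer(r"\w+", doc):
        token = m.group()
        for key in kb_keys:
            if abs(len(key) - len(token)) > max_edits:
                continue  # cheap length filter before the O(n*m) DP
            if osa_distance(key, token) <= max_edits:
                matches.append((key, token, m.start(), m.end()))
    return matches
```

This costs O(tokens x keys) distance computations per document, which is exactly what the automaton approach in the thread is meant to beat, but it is handy as an oracle when testing a faster implementation.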