On Tuesday, May 24, 2016, Michael McCandless <luc...@mikemccandless.com <javascript:_e(%7B%7D,'cvml','luc...@mikemccandless.com');>> wrote:
> This sounds ambitious ;) That's why I was hoping for some advice from y'all :) > > Note that Lucene has Levenshtein automata (not FSTs). Right, but I thought it might not be too big a leap. > > Can't you just run an AutomatonQuery on your index once you have unioned > all Levenshtein automata into a single one? > > Or is the reason that you want to convert to an FST so you can associate > which unedited form(s) a given document matched? Correct, I want the unedited form(s) as well as the match character offsets of each match in each document. > > Mike McCandless > > http://blog.mikemccandless.com > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lne...@gmail.com> wrote: > > > Hello, all - > > > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA > edit > > distance up to 2) search for many (10k+) strings (knowledge base: kb) in > > many large strings (docs). > > > > Approach I was thinking of: create Levenshtein FST with all paths > > associated with unedited form for each kb key, union all into single fst, > > search docs for matches in fst in style of SynonymFilter. > > > > * I created 10k Levenshtein automata from kb keys and unioned them, so > > that seems tractable (took 1 minute, ~250MB ram) > > * SynonymFilter code worked fine to associate output and record match > token > > length. > > * Saw how FuzzySuggester created Levenshtein automata from query/lookup > key > > and intersected that with kb-like fst. > > > > I don't see how to create Levenshtein FSTs (vs automatons) associating > > outputs with unedited form, and union'ing them together. > > > > Is this a bad idea? Maybe better idea? > > > > Thanks in advance, > > - Luke > > >