This sounds ambitious ;) Note that Lucene has Levenshtein automata (not FSTs).
Can't you just run an AutomatonQuery on your index once you have unioned all Levenshtein automata into a single one? Or is the reason that you want to convert to an FST so you can associate which unedited form(s) a given document matched? Mike McCandless http://blog.mikemccandless.com On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lne...@gmail.com> wrote: > Hello, all - > > I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA edit > distance up to 2) search for many (10k+) strings (knowledge base: kb) in > many large strings (docs). > > Approach I was thinking of: create Levenshtein FST with all paths > associated with unedited form for each kb key, union all into single fst, > search docs for matches in fst in style of SynonymFilter. > > * I created 10k Levenshtein automata from kb keys and unioned them, so > that seems tractable (took 1 minute, ~250MB ram) > * SynonymFilter code worked fine to associate output and record match token > length. > * Saw how FuzzySuggester created Levenshtein automata from query/lookup key > and intersected that with kb-like fst. > > I don't see how to create Levenshtein FSTs (vs automatons) associating > outputs with unedited form, and union'ing them together. > > Is this a bad idea? Maybe better idea? > > Thanks in advance, > - Luke >