Hello, all - I'd like to use Lucene's automaton/FST code to get fast fuzzy search (OSA edit distance up to 2) for many (10k+) strings (a knowledge base: "kb") within many large strings (docs).
The approach I was thinking of: build a Levenshtein FST for each kb key, with all paths outputting the unedited form; union them all into a single FST; then scan docs for matches against that FST, in the style of SynonymFilter.

What I've tried so far:

* I created 10k Levenshtein automata from the kb keys and unioned them, so that part seems tractable (took about 1 minute, ~250MB RAM).
* The SynonymFilter code worked fine for associating outputs and recording the matched token length.
* I saw how FuzzySuggester creates a Levenshtein automaton from the query/lookup key and intersects it with a kb-like FST.

What I don't see is how to create Levenshtein FSTs (as opposed to automata) that associate outputs with the unedited form, and then union them together. Is this a bad idea? Is there a better one?

Thanks in advance,
- Luke
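For concreteness, here is the edit metric I mean by "OSA edit distance": a plain-Java sketch of restricted Damerau-Levenshtein (optimal string alignment) distance, where insertions, deletions, substitutions, and transpositions of adjacent characters each cost 1. This is just an illustration of the metric the kb matching would threshold at 2; it is not Lucene code (Lucene encodes this metric as an automaton when transpositions are enabled), and the class/method names are my own.

```java
public class OsaDistance {
    // OSA edit distance between a and b via dynamic programming.
    // Unlike full Damerau-Levenshtein, a substring may not be edited twice,
    // which is what makes the automaton construction tractable.
    public static int osa(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution / match
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    // transposition of two adjacent characters, cost 1
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }
}
```

A kb key would then match a doc token whenever `osa(key, token) <= 2`; the automaton approach computes the same acceptance without pairwise DP.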