Point taken, but I wonder if there's an algorithmic shortcut to determinize the union of Levenshtein DFAs...
Here's some stats on unioning n Levenshtein automata (2-edit, with transpositions, of names) together (not attempting to determinize them): * n=1000 numStates: 337,553 numTransitions: 1,577,237 (Xmx1g OK) * n=5000 numStates: 1,620,944 numTransitions: 7,669,610 (Xmx1g fails, Xmx2g OK) * n=10000 numStates: 3,248,802 numTransitions: 15,377,814 (Xmx2g fails, Xmx3g OK) On Thu, May 26, 2016 at 12:13 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > But how many states does the not-yet-determinized union of 5000+ > Levenshtein automata contain? > > Mike McCandless > > http://blog.mikemccandless.com > > On Thu, May 26, 2016 at 12:08 PM, Luke Nezda <lne...@gmail.com> wrote: > > > I should note, I know in I can > > call Operations.determinize(union, 10_000_000) but union of 5000+ > > Levenshtein automata seems to require too many states to be tractable, > and > > that's on the low end of what I'd like to work with. > > > > On Thu, May 26, 2016 at 9:59 AM, Luke Nezda <lne...@gmail.com> wrote: > > > > > I was surprised union of deterministic automata wasn't deterministic > too, > > > but it would appear so: > > > > > > public void testUnionOfDeterministicsIsDeterministic() { > > > Automaton mud = new LevenshteinAutomata("mud", true).toAutomaton(1); > > > Automaton mad = new LevenshteinAutomata("mad", true).toAutomaton(1); > > > assertTrue(mud.isDeterministic()); > > > assertTrue(mad.isDeterministic()); > > > > > > Automaton union = Operations.union(mud, mad); > > > assertTrue(union.isDeterministic()); // fails > > > } > > > > > > > > > Maybe I have a bad embedded assumption? Or maybe there's an interesting > > > opportunity to make union operation try harder to make a deterministic > > > result? > > > > > > On Wed, May 25, 2016 at 3:07 PM, Michael McCandless < > > > luc...@mikemccandless.com> wrote: > > > > > >> Hmm it's curious you found determinization to be so costly since the > > >> Levenshtein automata are already determinized. > > >> > > >> But, yeah, converting THAT to an FST is tricky... > > >> > > >> Mike McCandless > > >> > > >> http://blog.mikemccandless.com > > >> > > >> On Wed, May 25, 2016 at 2:46 PM, Luke Nezda <lne...@gmail.com> wrote: > > >> > > >> > Oof, sounds too tricky for me to justify pursuing right now. While > > >> > union'ing 10k Levenshtein automata was tractable, seems > determinizing > > >> the > > >> > result is not (NP-hard - oops :)), let alone working out a suitably > > >> useful > > >> > conversion to an FST. > > >> > > > >> > Thank you very much for input! > > >> > > > >> > Kind regards, > > >> > - Luke > > >> > > > >> > On Tue, May 24, 2016 at 6:19 PM, Michael McCandless < > > >> > luc...@mikemccandless.com> wrote: > > >> > > > >> > > In theory if you could construct an FST from your union'd > > Levenshtein > > >> > > automata in the right format for SynonymFilter then you could run > it > > >> > using > > >> > > that FST, and it would give you the start/end tokens for each > match, > > >> and > > >> > > you'd have the output for each match be the original (unedited) > > terms. > > >> > > > > >> > > But I think information was already lost when you did the initial > > >> union, > > >> > > i.e. which original term to output, on which arc (s) in the > > automaton. > > >> > > > > >> > > Also you'd have to at least determinize (and maybe you want to > > >> minimize) > > >> > > the union'd automata since FST cannot represent non-deterministic > > >> > machines. > > >> > > > > >> > > Possibly you could determinize, reconstruct the lost information, > by > > >> > > walking the automaton for each original word, intersecting the > > >> > Levenshtein > > >> > > automaton for that word, and recording the first arcs you hit that > > has > > >> > one > > >> > > unique original word as its output, and placing outputs on those > > arcs, > > >> > and > > >> > > then doing a "rote" conversion to the syn filter's FST format. > This > > >> part > > >> > > sounds tricky :) > > >> > > > > >> > > Mike McCandless > > >> > > > > >> > > http://blog.mikemccandless.com > > >> > > > > >> > > On Tue, May 24, 2016 at 9:07 AM, Luke Nezda <lne...@gmail.com> > > wrote: > > >> > > > > >> > > > On Tuesday, May 24, 2016, Michael McCandless < > > >> > luc...@mikemccandless.com > > >> > > > <javascript:_e(%7B%7D,'cvml','luc...@mikemccandless.com');>> > > wrote: > > >> > > > > > >> > > > > This sounds ambitious ;) > > >> > > > > > >> > > > > > >> > > > That's why I was hoping for some advice from y'all :) > > >> > > > > > >> > > > > > > >> > > > > Note that Lucene has Levenshtein automata (not FSTs). > > >> > > > > > >> > > > > > >> > > > Right, but I thought it might not be too big a leap. > > >> > > > > > >> > > > > > > >> > > > > Can't you just run an AutomatonQuery on your index once you > have > > >> > > unioned > > >> > > > > all Levenshtein automata into a single one? > > >> > > > > > > >> > > > > Or is the reason that you want to convert to an FST so you can > > >> > > associate > > >> > > > > which unedited form(s) a given document matched? > > >> > > > > > >> > > > > > >> > > > Correct, I want the unedited form(s) as well as the match > > character > > >> > > offsets > > >> > > > of each match in each document. > > >> > > > > > >> > > > > > > >> > > > > Mike McCandless > > >> > > > > > > >> > > > > http://blog.mikemccandless.com > > >> > > > > > > >> > > > > On Mon, May 23, 2016 at 8:59 PM, Luke Nezda <lne...@gmail.com > > > > >> > wrote: > > >> > > > > > > >> > > > > > Hello, all - > > >> > > > > > > > >> > > > > > I'd like to use Lucene's automaton/FST code to achieve fast > > >> fuzzy > > >> > > (OSA > > >> > > > > edit > > >> > > > > > distance up to 2) search for many (10k+) strings (knowledge > > >> base: > > >> > kb) > > >> > > > in > > >> > > > > > many large strings (docs). > > >> > > > > > > > >> > > > > > Approach I was thinking of: create Levenshtein FST with all > > >> paths > > >> > > > > > associated with unedited form for each kb key, union all > into > > >> > single > > >> > > > fst, > > >> > > > > > search docs for matches in fst in style of SynonymFilter. > > >> > > > > > > > >> > > > > > * I created 10k Levenshtein automata from kb keys and > unioned > > >> > them, > > >> > > so > > >> > > > > > that seems tractable (took 1 minute, ~250MB ram) > > >> > > > > > * SynonymFilter code worked fine to associate output and > > record > > >> > match > > >> > > > > token > > >> > > > > > length. > > >> > > > > > * Saw how FuzzySuggester created Levenshtein automata from > > >> > > query/lookup > > >> > > > > key > > >> > > > > > and intersected that with kb-like fst. > > >> > > > > > > > >> > > > > > I don't see how to create Levenshtein FSTs (vs automatons) > > >> > > associating > > >> > > > > > outputs with unedited form, and union'ing them together. > > >> > > > > > > > >> > > > > > Is this a bad idea? Maybe better idea? > > >> > > > > > > > >> > > > > > Thanks in advance, > > >> > > > > > - Luke > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >