Wow, 2G heap, that's horrible! How much heap does the automaton itself take?
You can use the automaton's step method to transition from a state given the next input character to another state (or -1 if that state doesn't accept that character); it will be slower than the 2 GB run automaton, but perhaps fast enough? Mike McCandless http://blog.mikemccandless.com On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mann...@gmail.com> wrote: > Thanks Mike for getting back to me, sounds like I'm on the right track. > > I'm building the automaton from around 1.7million strings, and it ends up > with about 3.8million states and it turns out building a > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite > suprised!), with negligible performance difference at run time. At your > suggestion I tried the ByteRunAutomaton and it was similar to the > CharacterRunAutomaton > in terms of heap and run time. So for now I'm going to just stick to an > Automaton. > > On 14 February 2017 at 00:41, Michael McCandless <luc...@mikemccandless.com> > wrote: > >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mann...@gmail.com> >> wrote: >> >> > I'd like to construct an Automaton to prefix match against a large set of >> > strings. I gather a RunAutomation is immutable, thread safe and faster >> than >> > Automaton. >> >> That's correct. >> >> > Are there any other differences between the three Automaton >> > classes, for example, in memory usage? >> >> CompiledAutomaton is just a thin wrapper to hold onto both the >> original Automaton and the RunAutomaton, plus some other corner-casey >> things that are likely not interesting for your usage. >> >> > Would the general approach for such a problem be to use >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list >> of >> > my string set, set all the states to accept (to enable prefix matching), >> > then pass the Automaton into the constructor of a CharacterRunAutomaton, >> > and use the run method on the CharacterRunAutomaton to match any queries? >> >> That sounds right. >> >> You could also try doing everything in UTF8 space instead, and use >> ByteRunAutomaton: it will likely be faster since it will do faster >> lookups on each transition. It should still be safe to set all states >> as accept, even though some of those states will be inside a single >> Unicode character, as long as the strings you test against are whole >> UTF-8 sequences? >> >> > As it seems like I'm building up the Automaton at least three times, and >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is >> this >> > the most memory efficient way of building such an Automaton? >> >> Yeah, it is. The RunAutomaton will likely require substantial heap in >> your case, probably more than the original automaton. >> >> I suppose you don't actually need to keep the Automaton around once >> the RunAutomaton is built, but Lucene doesn't make this possible >> today, since the RunAutomaton holds onto the Automaton... >> >> > Thanks in advance, >> >> You're welcome! >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org