I got it working by modifying TSTAutocomplete to have a limit on the prefix length. :) The depth of the tree will not go deeper than this prefix length.
When set to 20 chars, total mem usage is ~520MB, of which 48,9% is for the TernaryTreeNode objects. Building took 7 seconds, reading from external HDD. I created a zip, with JAR and sourcecode, available here: http://www.computer-tuning.nl/lucene/TSTLookupWithPrefixLimit.zip You still need the spellchecker for dependencies. BR, Elmer On Wed, 2011-07-06 at 20:52 +0200, Elmer wrote: > I just profiled the application and tst.TernaryTreeNode takes 99.99..% of > the memory. > > I'll test further tomorrow and report on mem usage for runnable smaller > indexes. > I will email you privately for sharing the index to work with. > > BR, > Elmer > > > -----Oorspronkelijk bericht----- > From: Michael McCandless > Sent: Wednesday, July 06, 2011 8:39 PM > To: java-user@lucene.apache.org > Subject: Re: Autocompletion on large index > > Hmm... so I suspect the fst suggest module must first gather up all > titles, then sort them, in RAM, and then build the actual FST. Maybe > it's this gather + sort that's taking so much RAM? > > 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So > that shouldn't be it... > > Is this a an accessible corpus? Can I somehow get a copy to play with...? > > Are you able to [temporarily, once] build the full FST and other > suggest impls and compare how much RAM is required for building and > then lookups? > > Mike McCandless > > http://blog.mikemccandless.com > > On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote: > > Hi Mike, > > > > That's what I thought when I started indexing it. To be clear, it happens > > on > > build time. > > I don't know if memory efficiency is better when building has finished. > > > > The titles I index are titles from the dblp computer sience bibliography. > > They can take up to... say 100 characters. > > Examples: > > ------- > > - Auditory stimulus optimization with feedback from fuzzy clustering of > > neuronal responses > > - Two-objective method for crisp and fuzzy interval comparison in > > optimization > > - Bound Constrained Smooth Optimization for Solving Variational > > Inequalities > > and Related Problems > > - Retrieval of bibliographic records using Apache Lucene > > - Digital Library Information Appliances > > ------- > > > > The "title_suggest" field uses the KeyWordTokenizer and LowerCaseFilter in > > that order. > > > > I also tried to do the same for the author names, and this works without > > problems. Actually it builds the tree/fsa/... faster from dictionary than > > from file (the lookup data file that can be stored and loaded through the > > .store and .load methods). But the larger set of publication titles is > > currently no-go with 2.5GB of heapspace, only having a main class that > > builds the LookUp data. > > > > BR, > > Elmer > > > > > > -----Oorspronkelijk bericht----- From: Michael McCandless > > Sent: Wednesday, July 06, 2011 6:23 PM > > To: java-user@lucene.apache.org > > Subject: Re: Autocompletion on large index > > > > You could try storing your autocomplete index in a RAMDirectory? > > > > But: I'm surprised you see the FST suggest impl using up so much RAM; > > very low memory usage is one of the strengths of the FST approach. > > Can you share the text (titles) you are feeding to the suggest module? > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote: > >> > >> Hi again. > >> > >> I have created my own autocompleter based on the spellchecker. This > >> works well in a sense that it is able to create an auto completion index > >> from my 'publication' index. However, integrated in my web application, > >> each keypress asks autocompleter to search the index, which is stored on > >> disk (not in mem), just like spellchecker does (except that spellchecker > >> is not invoked every keypress). > >> With Lucene 3.3.0, auto completion modules are included, which load > >> their trees/fsa/... in memory. I'd like to use these modules, but the > >> problem is that they use more than 2.5GB, causing heap space exceptions. > >> This happens when I try to build a LookUp index (fst,jaspell or tst, > >> doesn't matter) from my 'publication' index consisting of 1.3M > >> publications. The field I use for autocompletion holds the titles of the > >> publications indexed untokenized (but lowercased). > >> > >> Code: > >> Lookup autoCompleter = new TSTLookup(); > >> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX")); > >> LuceneDictionary dict = new > >> LuceneDictionary(IndexReader.open(dir),"title_suggest"); > >> autoCompleter.build(dict); > >> > >> Is it possible to have the autocompletion module to work in-memory on > >> such a dataset without increasing java's heapspace? > >> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where > >> my own autocompleter index is stored on disk using about 300MB. > >> > >> BR, > >> Elmer > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >> > >> > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org