Elmer: TST will have a large overhead. FST may not be much better if your input has very few shared prefixes or suffixes, and in your case I think this is unfortunately true. What I would do is create a regular Lucene index and store it on disk, then run prefix queries on it. That should work and scale to a large number of ops per second. See the Lucene Revolution 2011 talks - there was a talk about using just this instead of a completion module.
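The TST overhead is easy to see with a toy example. Below is a minimal, self-contained sketch (not Lucene's actual tst.TernaryTreeNode, just an illustrative ternary search tree): every stored character becomes a node carrying three child references, so on a 64-bit JVM each node costs roughly 40 bytes of object header and pointers, and titles with little shared prefix create nearly one node per character.

```java
import java.util.Arrays;
import java.util.List;

public class TstOverhead {
    // One node per distinct character position: three child references
    // plus a char and a flag, so ~40 bytes each on a 64-bit JVM.
    static final class Node {
        char splitChar;
        Node lo, eq, hi;
        boolean wordEnd;
    }

    static int nodeCount;

    static Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) {
            n = new Node();
            n.splitChar = c;
            nodeCount++;
        }
        if (c < n.splitChar) {
            n.lo = insert(n.lo, s, i);
        } else if (c > n.splitChar) {
            n.hi = insert(n.hi, s, i);
        } else if (i < s.length() - 1) {
            n.eq = insert(n.eq, s, i + 1);
        } else {
            n.wordEnd = true;
        }
        return n;
    }

    public static int countNodes(List<String> titles) {
        nodeCount = 0;
        Node root = null;
        for (String t : titles) {
            root = insert(root, t, 0);
        }
        return nodeCount;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "retrieval of bibliographic records using apache lucene",
            "digital library information appliances");
        int nodes = countNodes(titles);
        // With little prefix sharing, node count approaches total characters,
        // so 1.3M titles * ~100 chars can easily reach gigabytes of heap.
        System.out.println(nodes + " nodes, roughly " + (nodes * 40L) + " bytes");
    }
}
```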
Like Mike said though, it'd be interesting to investigate on your data.

On Jul 6, 2011 8:52 PM, "Elmer" <evanchaste...@gmail.com> wrote:
> I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> the memory.
>
> I'll test further tomorrow and report on memory usage for runnable smaller
> indexes.
> I will email you privately about sharing the index to work with.
>
> BR,
> Elmer
>
>
> -----Original Message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 8:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
>
> Hmm... so I suspect the FST suggest module must first gather up all
> titles, then sort them, in RAM, and then build the actual FST. Maybe
> it's this gather + sort that's taking so much RAM?
>
> 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
> that shouldn't be it...
>
> Is this an accessible corpus? Can I somehow get a copy to play with...?
>
> Are you able to [temporarily, once] build the full FST and other
> suggest impls and compare how much RAM is required for building and
> then lookups?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote:
>> Hi Mike,
>>
>> That's what I thought when I started indexing it. To be clear, it happens
>> at build time.
>> I don't know if memory efficiency is better once building has finished.
>>
>> The titles I index are titles from the DBLP computer science bibliography.
>> They can take up to... say 100 characters.
>> Examples:
>> -------
>> - Auditory stimulus optimization with feedback from fuzzy clustering of
>>   neuronal responses
>> - Two-objective method for crisp and fuzzy interval comparison in
>>   optimization
>> - Bound Constrained Smooth Optimization for Solving Variational
>>   Inequalities and Related Problems
>> - Retrieval of bibliographic records using Apache Lucene
>> - Digital Library Information Appliances
>> -------
>>
>> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
>> in that order.
>>
>> I also tried to do the same for the author names, and that works without
>> problems. Actually it builds the tree/FSA/... faster from the dictionary
>> than from file (the lookup data file that can be stored and loaded
>> through the .store and .load methods). But the larger set of publication
>> titles is currently a no-go with 2.5GB of heap space, even with only a
>> main class that builds the Lookup data.
>>
>> BR,
>> Elmer
>>
>>
>> -----Original Message-----
>> From: Michael McCandless
>> Sent: Wednesday, July 06, 2011 6:23 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Autocompletion on large index
>>
>> You could try storing your autocomplete index in a RAMDirectory?
>>
>> But: I'm surprised you see the FST suggest impl using up so much RAM;
>> very low memory usage is one of the strengths of the FST approach.
>> Can you share the text (titles) you are feeding to the suggest module?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:
>>>
>>> Hi again.
>>>
>>> I have created my own autocompleter based on the spellchecker. This
>>> works well in the sense that it is able to create an autocompletion
>>> index from my 'publication' index.
>>> However, integrated in my web application, each keypress asks the
>>> autocompleter to search the index, which is stored on disk (not in
>>> memory), just like the spellchecker does (except that the spellchecker
>>> is not invoked on every keypress).
>>> With Lucene 3.3.0, autocompletion modules are included, which load
>>> their trees/FSAs/... into memory. I'd like to use these modules, but
>>> the problem is that they use more than 2.5GB, causing heap space
>>> exceptions. This happens when I try to build a Lookup index (FST,
>>> JaSpell or TST, it doesn't matter) from my 'publication' index
>>> consisting of 1.3M publications. The field I use for autocompletion
>>> holds the titles of the publications, indexed untokenized (but
>>> lowercased).
>>>
>>> Code:
>>> Lookup autoCompleter = new TSTLookup();
>>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>>> LuceneDictionary dict = new
>>>     LuceneDictionary(IndexReader.open(dir), "title_suggest");
>>> autoCompleter.build(dict);
>>>
>>> Is it possible to have the autocompletion module work in-memory on
>>> such a dataset without increasing Java's heap space?
>>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM,
>>> whereas my own autocompleter index is stored on disk using about 300MB.
>>>
>>> BR,
>>> Elmer
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
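Mike's back-of-the-envelope estimate earlier in the thread (1.3 M publications times 100 chars times 2 bytes/char = ~248 MB) can be reproduced with a tiny standalone check. This sketch assumes Java's 2-bytes-per-char UTF-16 String storage and MiB (1024 x 1024) units, and deliberately ignores per-object overhead, which is exactly what the raw estimate leaves out:

```java
public class RamEstimate {
    // Raw character storage only: titles * avg chars * 2 bytes per
    // UTF-16 char. Per-String and per-node object overhead is excluded,
    // which is why the real heap usage can be an order of magnitude higher.
    public static long estimateBytes(long titles, long avgCharsPerTitle) {
        return titles * avgCharsPerTitle * 2;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(1300000L, 100L);
        // 260,000,000 bytes / 1024^2 = ~248 MiB, matching Mike's figure.
        System.out.printf("%.0f MiB%n", bytes / (1024.0 * 1024.0));
    }
}
```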