Thanks. Your replies ended up in my spam box, so I missed your recommendation to use FST. I'll do more testing soon with FST instead of TST. And I'll surely take a look at that talk!
BR,
Elmer

On Thu, 2011-07-07 at 11:09 +0200, Dawid Weiss wrote:
> Elmer. TST will have a large overhead. FST may not be that much better if
> your input has very few shared prefixes or suffixes. In your case I think
> this is unfortunately true. What I would do is create a regular Lucene
> index and store it on disk, then run prefix queries on it. That should work
> and scale to a large number of ops per sec. See the Lucene Revolution 2011
> talks - there was a talk about using just this instead of a completion
> module.
>
> Like Mike said though, it'd be interesting to investigate on your data.
> On Jul 6, 2011 8:52 PM, "Elmer" <evanchaste...@gmail.com> wrote:
> > I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> > the memory.
> >
> > I'll test further tomorrow and report on mem usage for runnable smaller
> > indexes.
> > I will email you privately to share the index to work with.
> >
> > BR,
> > Elmer
> >
> >
> > -----Original message-----
> > From: Michael McCandless
> > Sent: Wednesday, July 06, 2011 8:39 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Autocompletion on large index
> >
> > Hmm... so I suspect the fst suggest module must first gather up all
> > titles, then sort them, in RAM, and then build the actual FST. Maybe
> > it's this gather + sort that's taking so much RAM?
> >
> > 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
> > that shouldn't be it...
> >
> > Is this an accessible corpus? Can I somehow get a copy to play with...?
> >
> > Are you able to [temporarily, once] build the full FST and other
> > suggest impls and compare how much RAM is required for building and
> > then lookups?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote:
> >> Hi Mike,
> >>
> >> That's what I thought when I started indexing it. To be clear, it happens
> >> at build time.
> >> I don't know if memory efficiency is better when building has finished.
> >>
> >> The titles I index are titles from the dblp computer science bibliography.
> >> They can take up to... say 100 characters.
> >> Examples:
> >> -------
> >> - Auditory stimulus optimization with feedback from fuzzy clustering of
> >>   neuronal responses
> >> - Two-objective method for crisp and fuzzy interval comparison in
> >>   optimization
> >> - Bound Constrained Smooth Optimization for Solving Variational
> >>   Inequalities and Related Problems
> >> - Retrieval of bibliographic records using Apache Lucene
> >> - Digital Library Information Appliances
> >> -------
> >>
> >> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter, in
> >> that order.
> >>
> >> I also tried to do the same for the author names, and this works without
> >> problems. Actually it builds the tree/fsa/... faster from dictionary than
> >> from file (the lookup data file that can be stored and loaded through the
> >> .store and .load methods). But the larger set of publication titles is
> >> currently a no-go with 2.5GB of heap space, with only a main class that
> >> builds the Lookup data.
> >>
> >> BR,
> >> Elmer
> >>
> >>
> >> -----Original message-----
> >> From: Michael McCandless
> >> Sent: Wednesday, July 06, 2011 6:23 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Autocompletion on large index
> >>
> >> You could try storing your autocomplete index in a RAMDirectory?
> >>
> >> But: I'm surprised you see the FST suggest impl using up so much RAM;
> >> very low memory usage is one of the strengths of the FST approach.
> >> Can you share the text (titles) you are feeding to the suggest module?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:
> >>>
> >>> Hi again.
> >>>
> >>> I have created my own autocompleter based on the spellchecker.
> >>> This works well in the sense that it is able to create an autocompletion
> >>> index from my 'publication' index. However, integrated in my web
> >>> application, each keypress asks the autocompleter to search the index,
> >>> which is stored on disk (not in memory), just like the spellchecker does
> >>> (except that the spellchecker is not invoked on every keypress).
> >>> With Lucene 3.3.0, autocompletion modules are included, which load
> >>> their trees/fsa/... in memory. I'd like to use these modules, but the
> >>> problem is that they use more than 2.5GB, causing heap space exceptions.
> >>> This happens when I try to build a Lookup index (fst, jaspell or tst,
> >>> doesn't matter) from my 'publication' index consisting of 1.3M
> >>> publications. The field I use for autocompletion holds the titles of the
> >>> publications indexed untokenized (but lowercased).
> >>>
> >>> Code:
> >>> Lookup autoCompleter = new TSTLookup();
> >>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
> >>> LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
> >>> autoCompleter.build(dict);
> >>>
> >>> Is it possible to have the autocompletion module work in memory on
> >>> such a dataset without increasing Java's heap space?
> >>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, whereas
> >>> my own autocompleter index is stored on disk using about 300MB.
> >>>
> >>> BR,
> >>> Elmer
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
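
Note on the prefix-query suggestion in Dawid's reply: the idea relies on terms being kept in sorted order, so all completions of a prefix sit next to each other and can be found by seeking to the prefix and scanning forward. The sketch below only illustrates that principle with a plain java.util.TreeSet. It is not Lucene code and not the approach from the talk; the class name PrefixSuggestSketch and its methods are hypothetical, and everything is held in memory here, whereas the suggestion in the thread is an on-disk index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical stand-in for prefix lookup over sorted terms; this is NOT Lucene code.
class PrefixSuggestSketch {

    // Sorted set of lowercased titles, playing the role of a sorted term dictionary.
    private final TreeSet<String> terms = new TreeSet<String>();

    // Mirrors the thread's "untokenized but lowercased" field: the whole title is one term.
    void add(String title) {
        terms.add(title.toLowerCase(Locale.ROOT));
    }

    // Returns up to 'limit' terms that start with the given prefix, in sorted order.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<String>();
        // tailSet(p) starts at the first term >= p; all matches are contiguous from there.
        for (String term : terms.tailSet(p)) {
            if (!term.startsWith(p) || out.size() >= limit) {
                break; // past the last match, or enough suggestions collected
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggestSketch sketch = new PrefixSuggestSketch();
        // Example titles taken from the thread.
        sketch.add("Retrieval of bibliographic records using Apache Lucene");
        sketch.add("Digital Library Information Appliances");
        System.out.println(sketch.suggest("dig", 5));
    }
}
```

With the two titles from the thread loaded, a lookup for "dig" returns only the lowercased "Digital Library Information Appliances" entry. The cost per keypress is one seek plus a short scan, which is why a sorted on-disk term dictionary can serve this without holding a tree of every title in the heap.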