Re: Autocompletion on large index

Elmer Thu, 07 Jul 2011 07:14:51 -0700

Thanks,
Your replies ended up in my spam box and therefore I missed your
recommendation to use FST. I'll do more testing soon with FST instead of
TST. And I'll surely take a look at that talk!


BR,
Elmer

On Thu, 2011-07-07 at 11:09 +0200, Dawid Weiss wrote:
> Elmer, TST will have a large overhead. FST may not be much better if
> your input has very few shared prefixes or suffixes. In your case I think this is
> unfortunately true. What I would do is create a regular Lucene index and
> store it on disk, then run prefix queries on it. That should work and scale to
> a large number of ops per sec. See the Lucene Revolution 2011 talks; there was a
> talk about using just this instead of a completion module.
> 
> Like Mike said though, it'd be interesting to investigate this on your data.
> On Jul 6, 2011 8:52 PM, "Elmer" <evanchaste...@gmail.com> wrote:
> > I just profiled the application and tst.TernaryTreeNode takes 99.99..% of
> > the memory.
> >
> > I'll test further tomorrow and report on mem usage for runnable smaller
> > indexes.
> > I will email you privately for sharing the index to work with.
> >
> > BR,
> > Elmer
> >
> >
> > -----Original Message-----
> > From: Michael McCandless
> > Sent: Wednesday, July 06, 2011 8:39 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Autocompletion on large index
> >
> > Hmm... so I suspect the fst suggest module must first gather up all
> > titles, then sort them, in RAM, and then build the actual FST. Maybe
> > it's this gather + sort that's taking so much RAM?
> >
> > 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So
> > that shouldn't be it...
> >
> > Is this an accessible corpus? Can I somehow get a copy to play with...?
> >
> > Are you able to [temporarily, once] build the full FST and other
> > suggest impls and compare how much RAM is required for building and
> > then lookups?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote:
> >> Hi Mike,
> >>
> >> That's what I thought when I started indexing it. To be clear, it happens
> >> at build time.
> >> I don't know if memory efficiency is better when building has finished.
> >>
> >> The titles I index are titles from the DBLP computer science bibliography.
> >> They can take up to... say 100 characters.
> >> Examples:
> >> -------
> >> - Auditory stimulus optimization with feedback from fuzzy clustering of
> >> neuronal responses
> >> - Two-objective method for crisp and fuzzy interval comparison in
> >> optimization
> >> - Bound Constrained Smooth Optimization for Solving Variational
> >> Inequalities and Related Problems
> >> - Retrieval of bibliographic records using Apache Lucene
> >> - Digital Library Information Appliances
> >> -------
> >>
> >> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter,
> >> in that order.
> >>
> >> I also tried to do the same for the author names, and this works without
> >> problems. Actually it builds the tree/fsa/... faster from dictionary than
> >> from file (the lookup data file that can be stored and loaded through the
> >> .store and .load methods). But the larger set of publication titles is
> >> currently no-go with 2.5GB of heapspace, only having a main class that
> >> builds the LookUp data.
> >>
> >> BR,
> >> Elmer
> >>
> >>
> >> -----Original Message-----
> >> From: Michael McCandless
> >> Sent: Wednesday, July 06, 2011 6:23 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Autocompletion on large index
> >>
> >> You could try storing your autocomplete index in a RAMDirectory?
> >>
> >> But: I'm surprised you see the FST suggest impl using up so much RAM;
> >> very low memory usage is one of the strengths of the FST approach.
> >> Can you share the text (titles) you are feeding to the suggest module?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:
> >>>
> >>> Hi again.
> >>>
> >>> I have created my own autocompleter based on the spellchecker. This
> >>> works well in a sense that it is able to create an auto completion index
> >>> from my 'publication' index. However, integrated in my web application,
> >>> each keypress asks autocompleter to search the index, which is stored on
> >>> disk (not in mem), just like spellchecker does (except that spellchecker
> >>> is not invoked every keypress).
> >>> With Lucene 3.3.0, auto completion modules are included, which load
> >>> their trees/fsa/... in memory. I'd like to use these modules, but the
> >>> problem is that they use more than 2.5GB, causing heap space exceptions.
> >>> This happens when I try to build a Lookup index (fst, jaspell or tst,
> >>> doesn't matter) from my 'publication' index consisting of 1.3M
> >>> publications. The field I use for autocompletion holds the titles of the
> >>> publications indexed untokenized (but lowercased).
> >>>
> >>> Code:
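> >>> // Build an in-memory TST lookup from the terms of the "title_suggest" field.
> >>> Lookup autoCompleter = new TSTLookup();
> >>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
> >>> LuceneDictionary dict =
> >>>     new LuceneDictionary(IndexReader.open(dir), "title_suggest");
> >>> autoCompleter.build(dict); // the heap space error occurs during this call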
> >>>
> >>> Is it possible to have the autocompletion module work in-memory on
> >>> such a dataset without increasing Java's heap space?
> >>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
> >>> my own autocompleter index is stored on disk using about 300MB.
> >>>
> >>> BR,
> >>> Elmer
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >>>

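Dawid's suggestion above (keep the terms in a regular sorted index and answer
each keypress with a prefix query) can be illustrated without any Lucene
dependency. What follows is only a minimal sketch under stated assumptions: the
class PrefixLookupSketch and its complete() method are hypothetical names, the
terms live in a sorted in-memory array rather than in an on-disk index, and it
is not how Lucene implements its prefix queries. It shows the underlying idea
that a prefix lookup over sorted, lowercased titles is a binary search followed
by a short scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
final class PrefixLookupSketch {
    // Lowercased titles in natural String order.
    private final String[] sorted;

    PrefixLookupSketch(List<String> titles) {
        sorted = new String[titles.size()];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = titles.get(i).toLowerCase(Locale.ROOT);
        }
        Arrays.sort(sorted);
    }

    /** Returns up to max stored titles that start with the given prefix. */
    List<String> complete(String prefix, int max) {
        String p = prefix.toLowerCase(Locale.ROOT);
        int pos = Arrays.binarySearch(sorted, p);
        if (pos < 0) {
            // Not found: convert to the insertion point, i.e. the first entry >= prefix.
            pos = -pos - 1;
        } else {
            // Found: step back over duplicates so the scan starts at the first match.
            while (pos > 0 && sorted[pos - 1].equals(p)) {
                pos--;
            }
        }
        // All entries sharing the prefix are contiguous from this position on.
        List<String> out = new ArrayList<String>();
        while (pos < sorted.length && out.size() < max && sorted[pos].startsWith(p)) {
            out.add(sorted[pos]);
            pos++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample titles taken from the examples earlier in this thread.
        PrefixLookupSketch lookup = new PrefixLookupSketch(Arrays.asList(
            "Retrieval of bibliographic records using Apache Lucene",
            "Digital Library Information Appliances",
            "Two-objective method for crisp and fuzzy interval comparison in optimization"));
        for (String title : lookup.complete("di", 5)) {
            System.out.println(title);
        }
    }
}
```

Each lookup costs a binary search plus at most max comparisons, which is why
this kind of approach can plausibly serve one request per keypress; whether the
on-disk Lucene equivalent is fast enough for 1.3M titles would still need to be
measured.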


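Dawid's point about shared prefixes can be checked on a sample of the data
before building anything. A minimal sketch, assuming a plain character trie as
a stand-in for the prefix sharing that a TST or FST can exploit: it ignores the
suffix sharing an FST also performs, and the names PrefixSharingSketch,
trieNodes and totalChars are hypothetical. It compares the number of trie nodes
with the total number of characters. A ratio close to 1 means almost no
prefixes are shared, so a structure that keeps roughly one node per character
would need about as many nodes as Mike's estimate has characters (1.3 M titles
times 100 chars).

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene.
final class PrefixSharingSketch {
    /** Length of the longest common prefix of a and b. */
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Number of nodes (root excluded) in a plain character trie over the keys. */
    static long trieNodes(String[] keys) {
        String[] sorted = keys.clone();
        Arrays.sort(sorted);
        long nodes = 0;
        String prev = "";
        for (String s : sorted) {
            // In sorted order, a key only adds nodes for the characters
            // it does not share with its immediate predecessor.
            nodes += s.length() - commonPrefix(prev, s);
            prev = s;
        }
        return nodes;
    }

    /** Total number of characters over all keys. */
    static long totalChars(String[] keys) {
        long total = 0;
        for (String s : keys) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Lowercased sample titles taken from the examples earlier in this thread.
        String[] titles = {
            "retrieval of bibliographic records using apache lucene",
            "digital library information appliances",
            "two-objective method for crisp and fuzzy interval comparison in optimization"
        };
        long nodes = trieNodes(titles);
        long chars = totalChars(titles);
        System.out.println("trie nodes: " + nodes + ", characters: " + chars
            + ", ratio: " + ((double) nodes / chars));
    }
}
```

Running this over the real "title_suggest" terms instead of the three sample
titles would give a rough indication of how much prefix sharing the data
offers; actual per-node memory cost depends on the lookup implementation and
the JVM.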