That sounds like a fun amount of terms! Note that Lucene does not load all terms into memory; it loads only the "prefix trie", stored as an FST (http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html), which maps term prefixes to on-disk blocks of terms. FSTs are very compact data structures, effectively implementing SortedMap<String,T>, so it's surprising you need 65G of heap for the FSTs.
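If you're curious how that terms index behaves, here's a small, untested sketch using Lucene's low-level FST APIs (org.apache.lucene.util.fst). The class name, terms, and output values are made up for illustration, and I'm assuming the 5.x/6.x Builder API:

  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.IntsRefBuilder;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.PositiveIntOutputs;
  import org.apache.lucene.util.fst.Util;

  public class FstDemo {
    public static void main(String[] args) throws Exception {
      // Outputs are the values the FST maps each input to; in the terms
      // index they are (roughly) file pointers to on-disk blocks of terms.
      PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
      Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
      IntsRefBuilder scratch = new IntsRefBuilder();

      // Inputs must be added in sorted order (illustrative values):
      builder.add(Util.toIntsRef(new BytesRef("foo"), scratch), 17L);
      builder.add(Util.toIntsRef(new BytesRef("foobar"), scratch), 42L);
      FST<Long> fst = builder.finish();

      // Lookup behaves like SortedMap.get:
      System.out.println(Util.get(fst, new BytesRef("foobar")));  // 42
    }
  }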
Anyway, with BlockTreeTermsWriter/Reader, the equivalent of the old termInfosIndexDivisor is to change the allowed on-disk block size (the default is 25 to 48 terms per block) to something larger. To do this, make your own subclass of FilterCodec, passing the current default codec to wrap, and override the postingsFormat method to return a new Lucene50PostingsFormat(...) with a larger min and max block size; there's a rough sketch below the quoted message. This applies at indexing time, so you need to reindex to see your FSTs get smaller.

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld <tomhirschf...@gmail.com> wrote:

> Hey!
>
> I am working on a Lucene-based service for reverse geocoding. We have a
> large index with lots of unique terms (550 million), and it appears that
> we're running into memory issues on our leaf servers because the term
> dictionary for the entire index is being loaded into heap space. If we
> allocate > 65g of heap space, our queries return relatively quickly (10s
> to 100s of ms), but if we drop below ~65g of heap on the leaf nodes,
> query time degrades dramatically, quickly hitting 20+ seconds (our test
> harness cuts off at 20s).
>
> I did some research and found that in past versions of Lucene one could
> split the loading of the terms dictionary using the 'termInfosIndexDivisor'
> option in the DirectoryReader class. That option was deprecated in Lucene
> 5.0.0
> <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html>
> in favor of using codecs to achieve similar functionality. Looking at the
> available experimental codecs, I see the BlockTreeTermsWriter
> <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,%20org.apache.lucene.codecs.PostingsWriterBase,%20int,%20int)>
> which seems like it could be used for a similar purpose, breaking down the
> term dictionary so that we don't load the whole thing into heap space.
>
> Has anyone run into this problem before and found an effective solution?
> Does changing the codec seem appropriate for this issue? If so, how do I
> go about loading an alternative codec and configuring it to my needs? I'm
> having trouble finding docs/examples of how this is used in the real
> world, so even if you point me to a repo or docs somewhere I'd appreciate
> it.
>
> Thanks!
>
> Best,
> Tom Hirschfeld
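Here's a rough, untested sketch of what that codec subclass could look like. The class name (BigBlockCodec) and the 128/512 block sizes are just placeholders, and I'm assuming a 6.x Lucene where Lucene50PostingsFormat has the (minTermBlockSize, maxTermBlockSize) constructor:

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.FilterCodec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

  // Wraps the current default codec but writes larger term blocks, so the
  // in-heap terms index (the FST) has fewer entries to point at.
  public class BigBlockCodec extends FilterCodec {

    // Placeholder sizes; the defaults are 25 (min) and 48 (max) terms per block.
    private final PostingsFormat postings = new Lucene50PostingsFormat(128, 512);

    public BigBlockCodec() {
      super("BigBlockCodec", Codec.getDefault());
    }

    @Override
    public PostingsFormat postingsFormat() {
      return postings;
    }
  }

You'd set it at index time with IndexWriterConfig.setCodec(new BigBlockCodec()), and I believe you also need to register the codec name via Java's SPI (a META-INF/services/org.apache.lucene.codecs.Codec file listing the class) so readers can resolve it when they open the index.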