Hmm, the screenshot didn't make it... can you post a link? If you are using an
NRT reader, then when a new one is opened, it won't open new SegmentReaders
for all segments, just for the segments flushed/merged since the last reader
was opened. So for the N commit points you have readers open for, those
readers will share SegmentReaders for the segments they have in common.
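Roughly, the reopen path looks like this (a minimal sketch against the 4.x
NRT APIs, assuming you already have an IndexWriter named writer):

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;

  // First NRT reader; its SegmentReaders are pooled by the writer.
  DirectoryReader reader = DirectoryReader.open(writer, true);

  // Later: reopen. Only newly flushed/merged segments get fresh
  // SegmentReaders; unchanged segments are shared with the old reader.
  DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
  if (newReader != null) {
    reader.close();
    reader = newReader;
  }

ReaderManager does essentially this for you and handles the ref-counting.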
How many unique fields are you adding?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
> Mike,
>
> Here's the screenshot; not sure if it will go through as an attachment
> though - if not, I'll post it as a link. Please ignore the altered package
> names, since Lucene is shaded in as part of our build process.
>
> Some more context about the use case. Yes, the terms are pretty much
> unique; the schema for the data set is actually borrowed from here:
> https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> set, with a couple of other fields added by us. The values for the fields
> are generated almost randomly, though some string fields are picked at
> random from a fixed dictionary.
>
> Also, this type of heap footprint might be tolerable if it stayed
> relatively constant throughout the system's life cycle (given, of course,
> that the index set stays more or less static). However, what happens here
> is that one IndexReader reference is maintained by ReaderManager as an NRT
> reader. But we would also like to support the ability to execute searches
> against specific index commit points, ideally in parallel. As you might
> imagine, as soon as a new DirectoryReader is opened at a given commit, a
> whole new set of SegmentReader instances is created and populated,
> effectively doubling the already large heap usage... if there were a way
> to somehow reuse readers for unchanged segments already pooled by
> IndexWriter, that would help tremendously here. But I don't think there's
> a way to link up the two sets, at least not in the Lucene version we are
> using (4.6.1) - is this correct?
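> For reference, this is roughly how we open a point-in-time reader today
> (simplified sketch; dir is our Directory, and our IndexDeletionPolicy
> retains the commits we search against):
>
>   import java.util.List;
>   import org.apache.lucene.index.DirectoryReader;
>   import org.apache.lucene.index.IndexCommit;
>
>   // Each call builds a brand-new set of SegmentReaders, even for
>   // segments the NRT reader already has open.
>   List<IndexCommit> commits = DirectoryReader.listCommits(dir);
>   DirectoryReader pointInTime = DirectoryReader.open(commits.get(0));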
>
> On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> This is surprising: unless you have an excessive number of unique
>> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>>
>> But you only have 12 unique fields?
>>
>> Can you post screenshots of the heap usage?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <vfunst...@gmail.com>
>> wrote:
>> > This is a follow-up to the earlier thread I started to understand the
>> > memory usage patterns of SegmentReader instances, but I decided to
>> > create a separate post since this issue is much more serious than the
>> > heap overhead created by the use of stored field compression.
>> >
>> > Here is the use case, once again. The index totals around 300M
>> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float fields,
>> > all of which are both indexed and stored. It is split into 4 shards,
>> > which are basically separate indices... if that matters. After the
>> > index is populated (but not optimized, since we don't do that), the
>> > overall heap usage taken up by Lucene is over 1 GB, much of which is
>> > taken up by instances of BlockTreeTermsReader. For instance, for the
>> > largest segment in one such index, the retained heap size of the
>> > internal tree map is around 50 MB. This is evident from heap dump
>> > analysis, which I have screenshots of that I can post here, if that
>> > helps. As there are many segments of various sizes in the index, as
>> > expected, the total heap usage for one shard stands at around 280 MB.
>> >
>> > Could someone shed some light on whether this is expected, and if so -
>> > how could I possibly trim down memory usage here? Is there a way to
>> > switch to a different terms index implementation, one that doesn't
>> > preload all the terms into RAM, or only does so partially, i.e. as a
>> > cache? I'm not sure I'm framing my questions correctly, as I'm
>> > obviously not an expert on Lucene's internals, but this is going to
>> > become a critical issue for large-scale use cases of our system.
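On the terms index question above: the BlockTree terms index is an FST per
field, per segment, held in RAM, and its size scales with the number of term
blocks. One knob that may shrink it is writing larger blocks. A sketch,
assuming you can reindex (a codec change only applies to newly written
segments); the 64/128 sizes are illustrative, the defaults are 25/48:

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
  import org.apache.lucene.codecs.lucene46.Lucene46Codec;

  // Larger min/max term-block sizes -> fewer blocks -> a smaller in-RAM
  // terms index FST, at some cost in term lookup speed.
  Codec codec = new Lucene46Codec() {
    private final PostingsFormat pf = new Lucene41PostingsFormat(64, 128);
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      return pf;
    }
  };
  indexWriterConfig.setCodec(codec);  // your IndexWriterConfig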