Hmm, the screenshot didn't make it... can you post a link? If you are using an
NRT reader, then when a new one is opened, it won't open new SegmentReaders
for all segments, just for the segments flushed/merged since the last reader
was opened. So for the N commit points you have readers open for, those
readers will share SegmentReaders for the segments they have in common.
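Roughly, the reopen path looks like this (a minimal sketch against the 4.x
NRT APIs, assuming you already have an IndexWriter named writer):

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;

  // First NRT reader; its SegmentReaders are pooled by the writer.
  DirectoryReader reader = DirectoryReader.open(writer, true);

  // Later: reopen. Only newly flushed/merged segments get fresh
  // SegmentReaders; unchanged segments are shared with the old reader.
  DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
  if (newReader != null) {
    reader.close();
    reader = newReader;
  }

ReaderManager does essentially this for you and handles the ref-counting.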
How many unique fields are you adding?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
> Mike,
>
> Here's the screenshot; not sure if it will go through as an attachment
> though - if not, I'll post it as a link. Please ignore the altered package
> names, since Lucene is shaded in as part of our build process.
>
> Some more context about the use case. Yes, the terms are pretty much
> unique; the schema for the data set is actually borrowed from here:
> https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> set, with a couple of other fields added by us. The values for the fields
> are generated almost randomly, though some string fields are picked at
> random from a fixed dictionary.
>
> Also, this type of heap footprint might be tolerable if it stayed
> relatively constant throughout the system's life cycle (given, of course,
> that the index set stays more or less static). However, what happens here
> is that one IndexReader reference is maintained by ReaderManager as an NRT
> reader. But we would also like to support the ability to execute searches
> against specific index commit points, ideally in parallel. As you might
> imagine, as soon as a new DirectoryReader is opened at a given commit, a
> whole new set of SegmentReader instances is created and populated,
> effectively doubling the already large heap usage... if there were a way
> to somehow reuse readers for unchanged segments already pooled by
> IndexWriter, that would help tremendously here. But I don't think there's
> a way to link up the two sets, at least not in the Lucene version we are
> using (4.6.1) - is this correct?
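> For reference, this is roughly how we open a point-in-time reader today
> (simplified sketch; dir is our Directory, and our IndexDeletionPolicy
> retains the commits we search against):
>
>   import java.util.List;
>   import org.apache.lucene.index.DirectoryReader;
>   import org.apache.lucene.index.IndexCommit;
>
>   // Each call builds a brand-new set of SegmentReaders, even for
>   // segments the NRT reader already has open.
>   List<IndexCommit> commits = DirectoryReader.listCommits(dir);
>   DirectoryReader pointInTime = DirectoryReader.open(commits.get(0));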
>
> On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> This is surprising: unless you have an excessive number of unique
>> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>>
>> But you only have 12 unique fields?
>>
>> Can you post screenshots of the heap usage?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <vfunst...@gmail.com>
>> wrote:
>> > This is a follow-up to the earlier thread I started to understand the
>> > memory usage patterns of SegmentReader instances, but I decided to
>> > create a separate post since this issue is much more serious than the
>> > heap overhead created by the use of stored field compression.
>> >
>> > Here is the use case, once again. The index totals around 300M
>> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float fields,
>> > all of which are both indexed and stored. It is split into 4 shards,
>> > which are basically separate indices... if that matters. After the
>> > index is populated (but not optimized, since we don't do that), the
>> > overall heap usage taken up by Lucene is over 1 GB, much of which is
>> > taken up by instances of BlockTreeTermsReader. For instance, for the
>> > largest segment in one such index, the retained heap size of the
>> > internal tree map is around 50 MB. This is evident from heap dump
>> > analysis, which I have screenshots of that I can post here, if that
>> > helps. As there are many segments of various sizes in the index, as
>> > expected, the total heap usage for one shard stands at around 280 MB.
>> >
>> > Could someone shed some light on whether this is expected, and if so -
>> > how could I possibly trim down memory usage here? Is there a way to
>> > switch to a different terms index implementation, one that doesn't
>> > preload all the terms into RAM, or only does so partially, i.e. as a
>> > cache? I'm not sure I'm framing my questions correctly, as I'm
>> > obviously not an expert on Lucene's internals, but this is going to
>> > become a critical issue for large-scale use cases of our system.
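On the terms index question above: the BlockTree terms index is an FST per
field, per segment, held in RAM, and its size scales with the number of term
blocks. One knob that may shrink it is writing larger blocks. A sketch,
assuming you can reindex (a codec change only applies to newly written
segments); the 64/128 sizes are illustrative, the defaults are 25/48:

  import org.apache.lucene.codecs.Codec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
  import org.apache.lucene.codecs.lucene46.Lucene46Codec;

  // Larger min/max term-block sizes -> fewer blocks -> a smaller in-RAM
  // terms index FST, at some cost in term lookup speed.
  Codec codec = new Lucene46Codec() {
    private final PostingsFormat pf = new Lucene41PostingsFormat(64, 128);
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      return pf;
    }
  };
  indexWriterConfig.setCodec(codec);  // your IndexWriterConfig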