The big challenge with global sort ords is handling reopen, i.e., apps that need to frequently reopen their index.

If you have a static index, then this approach is a good one -- you make a one-time investment, after indexing, to compute your global ords and store them on disk. Then at search time you load the ords in, and you can make a FieldComparator that taps into them (it just has to re-base the per-segment docIDs). No String values are needed... Also, once we've done https://issues.apache.org/jira/browse/LUCENE-1990 (anyone out there wanna take a stab? A nice self-contained utility API is needed for Lucene...), that ord array can be very compact for cases where the number of unique Strings is smallish. There's no reason to spend a full 4 bytes per document...
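To sketch what "taps into them" means: something along these lines (untested, and the class and field names are made up; the FieldComparator method signatures vary a bit across versions, and loading the global ord array from disk is left out):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldComparator;

  // Compares docs purely by their precomputed global ord -- no String
  // values are ever touched at search time.
  final class GlobalOrdComparator extends FieldComparator {
    private final int[] globalOrds; // globalOrds[docBase + doc] = sort position
    private final int[] slotOrds;   // ords of the docs held in collector slots
    private int docBase;
    private int bottomOrd;

    GlobalOrdComparator(int[] globalOrds, int numHits) {
      this.globalOrds = globalOrds;
      this.slotOrds = new int[numHits];
    }

    public int compare(int slot1, int slot2) {
      return slotOrds[slot1] - slotOrds[slot2];
    }

    public int compareBottom(int doc) {
      return bottomOrd - globalOrds[docBase + doc];
    }

    public void copy(int slot, int doc) {
      slotOrds[slot] = globalOrds[docBase + doc];
    }

    public void setBottom(int slot) {
      bottomOrd = slotOrds[slot];
    }

    public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase; // re-basing is the only per-segment work
    }

    public Comparable value(int slot) {
      return new Integer(slotOrds[slot]); // the ord, not the String
    }
  }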
Mike

On Wed, Dec 9, 2009 at 6:46 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Thanks for the heads-up, TCK. The Dietz & Sleator article I found at
> http://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf
> looks very interesting.
>
> String sorting in Lucene is indeed fairly expensive and we've experimented
> with two solutions to this, neither of which is a silver bullet.
>
> 1) Store the order, not the content.
>
> Like the Lucene default sorter, we fetch all the terms for the sort field
> on the first sort. The docIDs are mapped to the terms and a complete sort
> is performed (using a custom Collator: we want to avoid using
> CollationKeys, as they take too much memory. But that's another story).
> When the sort is finished, we have an integer array where each entry is
> the relative position for a given document (referenced by docID). We then
> de-reference the loaded terms so that the memory is freed.
>
> Determining the order of two documents after the initial build is
> extremely simple: order[docIDa] - order[docIDb] (or maybe it was the
> other way around, I forget).
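> To make that concrete: after the build, everything the comparator needs
> is one int per document (a tiny sketch; the method name is made up):
>
>   // order[docID] = relative position of that document's term in sorted order
>   private int[] order;
>
>   // comparing two documents is a single subtraction, no String access at all
>   public int compareDocs(int docIDa, int docIDb) {
>     return order[docIDa] - order[docIDb];
>   }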
>
> Cons: Large temporary memory overhead during the initial sort, though not
> as much as with CollationKeys. Slower initial sort than Lucene's built-in
> one.
> Pros: Very fast subsequent sorts, very low memory overhead when running.
> Joker: This approach could be used to post-process a generated index,
> storing the orders as a file along with the index, then pushing the
> complete package to searchers. This would lower the memory requirements
> for the searchers significantly.
>
>
> 2) As above, but use a multipass order builder.
>
> For the first sort, the order must be calculated: a sliding-window sorter
> is created with a buffer of a fixed size. The window slides through all
> terms for the sort field and extracts the top X. The order array is
> updated for the documents that have one of these terms. The sliding is
> repeated multiple times, with terms ordered before the last term of the
> previous iteration being ignored.
>
> Cons: _Very_ slow (too slow in the current implementation) order build.
> Pros: Same as above.
> Joker: The buffer size determines memory use vs. order build time.
>
>
> The multipass approach looks promising, but requires more work to get to
> a usable state. Right now it takes minutes to build the order array for
> half a million documents, with a buffer size requiring 5 iterations. If I
> ever get it to work, I'll be sure to share it.
>
> Regards,
> Toke Eskildsen
>
> ________________________________________
> From: TCK [moonwatcher32...@gmail.com]
> Sent: 09 December 2009 22:58
> To: java-user@lucene.apache.org
> Subject: Re: heap memory issues when sorting by a string field
>
> Thanks, Mike, for opening this jira ticket and for your patch. Explicitly
> removing the entry from the WHM definitely does reduce the number of GC
> cycles taken to free the huge StringIndex objects that get created when
> doing a sort by a string field.
>
> But I'm still trying to figure out why it is necessary for Lucene to load
> into memory the sorted-by field of every single document in the index
> (which is what makes the StringIndex objects so big) on the first query.
> Why isn't it sufficient to load the field values for only the documents
> that match the given search criteria? Perhaps by using an
> order-maintenance data structure (Dietz & Sleator, or Bender et al.)
> coupled with a balanced search tree, instead of the simple lookup and
> order arrays currently used in StringIndex, we could grow it dynamically
> as needed by successive queries rather than loading everything on the
> first query.
>
> Apologies if this is a naive question... I'm a newbie to the Lucene code
> and can't wait for the new Lucene book to come out :-)
>
> -TCK
>
>
> On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> I've opened LUCENE-2135.
>>
>> Mike
>>
>> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>> > This is a rather disturbing implementation detail of WeakHashMap: it
>> > needs one extra step (invoking one of its methods) for its weak keys
>> > to be reclaimable.
>> >
>> > Maybe on IndexReader.close(), Lucene should go and evict all entries
>> > in the FieldCache associated with that reader. I.e., step through the
>> > sub-readers and, if they are truly closed as well (not shared with
>> > other readers), evict. I'll open an issue.
>> >
>> > Even in TCK's code fragment, it's not until the final line is done
>> > executing that the cache key even loses all hard references, because
>> > it's that line that assigns to sSearcher, replacing the strong
>> > reference to the old searcher. Inserting sSearcher = null prior to
>> > that would drop the hard reference sooner, but because of this impl
>> > detail of WeakHashMap, something would still have to touch the map
>> > (eg, a warmup query that hits the field cache) before it's
>> > reclaimable.
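>> >
>> > Roughly like this (an untested sketch using TCK's variable names;
>> > warmupQuery and "myField" are placeholders for whatever sorted query
>> > you'd normally warm with):
>> >
>> >   IndexReader oldReader = sSearcher.getIndexReader();
>> >   sSearcher = null;  // drop the strong reference early
>> >   oldReader.close();
>> >   sSearcher = new IndexSearcher(IndexReader.open(newDirectory));
>> >   // hitting the field cache both warms the new reader and makes the
>> >   // WeakHashMap expunge the stale entry for the old one
>> >   sSearcher.search(warmupQuery, null, 10,
>> >       new Sort(new SortField("myField", SortField.STRING)));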
>> >
>> > Mike
>> >
>> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-l...@worldware.com> wrote:
>> >> Hi -
>> >>
>> >> If I understand correctly, WeakHashMap does not free the memory for
>> >> the value (the cached data) when the key is nulled, or even when the
>> >> key is garbage collected.
>> >>
>> >> It requires one more step: a method on the WeakHashMap must be called
>> >> to allow it to release its hard reference to the cached data. It
>> >> appears that most methods in WeakHashMap end up calling
>> >> expungeStaleEntries, which will clear the hard reference. But you have
>> >> to call some method on the map before the memory is eligible for
>> >> garbage collection.
>> >>
>> >> So it requires four stages to free the cached data: null the key; a GC
>> >> to clear the weak reference to the key; a call to some method on the
>> >> map; then the next GC cycle should free the value.
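>> >>
>> >> You can watch the four stages in a tiny standalone program (not Lucene
>> >> code, and System.gc() is of course only a hint to the VM):
>> >>
>> >>   import java.util.Map;
>> >>   import java.util.WeakHashMap;
>> >>
>> >>   public class WeakHashMapDemo {
>> >>     public static void main(String[] args) {
>> >>       Map map = new WeakHashMap();
>> >>       Object key = new Object();
>> >>       map.put(key, new byte[50 * 1024 * 1024]); // the big "cache entry"
>> >>       key = null;   // stage 1: null the key
>> >>       System.gc();  // stage 2: the weak reference to the key is cleared
>> >>       // the byte[] is still strongly referenced by the map's entry here
>> >>       map.size();   // stage 3: any method call runs expungeStaleEntries
>> >>       System.gc();  // stage 4: now the value is eligible for collection
>> >>     }
>> >>   }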
>> >>
>> >> So it seems possible that you could end up with double memory usage
>> >> for a time. If you don't have a GC between the time you close the old
>> >> reader and the time you start to load the field cache entry for the
>> >> next reader, then the key may still be hanging around uncollected.
>> >>
>> >> At that point, it may run a GC when you allocate the new cache, but
>> >> that's only the first GC. It can't free the cached data until after
>> >> the next call to expungeStaleEntries, so for a while you have both
>> >> caches around.
>> >>
>> >> This extra usage could cause things to move into tenured space. Could
>> >> this be causing your problem?
>> >>
>> >> A workaround would be to cause some method to be called on the
>> >> WeakHashMap. You don't want to call get(), since that will try to
>> >> populate the cache. Maybe try putting a small value into the cache,
>> >> doing a GC, and seeing if your memory drops then.
>> >>
>> >> Tom
>> >>
>> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >>
>> >>> Thanks for the response. But I'm definitely calling close() on the
>> >>> old reader and opening a new one (not using reopen). Also, to
>> >>> simplify the analysis, I did my test with a single-threaded requester
>> >>> to eliminate any concurrency issues.
>> >>>
>> >>> I'm doing:
>> >>>
>> >>>   sSearcher.getIndexReader().close();
>> >>>   sSearcher.close(); // this actually seems to be a no-op
>> >>>   IndexReader newIndexReader = IndexReader.open(newDirectory);
>> >>>   sSearcher = new IndexSearcher(newIndexReader);
>> >>>
>> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are
>> >>> there any plans to replace the HashMaps used for the innerCaches with
>> >>> an actual size-bounded cache with some eviction policy (perhaps
>> >>> EhCache or something)?
>> >>>
>> >>> Thanks again,
>> >>> TCK
>> >>>
>> >>>
>> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson
>> >>> <erickerick...@gmail.com> wrote:
>> >>>
>> >>> > What this sounds like is that you're not really closing your
>> >>> > readers even though you think you are. Sorting indeed uses up
>> >>> > significant memory when it populates internal caches and keeps
>> >>> > them around for later use (which is one of the reasons that warming
>> >>> > queries matter). But if you really do close the reader, I'm pretty
>> >>> > sure the memory should be GC-able.
>> >>> >
>> >>> > One thing that trips people up is IndexReader.reopen(). If it
>> >>> > returns a reader different from the original, you *must* close the
>> >>> > old one. If you don't, the old reader is still hanging around and
>> >>> > the memory won't be returned... An example (adapted from the
>> >>> > Javadocs):
>> >>> >
>> >>> >   IndexReader reader = ...
>> >>> >   ...
>> >>> >   IndexReader newReader = reader.reopen();
>> >>> >   if (newReader != reader) {
>> >>> >     ... // reader was reopened
>> >>> >     reader.close();
>> >>> >   }
>> >>> >   reader = newReader;
>> >>> >   ...
>> >>> >
>> >>> > If this is irrelevant, could you post your close/open code?
>> >>> >
>> >>> > HTH
>> >>> >
>> >>> > Erick
>> >>> >
>> >>> >
>> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >>> >
>> >>> > > Hi,
>> >>> > > I'm having heap memory issues when I do Lucene queries involving
>> >>> > > sorting by a string field. Such queries seem to load a lot of
>> >>> > > data into the heap. Moreover, Lucene seems to hold on to
>> >>> > > references to this data even after the index reader has been
>> >>> > > closed and a full GC has been run. Some of the consequences of
>> >>> > > this are that in my generational heap configuration a lot of
>> >>> > > memory gets promoted to tenured space each time I close the old
>> >>> > > index reader and open and query using a new one, and the tenured
>> >>> > > space eventually gets fragmented, causing a lot of promotion
>> >>> > > failures that result in JVM hangs while the JVM does
>> >>> > > stop-the-world GCs.
>> >>> > >
>> >>> > > Does anyone know any workarounds to avoid these memory issues
>> >>> > > when doing such Lucene queries?
>> >>> > >
>> >>> > > My profiling showed that even after a full GC, Lucene is holding
>> >>> > > on to a lot of references to field value data, notably via
>> >>> > > FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that the
>> >>> > > WeakHashMap readerCaches are using unbounded HashMaps as the
>> >>> > > innerCaches, and I used reflection to replace these innerCaches
>> >>> > > with dummy empty HashMaps, but I'm still seeing the same
>> >>> > > behavior. I wondered if anyone has gone through these same issues
>> >>> > > before and could offer any advice.
>> >>> > >
>> >>> > > Thanks a lot,
>> >>> > > TCK