Re: heap memory issues when sorting by a string field

2009-12-17 Thread Michael McCandless
I think this'd make a nice contribution -- eg it could be bundled up as a FieldComparator impl, eg LowMemoryStringComparator, that would compute the global ords in multiple passes with limited RAM usage. It'd give users the space/time tradeoff... Mike On Mon, Dec 14, 2009 at 9:09 AM, Toke Eskild

Re: heap memory issues when sorting by a string field

2009-12-14 Thread Toke Eskildsen
On Fri, 2009-12-11 at 14:53 +0100, Michael McCandless wrote: > How long does Lucene take to build the ords for the toplevel reader? > > You should be able to just time FieldCache.getStringIndex(topLevelReader). > > I think your 8.5 seconds for first Lucene search was with the > StringIndex compute

Re: heap memory issues when sorting by a string field

2009-12-11 Thread Michael McCandless
. The order-array is updated for the documents that >> > has >> > one of these terms. The sliding is repeated multiple times, where terms >> > ordered >> > before the last term of the previous iteration are ignored. >> > >> > Cons: _Very_ slow (too slow in the current implementation) order bu
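A minimal sketch of the multipass idea quoted above, not Toke's implementation: each pass keeps only the `bufferSize` smallest terms that sort after the last term of the previous pass, then updates the order-array for documents holding one of those terms. The class `MultiPassOrds`, the method `buildOrder`, the `String[]` standing in for a re-readable term stream, and the use of -1 for "no term" are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical illustration; not part of Lucene. */
final class MultiPassOrds {

    /**
     * Returns, for each document, the rank of its term among all distinct terms,
     * holding at most bufferSize terms in memory during any pass.
     * docTerms[i] is the term of document i; null means the document has no term.
     */
    static int[] buildOrder(String[] docTerms, int bufferSize) {
        if (bufferSize < 1) {
            throw new IllegalArgumentException("bufferSize must be >= 1");
        }
        int[] order = new int[docTerms.length];
        Arrays.fill(order, -1);          // assumption: -1 marks "no term"
        String lastTerm = null;          // largest term resolved by earlier passes
        int nextOrd = 0;                 // first ord to hand out in this pass

        while (true) {
            // Scan 1: slide the window to the next bufferSize smallest unresolved terms.
            TreeSet<String> window = new TreeSet<>();
            for (String term : docTerms) {
                if (term == null) continue;
                // Terms ordered at or before the previous window's last term are ignored.
                if (lastTerm != null && term.compareTo(lastTerm) <= 0) continue;
                window.add(term);
                if (window.size() > bufferSize) {
                    window.pollLast();   // drop the largest to stay within the buffer
                }
            }
            if (window.isEmpty()) break; // every term has been assigned an ord

            // Scan 2: update the order-array for documents whose term is in the window.
            String[] sorted = window.toArray(new String[0]);
            for (int doc = 0; doc < docTerms.length; doc++) {
                String term = docTerms[doc];
                if (term == null) continue;
                int pos = Arrays.binarySearch(sorted, term);
                if (pos >= 0) order[doc] = nextOrd + pos;
            }
            nextOrd += sorted.length;
            lastTerm = sorted[sorted.length - 1];
        }
        return order;
    }

    public static void main(String[] args) {
        String[] terms = {"pear", "apple", "fig", "apple", "kiwi", null};
        System.out.println(Arrays.toString(buildOrder(terms, 2)));
    }
}
```

The number of passes is roughly the number of distinct terms divided by `bufferSize`, which is the memory versus build-time tradeoff the thread describes.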

Re: heap memory issues when sorting by a string field

2009-12-11 Thread Toke Eskildsen
> > > Cons: _Very_ slow (too slow in the current implementation) order build. > > Pros: Same as above. > > Joker: The buffer size determines memory use vs. order build time. > > > > > > The multipass approach looks promising, but requires more work to get

Re: heap memory issues when sorting by a string field

2009-12-10 Thread Toke Eskildsen
> > Joker: The buffer size determines memory use vs. order build time. > > > > > > The multipass approach looks promising, but requires more work to get to a > > usable state. Right now it takes minutes to build the order-array for half a > > million documents, with a buffer size req

Re: heap memory issues when sorting by a string field

2009-12-10 Thread Michael McCandless
On Thu, Dec 10, 2009 at 2:05 AM, Ganesh wrote: > I think, This problem will happen for all sorted fields. I am sorting on > integer field. Integer field should take much less RAM than String, today, for sorting. And there's no efficiency gained by doing this globally (per segment is just fine).

Re: heap memory issues when sorting by a string field

2009-12-10 Thread Michael McCandless
r-array for half a > million documents, with a buffer size requiring 5 iterations. If I ever get > it to > work, I'll be sure to share it. > > Regards, > Toke Eskildsen > > ________ > From: TCK [moonwatcher32...@gmail.com] > Sent: 09 December 20

Re: heap memory issues when sorting by a string field

2009-12-09 Thread Ganesh
e to share it. Regards, Toke Eskildsen From: TCK [moonwatcher32...@gmail.com] Sent: 09 December 2009 22:58 To: java-user@lucene.apache.org Subject: Re: heap memory issues when sorting by a string field Thanks Mike for opening this jira ticket and for your p

RE: heap memory issues when sorting by a string field

2009-12-09 Thread Toke Eskildsen
rds, Toke Eskildsen From: TCK [moonwatcher32...@gmail.com] Sent: 09 December 2009 22:58 To: java-user@lucene.apache.org Subject: Re: heap memory issues when sorting by a string field Thanks Mike for opening this jira ticket and for your patch. Explicitly removing the entry from the

Re: heap memory issues when sorting by a string field

2009-12-09 Thread Michael McCandless
It's not that it's "necessary" -- this is just how Lucene's sorting has always worked ;) But, it's just software! You could whip up a patch... I'm not familiar with the order-maintenance problem & solutions offhand, but it certainly sounds interesting. One issue is that loading only certain val

Re: heap memory issues when sorting by a string field

2009-12-09 Thread TCK
Thanks Mike for opening this jira ticket and for your patch. Explicitly removing the entry from the WHM definitely does reduce the number of GC cycles taken to free the huge StringIndex objects that get created when doing a sort by a string field. But I'm still trying to figure out why it is neces

Re: heap memory issues when sorting by a string field

2009-12-08 Thread Michael McCandless
I've opened LUCENE-2135. Mike On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless wrote: > This is a rather disturbing implementation detail of WeakHashMap, that > it needs the one extra step (invoking one of its methods) for its weak > keys to be reclaimable. > > Maybe on IndexReader.close(), Lu

Re: heap memory issues when sorting by a string field

2009-12-08 Thread Michael McCandless
This is a rather disturbing implementation detail of WeakHashMap, that it needs the one extra step (invoking one of its methods) for its weak keys to be reclaimable. Maybe on IndexReader.close(), Lucene should go and evict all entries in the FieldCache associated with that reader. Ie, step throug
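The eviction-on-close idea can be sketched roughly as follows. This is a hypothetical stand-alone cache, not Lucene's FieldCache: the class `ReaderKeyedCache` and its `put`, `get`, and `purge` methods are invented for illustration, and a plain `HashMap` replaces the `WeakHashMap` so that entries are released by an explicit call rather than by garbage collection.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not Lucene's FieldCache. */
final class ReaderKeyedCache<V> {

    // Outer key: an object identifying a reader; inner key: the field name.
    private final Map<Object, Map<String, V>> byReader = new HashMap<>();

    synchronized void put(Object readerKey, String field, V value) {
        byReader.computeIfAbsent(readerKey, k -> new HashMap<>()).put(field, value);
    }

    synchronized V get(Object readerKey, String field) {
        Map<String, V> fields = byReader.get(readerKey);
        return fields == null ? null : fields.get(field);
    }

    /**
     * Drops every entry cached for readerKey and returns how many were dropped.
     * A reader's close() method would call this so the cached values become
     * unreachable immediately instead of waiting on weak-reference cleanup.
     */
    synchronized int purge(Object readerKey) {
        Map<String, V> removed = byReader.remove(readerKey);
        return removed == null ? 0 : removed.size();
    }

    public static void main(String[] args) {
        ReaderKeyedCache<int[]> cache = new ReaderKeyedCache<>();
        Object reader = new Object();
        cache.put(reader, "title", new int[1024]);
        System.out.println("purged entries: " + cache.purge(reader));
    }
}
```

The cost of this design is that the cache no longer cleans up after readers that are never closed, so it relies on callers to close them reliably.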

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
TCK, CSIndexInput is returned by SegmentReader.getFieldCacheKey(). If you think it's an issue, then it'd be good to open an issue and submit some code as a patch, maybe a test case showing the WHM isn't removing values like it's supposed to. Jason On Mon, Dec 7, 2009 at 10:45 PM, TCK wrote: > T

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
> It's an apache license - but you mentioned something about no third party > libraries. Is that a policy for Lucene? Pretty much... Though one can always submit a patch anyways. On Mon, Dec 7, 2009 at 4:57 PM, Tom Hill wrote: > Hey, that's a nice little Class! I hadn't see it before. But it so

Re: heap memory issues when sorting by a string field

2009-12-07 Thread TCK
Thanks for the feedback guys. The evidence I have collected does point to an issue either in the Java WeakHashMap implementation or in Lucene's use of it. In particular, I used reflection to replace the WeakHashMap instances with my own dummy Map that does a no-op for the put operation, and althoug
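A dummy Map of the kind described might look like the sketch below. The class name `NoOpPutMap` is hypothetical, and the reflection step that swaps it into Lucene's cache is omitted because the field names involved are not given in the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical diagnostic map that silently discards everything put into it. */
final class NoOpPutMap<K, V> extends HashMap<K, V> {
    private static final long serialVersionUID = 1L;

    @Override
    public V put(K key, V value) {
        return null; // pretend there was no previous mapping and store nothing
    }

    @Override
    public void putAll(Map<? extends K, ? extends V> m) {
        // intentionally empty: nothing is ever cached
    }

    public static void main(String[] args) {
        Map<String, int[]> cache = new NoOpPutMap<>();
        cache.put("title", new int[1024]);
        System.out.println("entries after put: " + cache.size());
    }
}
```

Only `put` and `putAll` are overridden here; other mutators inherited from `HashMap`, such as `putIfAbsent`, would still store entries, so this is adequate only if the code under test uses plain `put`.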

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Tom Hill
Hey, that's a nice little Class! I hadn't seen it before. But it sounds like the asynchronous cleanup might deal with the problem I mentioned above (but I haven't looked at the code yet). It's an apache license - but you mentioned something about no third party libraries. Is that a policy for Lucen

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
I wonder if Google Collections (even though we don't use third party libraries) concurrent map, which supports weak keys, handles the removal of weakly referenced keys in a more elegant way than Java's WeakHashMap? On Mon, Dec 7, 2009 at 4:38 PM, Tom Hill wrote: > Hi - > > If I understand correct

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Tom Hill
Hi - If I understand correctly, WeakHashMap does not free the memory for the value (cached data) when the key is nulled, or even when the key is garbage collected. It requires one more step: a method on WeakHashMap must be called to allow it to release its hard reference to the cached data. It ap

Re: heap memory issues when sorting by a string field

2009-12-07 Thread TCK
Thanks for the response. But I'm definitely calling close() on the old reader and opening a new one (not using reopen). Also, to simplify the analysis, I did my test with a single-threaded requester to eliminate any concurrency issues. I'm doing: sSearcher.getIndexReader().close(); sSearcher.close

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Erick Erickson
What this sounds like is that you're not really closing your readers even though you think you are. Sorting indeed uses up significant memory when it populates internal caches and keeps it around for later use (which is one of the reasons that warming queries matter). But if you really do close the