I'll follow up on my own question...
Let's say that we have 4 years of data, meaning there will be roughly 4 * 365 = 1460 unique terms for our sort field. For one index with, say, 30 million docs, the cache should use approximately 100 MB, or am I wrong? And thus for 6 indexes we would need approximately 600 MB for the caches (plus an additional ~100 MB every time we warm a new searcher and swap it out...)? As for string versus int or long, I don't really see any big gain in changing it, since 1460 * 10 bytes of extra memory doesn't really make much difference. Or?
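(For reference, this is the back-of-the-envelope math I'm basing that estimate on. It assumes the sort cache keeps one int ordinal per document plus the array of unique term strings, and the ~60 bytes per cached String is just a guess:)

public class SortCacheEstimate {
    public static void main(String[] args) {
        int maxDoc = 30000000;                     // docs in one index
        int uniqueTerms = 4 * 365;                 // ~1460 distinct date strings
        long ordinalBytes = (long) maxDoc * 4;     // int[] of per-doc ordinals, ~115 MB
        long termBytes = (long) uniqueTerms * 60;  // guessed ~60 bytes per cached String
        System.out.println((ordinalBytes + termBytes) / (1024 * 1024) + " MB per index");
    }
}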

I guess we should consider reducing the index size, or at least only allowing sorted searches on a subset of the index (or on a pruned version of it)? Would that be better for us? But then again, I assume there are much larger Lucene-based indexes out there than ours, and you guys must have some solution to this issue, right? :)

best regards,
 Aleksander


On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:

Hello, I've read a lot of threads on memory consumption and sorting, and I think I have a pretty good understanding of how things work, but I could still use some input here...

We currently have a system consisting of 6 different Lucene indexes (all with the same structure, so you could say it is a form of sharding). We use this approach because we want to be able to give users access to some of the indexes, but not necessarily all of them.

(We are planning to move to a Solr-based system, but for now we would like to solve this issue in our current Lucene-based system.)

The thing is, the indexes are rather big (ranging from 5 GB to 20 GB and 10 to 30 million entries per index). We keep one searcher object open per index, and when an index changes (new documents are added in batches several times a day), we update the searcher objects. The warmup procedure ran a couple of searches and things worked fine, BUT I realized that our application returns hits sorted by date by default, while the warmup procedure only ran non-sorted queries... so the first searches done by a user after an update were still slow (obviously).

To cope, I changed the warmup procedure to include a sorted search, and now users no longer notice slow queries. Good! But the problem at hand is that we are running into memory problems (and I understand that sorting consumes a lot of memory). Is there any "best practice" way to deal with this? The field we sort on is an untokenized text field representing the date, typically "2008-10-10". I am aware that sorting on string fields consumes a lot of memory, so should we change this field to something different? Would this help us with the memory problems?
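For concreteness, the changed warmup now does roughly the following, and the last two lines sketch how I imagine the int-based alternative would look (the field names, the index path and the match-all query are just placeholders, not our actual code):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;

public class WarmupSketch {
    public static void main(String[] args) throws Exception {
        // Warm the new searcher with the same kind of sorted query the users
        // run, so the FieldCache for the date field is populated before the
        // searcher is swapped in.
        IndexReader reader = IndexReader.open("/path/to/one/index");
        IndexSearcher warming = new IndexSearcher(reader);
        Sort byDate = new Sort(new SortField("date", SortField.STRING, true));
        warming.search(new MatchAllDocsQuery(), null, 10, byDate);
        // ...then swap "warming" in as the live searcher and close the old one.

        // Possible alternative: index the date in its own field as a zero-padded
        // yyyyMMdd value (e.g. "20081010") and sort it as an int, so the cache
        // holds a plain int[] instead of ordinals plus the term Strings.
        Sort byDateInt = new Sort(new SortField("dateInt", SortField.INT, true));
        warming.search(new MatchAllDocsQuery(), null, 10, byDateInt);
    }
}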

As a side note / curiosity question: does it matter whether we use the search method returning Hits versus the one returning TopFieldDocs? (We are not iterating over the results in any way when this memory issue occurs.)
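(To be clear, the two calls I'm comparing are roughly these, with 50 as an arbitrary cut-off:)

    Hits hits = searcher.search(query, sort);
    TopFieldDocs top = searcher.search(query, null, 50, sort);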

Thanks in advance for any guidance we may get.

Best regards,
  Aleksander M. Stensby






--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
[EMAIL PROTECTED]
