Assuming content is added in chronological order, with no updates to existing docs, couldn't you rely on the internal Lucene document id to give a chronological sort order? That would require no memory cache at all when sorting.
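For what it's worth, a minimal sketch of that idea against the Lucene API of the time (Sort.INDEXORDER orders hits by internal document id; the `searcher` and `query` arguments are assumed to be constructed elsewhere, and the helper name is made up for illustration):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;

public class DocIdOrderSearch {
    // Sort.INDEXORDER orders hits by internal document id, which matches
    // insertion order in an append-only index. It populates no FieldCache,
    // so sorting this way costs no extra per-document memory.
    static TopFieldDocs oldestFirst(IndexSearcher searcher, Query query, int n)
            throws java.io.IOException {
        return searcher.search(query, null, n, Sort.INDEXORDER);
    }
}
```

To get newest-first ordering you would instead reverse the comparison, e.g. `new Sort(new SortField(null, SortField.DOC, true))`.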
Querying across multiple indexes simultaneously, however, may present an added complication...

----- Original Message ----
From: Aleksander M. Stensby <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 10 October, 2008 13:51:50
Subject: Re: Question regarding sorting and memory consumption in lucene

I'll follow up on my own question...

Let's say that we have 4 years of data, meaning that there will be roughly 4 * 365 = 1460 unique terms for our sort field. For one index, let's say with 30 million docs, the cache should use approx 100 MB, or am I wrong? And thus for 6 indexes we would need approx 600 MB for the caches (and an additional 100 MB every time we warm a new searcher and swap it out)?

As far as string versus int or long goes, I don't really see any big gain in changing it, since 1460 * 10 bytes of extra memory doesn't really make much difference. Or?

I guess we should consider reducing the index size, or at least only allow sorted search on a subset of the index (or a pruned version of the index)? Would that be better for us? But then again, I assume that there are much larger Lucene-based indexes out there than ours, and you guys must have some solution to this issue, right? :)

best regards,
Aleksander

On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby <[EMAIL PROTECTED]> wrote:

> Hello, I've read a lot of threads now on memory consumption and sorting,
> and I think I have a pretty good understanding of how things work, but I
> could still use some input here.
>
> We currently have a system consisting of 6 different Lucene indexes (all
> have the same structure, so you could say it is a form of sharding). We
> currently use this approach because we want to be able to give users
> access to different indexes (but not necessarily all indexes), etc.
>
> (We are planning to move to a Solr-based system, but for now we would
> like to solve this issue with our current Lucene-based system.)
> The thing is, the indexes are rather big (ranging from 5 GB to 20 GB per
> index, and 10 to 30 million entries per index).
> We keep one searcher object open per index, and when the index is
> changed (new documents added in batches several times a day), we update
> the searcher objects.
>
> In the warmup procedure we did a couple of searches and things worked
> fine, BUT I realized that in our application we return hits sorted by
> date by default, and our warmup procedure did non-sorted queries... so
> the first searches done by the user after an update were still slow
> (obviously).
>
> To cope, I changed the warmup procedure to include a sorted search, and
> now the user will not notice slow queries. Good!
>
> But the problem at hand is that we are running into memory problems
> (and I understand that sorting does consume a lot of memory). Is there
> any "best practice" way to deal with this? The field we sort on is an
> un_indexed text field representing the date, typically "2008-10-10".
> I am aware that string field sorting consumes a lot of memory, so
> should we change this field to something different? Would this help us
> with the memory problems?
>
> As a sidenote / curiosity question: does it matter if we use the search
> method returning Hits versus the search method returning TopFieldDocs?
> (We are not iterating over them in any way when this memory issue occurs.)
>
> Thanks in advance for any guidance we may get.
>
> Best regards,
> Aleksander M. Stensby
>
> --
> Aleksander M. Stensby
> Senior Software Developer
> Integrasco A/S
> +47 41 22 82 72
> [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
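The back-of-the-envelope numbers in the thread can be checked directly. When sorting on a string field, the FieldCache's dominant cost is one int per document in the order array, not the unique term strings, which is why the 1460 date terms barely matter and why switching the field to int or long saves little. A self-contained sketch (the 60-byte per-String overhead is a rough assumption, not a measured figure):

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long docsPerIndex = 30_000_000L; // from the thread: ~30 million docs per index
        long bytesPerDoc = 4;            // one int per document in the cache's order array
        long uniqueTerms = 4 * 365;      // one date term per day over four years
        long bytesPerTerm = 60;          // rough per-String overhead (assumption)

        long perIndex = docsPerIndex * bytesPerDoc + uniqueTerms * bytesPerTerm;
        long sixIndexes = 6 * perIndex;

        System.out.println(perIndex / (1024 * 1024) + " MB per index");     // ~114 MB
        System.out.println(sixIndexes / (1024 * 1024) + " MB for all six"); // ~687 MB
    }
}
```

This lands close to the ~100 MB per index and ~600 MB total estimated in the thread, and the term-string portion (1460 * 60 bytes, under 0.1 MB) confirms Aleksander's intuition that the string-vs-int choice makes little difference for this field.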