Ref 1: I was just about to show you a link at Sun, but I realise that it was my misread! OK, so the maximum heap is 2GB on a 32-bit Linux platform, which doubles the numbers, and yes indeed 64 bits seems like a good idea, if keeping sort indexes in RAM is a good use of resources. But there must be a better alternative to using 4 bytes of RAM per document per sort field.
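To put that "4 bytes per document per sort field" in concrete terms, here is a back-of-the-envelope sketch; the document count and field count are hypothetical, not taken from the thread:

```java
public class SortCacheFootprint {
    public static void main(String[] args) {
        long numDocs = 100000000L; // hypothetical index of 100 million documents
        int bytesPerValue = 4;     // one int per document, per sort field
        int sortFields = 2;        // e.g. a timestamp field and a hash field

        long totalBytes = numDocs * bytesPerValue * sortFields;
        System.out.println(totalBytes / (1024 * 1024) + " MB of heap for sort values");
    }
}
```

At 100 million documents and two int sort fields that is roughly 762 MB of heap, which is why a 2GB limit starts to pinch quickly.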
Ref 2: "holding a laptop in both hands, and using the corner of it to type letters on the keyboard of another computer."... I like that analogy... I may even find a use for my laptop now :-) I take your point that Berkeley DB would be much less clumsy, but an application that's already using a relational database for other purposes might as well use that relational database, no?

I'm not really with you on the random access file, Chris. Here's where I am up to with my [mis-]understanding... I want to sort on two terms. Happily these can be ints (the first is an int corresponding to a 10-minute timestamp "YYMMDDHHI", and the second int is a hash of a string, used to group similar documents together within those 10-minute timestamps). When I initially warm up the FieldCache (the first search after opening the Searcher), I start by generating two random access files with int values at offsets corresponding to document IDs: the first file would have ints corresponding to the timestamp and the second would have ints corresponding to the hash. I'd then need to generate a third file which is equivalent to an array dimensioned by document ID, with document IDs in compound sort order?? In a big index, it will take a while to walk through all of the documents to generate the first two random access files, and the sort required to generate the third, sorted file is going to be hard work.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: 31 July 2006 09:34
To: java-user@lucene.apache.org
Subject: Re: Sorting

1) I didn't know there were any JVMs that limited the heap size to 1GB ... a 32-bit address space would impose a hard limit of 4GB, and I've heard that Windows limits processes to 2GB, but I don't know of any JVMs that have 1GB limits. If you really need to deal with indexes big enough for that to make a difference, you probably want to look into 64-bit hardware.

2) ...
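For what it's worth, the per-field random access files described above can be sketched with plain `java.io.RandomAccessFile`: the int for docId N lives at byte offset N * 4, so a lookup is one seek plus a 4-byte read. The file name and sample values here are made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class DocIdIntFile {

    // store the sort value for a document at offset docId * 4
    static void put(RandomAccessFile f, int docId, int value) throws IOException {
        f.seek((long) docId * 4);
        f.writeInt(value);
    }

    // look a value up by docId with one seek and one 4-byte read
    static int get(RandomAccessFile f, int docId) throws IOException {
        f.seek((long) docId * 4);
        return f.readInt();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("timestamp-sort", ".bin");
        file.deleteOnExit();
        RandomAccessFile f = new RandomAccessFile(file, "rw");
        put(f, 5, 60731093); // doc 5 -> hypothetical "YYMMDDHHI" key 060731093
        put(f, 9, 60731101); // doc 9 -> a later 10-minute bucket
        System.out.println(get(f, 5)); // prints 60731093
        f.close();
    }
}
```

One such file per sort field gives the docId-to-value lookup; it does not by itself solve the compound-ordered third file, which would still need an external sort.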
: We're going to need to maintain a set of sort indexes for documents in a
: large index too, and I'm interested in suggestions for the best/easiest
: way to maintain a non-RAM-based (or not entirely RAM-based) sort index
: which is external to Lucene. Would using MySQL for sort indexing be "a
: sledgehammer to crack a nut", I wonder? I've not really thought through
: the RAMifications (sorry!) of this approach. I wonder if anyone else
: here has attempted to integrate an external sort using a database?

The analogy that comes to mind for me is not "a sledgehammer to crack a nut" ... more along the lines of "holding a laptop in both hands, and using the corner of it to type letters on the keyboard of another computer." Using a relational DB in conjunction with Lucene just to do some sorting on disk seems like a really gratuitous and unnecessary use of a relational DB.

The only reason field sorting in Lucene uses a lot of RAM is because of the FieldCache, which provides an easy way to look up the sort value for a given doc during hit collection in order to rank them in a priority queue -- namely an array indexed by docId. You could just as easily store that data on disk; you just need an API that lets you look things up by numeric id. A Berkeley DB "map" comes to mind ... or even random access files where the value is stored at an offset based on the docId (there would be some trickiness if you wanted String sorting, but it would work great for numerics). This would eliminate the high RAM usage, but would be a lot slower because of the disk access (especially on the first search, when the "FieldCache" was being built).

Alternately, if you assume your result sets are going to be "small", you could collect all of the docIds into a set and then iterate over a complete pass of a TermEnum/TermDocs for your field, looking up the sort values for each match -- in essence doing the same work as when building the FieldCache on each search, but only for the docs that match that search.
Really low memory usage, no additional disk usage -- just much slower.

-Hoss
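The second suggestion above can be sketched without the Lucene APIs; here a TreeMap stands in for the TermEnum/TermDocs walk (values in term order, each with its posting docIds), and all the data is hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MatchOnlySortValues {
    public static void main(String[] args) {
        // Stand-in for a TermEnum/TermDocs pass over one field: sort values
        // in term order, each mapped to the docIds carrying that value.
        TreeMap<Integer, int[]> termToDocs = new TreeMap<Integer, int[]>();
        termToDocs.put(60731093, new int[]{0, 4});
        termToDocs.put(60731094, new int[]{2, 7});
        termToDocs.put(60731095, new int[]{1});

        // The docIds matched by this particular search.
        Set<Integer> hits = new HashSet<Integer>(Arrays.asList(2, 4));

        // One pass over the terms, keeping sort values only for matching docs.
        Map<Integer, Integer> sortValue = new TreeMap<Integer, Integer>();
        for (Map.Entry<Integer, int[]> e : termToDocs.entrySet()) {
            for (int docId : e.getValue()) {
                if (hits.contains(docId)) {
                    sortValue.put(docId, e.getKey());
                }
            }
        }
        System.out.println(sortValue); // {2=60731094, 4=60731093}
    }
}
```

The pass is linear in the number of terms plus postings for the field, regardless of result size, which is why the approach only pays off when result sets are small relative to the index.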