Re: Retrieving large numbers of documents from several disks in parallel

2011-12-27 Thread Erick Erickson
I'll take your word for it, though it seems odd. I'm wondering if there's anything you can do to pre-process the documents at index time to make the post-processing less painful, but that's a wild shot in the dark... Another possibility would be to fetch only the fields you need to do the post-pro

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-27 Thread Robert Bart
Erick, Thanks for your reply! You are probably right to question how many Documents we are retrieving. We know it isn't best, but significantly reducing that number will require us to completely rebuild our system. Before we do that, we were just wondering if there was anything in the Lucene API o

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-22 Thread Erick Erickson
I call into question why you "retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user". You have to be doing some post-processing because displaying 12,000 documents to the user is completely useless. I wonder if this is an "XY" problem

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-22 Thread Lance Norskog
Is each index optimized? From my vague grasp of Lucene file formats, I think you want to sort the documents by segment document id, which is the order of documents on the disk. This lets you materialize documents in their order on the disk. Solr (and other apps) generally use a separate thread p
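
A minimal sketch of the doc-id-ordering idea from this message, assuming Java 8 or later. It is not code from the thread: the class `DocIdOrderedFetch`, the method `fetchInDocIdOrder`, and the `loader` callback are hypothetical names. The loader stands in for whatever call actually materializes a document (for example, something wrapping `IndexSearcher.doc(int)`). Documents are loaded in ascending doc id order, then handed back in the original rank order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: not part of Lucene.
final class DocIdOrderedFetch {

    /**
     * Loads one document per entry of rankedDocIds. The loader is called in
     * ascending doc id order (assumed here to approximate on-disk order), but
     * the returned list follows the original rank order.
     */
    static <T> List<T> fetchInDocIdOrder(int[] rankedDocIds, IntFunction<T> loader) {
        int n = rankedDocIds.length;

        // positions[i] is an index into rankedDocIds; sort these indexes by doc id.
        Integer[] positions = new Integer[n];
        for (int i = 0; i < n; i++) {
            positions[i] = i;
        }
        Arrays.sort(positions, Comparator.comparingInt((Integer p) -> rankedDocIds[p]));

        // Pre-size the result so each document can be stored at its rank position.
        List<T> results = new ArrayList<>(Collections.<T>nCopies(n, null));
        for (int p : positions) {
            results.set(p, loader.apply(rankedDocIds[p]));
        }
        return results;
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19}; // doc ids in relevance order (made-up values)
        // The lambda is a stand-in loader; a real one would read from the index.
        List<String> docs = fetchInDocIdOrder(ranked, id -> "doc" + id);
        System.out.println(docs);
    }
}
```

Whether this helps in practice depends on how closely doc id order tracks physical layout on disk, which is the assumption made in the message above. With several indexes, the same helper could be applied per index so that each disk sees a roughly sequential read pattern.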

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-21 Thread Paul Libbrecht
Michael, from a physical point of view, it would seem like the order in which the documents are read is very significant for the reading speed (feel the random access jump as being the issue). You could: - move to ram-disk or ssd to make a difference? - use something different than a searcher w

Retrieving large numbers of documents from several disks in parallel

2011-12-21 Thread Robert Bart
Hi All, I am running Lucene 3.4 in an application that indexes about 1 billion factual assertions (Documents) from the web over four separate disks, so that each disk has a separate index of about 250 million documents. The Documents are relatively small, less than 1KB each. These indexes provide