Hi! I'm using Lucene 2.3.2 to store a relatively-large index of HTML documents. I'm storing ~150 million documents, taking up 150 GB of space.
I index the HTML text, but I only store primary key information that allows me to retrieve it later. Thus, my document size is small, but obviously, I need to store the index as well, and I imagine that's what takes up almost all of the space. Since I allow users to search this HTML specific to their user index, I create multiple indexes (~2000) such that a given user only has to search one of the 2000 indexes to get to their specific document. I also have queries that span all 2000 indexes. So, I have 2000 indexes full of small documents but relatively large indexing space. My question is what sort of disk to buy. Using "dstat", I've determined that the disk is clearly the bottleneck. Nearly all the time I spend indexing "chunks" of documents and committing them to disk is spent waiting on I/O operations. I spawn multiple threads to access the various index writers so as to minimize I/O wait time, but disk always ends up being the problem. Currently, I've got 7200rpm SATA drives (RAID 0), but I've also got 15k SAS drives (RAID 0 as well) on hand. My question is, what's the access pattern of Lucene when it comes to indexing documents, merging segments, and eventually optimizing them (given what I've mentioned about document count and document size)? Am I better off with a drive that has a faster seek time, or do I need to optimize for sustained throughput? How does the way in which Lucene lays indexes on disk affect this? If it helps, my merge factor is 50, and given that I run out of file descriptors otherwise, I use the compound file format. Thanks for your help, Matt -- View this message in context: http://www.nabble.com/Appropriate-disk-optimization-for-large-index--tp19009580p19009580.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]