So, my indexing is done in "rounds": I pull a batch of documents from the database, index them, and flush them to disk. I call flush() manually because I need to ensure that what's on disk matches what I've pulled from the database.

On each round, then, I flush to disk. I set the buffer so that no segments are flushed until I call flush() myself, so I incur the I/O only once per round.
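For concreteness, the setup looks roughly like this (a minimal sketch against the 2.3 API; the index path and the fetchRoundFromDatabase() helper are placeholders, not my actual code):

// Buffer everything in RAM; flush manually once per round (Lucene 2.3 API).
IndexWriter writer = new IndexWriter("/data/index-0001", new StandardAnalyzer());
writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); // never flush on doc count
writer.setRAMBufferSizeMB(512.0);  // sized so one round never triggers an auto-flush

// One "round": index a batch pulled from the database, then flush once.
List batch = fetchRoundFromDatabase();   // placeholder for my DB pull
for (int i = 0; i < batch.size(); i++) {
    writer.addDocument((Document) batch.get(i));
}
writer.flush();   // single burst of I/O; what's on disk now matches the pull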
Thanks for your help,
Matt


Otis Gospodnetic wrote:
>
> Matt,
>
> One important bit that you didn't mention is what your maxBufferedDocs /
> RAM buffer setting is. If it's too low, you will see lots of I/O.
> Increasing it means less I/O but more JVM heap. Is your disk I/O caused
> by searches, or by indexing only?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: mattspitz <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Saturday, August 16, 2008 4:07:52 AM
>> Subject: Appropriate disk optimization for large index?
>>
>> Hi! I'm using Lucene 2.3.2 to store a relatively large index of HTML
>> documents: ~150 million documents, taking up 150 GB of space.
>>
>> I index the HTML text, but I store only the primary-key information
>> that lets me retrieve the documents later. So each document is small,
>> and I imagine the index itself accounts for almost all of the space.
>>
>> Since users search only the HTML specific to their own user index, I
>> create multiple indexes (~2000) so that a given user only has to search
>> one of the 2000 indexes to reach their documents. I also have queries
>> that span all 2000 indexes.
>>
>> So, I have 2000 indexes full of small documents, but a relatively large
>> amount of index space overall.
>>
>> My question is what sort of disk to buy. Using "dstat", I've determined
>> that the disk is clearly the bottleneck: nearly all the time spent
>> indexing "chunks" of documents and committing them to disk is spent
>> waiting on I/O operations. I spawn multiple threads to drive the
>> various index writers so as to minimize I/O wait time, but the disk
>> always ends up being the problem.
>>
>> Currently I have 7200rpm SATA drives (RAID 0), but I also have 15k SAS
>> drives (also RAID 0) on hand.
>>
>> My question is: what is Lucene's disk-access pattern when indexing
>> documents, merging segments, and eventually optimizing them, given what
>> I've mentioned about document count and document size?
>>
>> Am I better off with a drive that has a faster seek time, or do I need
>> to optimize for sustained throughput? How does the way Lucene lays
>> indexes out on disk affect this?
>>
>> If it helps, my merge factor is 50, and I use the compound file format
>> because I run out of file descriptors otherwise.
>>
>> Thanks for your help,
>> Matt
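P.S. For anyone reading the archive, the writer settings mentioned in my original post map roughly to this (2.3 API; the path is a placeholder):

IndexWriter w = new IndexWriter("/data/user-index-0001", new StandardAnalyzer());
w.setMergeFactor(50);        // merge less often; more live segments between merges
w.setUseCompoundFile(true);  // one .cfs per segment keeps file descriptors bounded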