mattspitz wrote:

So, my indexing is done in "rounds", where I pull a bunch of documents from the database, index them, and flush them to disk. I manually call flush() because I need to ensure that what's on disk is consistent with what I've pulled from the database.

On each round, then, I flush to disk. I set the buffer so that no segments are flushed until I manually call flush(), so I incur I/O only once per "round".

Make sure that once you upgrade to 2.4 (or trunk) you switch to commit() instead of flush(): flush() doesn't sync the index files, so if the hardware or OS crashes, your index won't match what's in the DB and may even become corrupt.
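A minimal sketch of what that looks like on the 2.4 API; moreRounds() and fetchBatchFromDatabase() are hypothetical stand-ins for your database side:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    void indexInRounds() throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),
            new StandardAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);

        // Disable both auto-flush triggers so segments are written only
        // when we explicitly say so, once per round.
        writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);
        writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

        while (moreRounds()) {                            // hypothetical
            List<Document> batch = fetchBatchFromDatabase();  // hypothetical
            for (Document doc : batch) {
                writer.addDocument(doc);
            }
            // commit() flushes *and* syncs the new files; flush() alone
            // does not sync, which is the crash-safety gap above.
            writer.commit();
        }
        writer.close();
    }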

I'm not sure whether seek time or throughput is the better thing to optimize in your IO system. When flushing a segment you'd likely want the highest throughput, assuming the filesystem is able to assign many adjacent blocks to the files being flushed. During merging (and optimize) I think seek time matters most, because Lucene reads from 50 files at once (your mergeFactor) and writes to one or two. But normal merging typically runs concurrently with adding documents, so the time it takes may not show up in the net runtime of the overall indexing process. When a flush happens during a merge, seek time is likely what matters most.
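For reference, these are the merge knobs in play, continuing with the writer from the sketch above (ConcurrentMergeScheduler is already the default since 2.3; it's set explicitly here only to make the concurrency visible):

    import org.apache.lucene.index.ConcurrentMergeScheduler;

    // With mergeFactor=50, a merge reads from ~50 segment files at once,
    // which is why seek time dominates during merges.
    writer.setMergeFactor(50);

    // Merges run in background threads, so merge I/O overlaps with
    // ongoing addDocument() calls.
    writer.setMergeScheduler(new ConcurrentMergeScheduler());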

Mike
