mattspitz wrote:
So, my indexing is done in "rounds": I pull a bunch of documents from
the database, index them, and flush them to disk. I manually call
flush() because I need to ensure that what's on disk matches what I've
pulled from the database.
On each round, then, I flush to disk. I set the buffer so that no
segments are flushed until I manually call flush(), so that I incur the
I/O only once per round.
Make sure, once you upgrade to 2.4 (or trunk), that you switch to
commit() instead of flush(): flush() doesn't sync the index files, so
if the hardware or OS crashes, your index may not match what's in the
DB (and/or may become corrupt).
I'm not sure which of seek time vs throughput is best to optimize in
your IO system. On flushing a segment you'd likely want the fastest
throughput, assuming the filesystem is able to assign many adjacent
blocks to the files being flushed. During merging (and optimize) I
think seek time is most important, because Lucene reads from 50 (your
mergeFactor) files at once and then writes to one or two files. But,
this (at least normal merging) is typically done concurrently with
adding documents, so the time consumed may not matter in the net
runtime of the overall indexing process. When a flush happens during
a merge, seek time is likely most important.
Mike