mattspitz wrote:
So, my indexing is done in "rounds": I pull a bunch of documents from
the database, index them, and flush them to disk. I manually call
flush() because I need to ensure that what's on disk matches what I've
pulled from the database.
On each round, then, I flush to disk. I set the buffer so that no
segments are flushed until I manually call flush(), so that I incur the
I/O only once per round.
Make sure, once you upgrade to 2.4 (or trunk), that you switch to
commit() instead of flush(): flush() doesn't sync the index files, so
if the hardware or OS crashes, your index may not match what's in the
DB (and/or may become corrupt).
I'm not sure which of seek time vs throughput is best to optimize in
your IO system. On flushing a segment you'd likely want the fastest
throughput, assuming the filesystem is able to assign many adjacent
blocks to the files being flushed. During merging (and optimize) I
think seek time is most important, because Lucene reads from 50 (your
mergeFactor) files at once and then writes to one or two files. But,
this (at least normal merging) is typically done concurrently with
adding documents, so the time consumed may not matter in the net
runtime of the overall indexing process. When a flush happens during
a merge, seek time is likely most important.
Mike