Mike- Are the index files synced on writer.close()?
Thank you so much for your help. I think the seek time is the issue, especially considering the high merge factor and the fact that the segments are scattered all over the disk. Will a faster disk cache access affect the optimization and merging? I don't really have a sense for what of the segments are kept in memory during a merge. It doesn't make sense to me that Lucene would pull all of the segments into memory to merge them, but I don't really know how. Thank you so much, Matt Michael McCandless-2 wrote: > > > mattspitz wrote: > >> So, my indexing is done in "rounds", where I pull a bunch of >> documents from >> the database, index them, and flush them to disk. I manually call >> "flush()" >> because I need to ensure that what's on disk is accurate with what >> I've >> pulled from the database. >> >> On each round, then, I flush to disk. I set the buffer such that it >> doesn't >> flush any segments until I manually call flush(), so as to incur I/O >> only >> once each "round" > > Make sure once you upgrade to 2.4 (or trunk) that you switch to > commit() instead of flush() because flush() doesn't sync the index > files, so if the hardware or OS crashes your index will not match > what's in the DB (and/or may become corrupt). > > I'm not sure which of seek time vs throughput is best to optimize in > your IO system. On flushing a segment you'd likely want the fastest > throughput, assuming the filesystem is able to assign many adjacent > blocks to the files being flushed. During merging (and optimize) I > think seek time is most important, because Lucene reads from 50 (your > mergeFactor) files at once and then writes to one or two files. But, > this (at least normal merging) is typically done concurrently with > adding documents, so the time consumed may not matter in the net > runtime of the overall indexing process. When a flush happens during > a merge, seek time is likely most important. > > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Appropriate-disk-optimization-for-large-index--tp19009580p19040147.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]