Re: Appropriate disk optimization for large index?

mattspitz Mon, 18 Aug 2008 14:18:03 -0700

Mike-

Are the index files synced on writer.close()?


Thank you so much for your help.  I think the seek time is the issue,
especially considering the high merge factor and the fact that the segments
are scattered all over the disk.

Will a faster disk cache access affect the optimization and merging?  I
don't really have a sense for what of the segments are kept in memory during
a merge.  It doesn't make sense to me that Lucene would pull all of the
segments into memory to merge them, but I don't really know how.

Thank you so much,
Matt

Michael McCandless-2 wrote:
> 
> 
> mattspitz wrote:
> 
>> So, my indexing is done in "rounds", where I pull a bunch of  
>> documents from
>> the database, index them, and flush them to disk.  I manually call  
>> "flush()"
>> because I need to ensure that what's on disk is accurate with what  
>> I've
>> pulled from the database.
>>
>> On each round, then, I flush to disk.  I set the buffer such that it  
>> doesn't
>> flush any segments until I manually call flush(), so as to incur I/O  
>> only
>> once each "round"
> 
> Make sure once you upgrade to 2.4 (or trunk) that you switch to  
> commit() instead of flush() because flush() doesn't sync the index  
> files, so if the hardware or OS crashes your index will not match  
> what's in the DB (and/or may become corrupt).
> 
> I'm not sure which of seek time vs throughput is best to optimize in  
> your IO system.  On flushing a segment you'd likely want the fastest  
> throughput, assuming the filesystem is able to assign many adjacent  
> blocks to the files being flushed.  During merging (and optimize) I  
> think seek time is most important, because Lucene reads from 50 (your  
> mergeFactor) files at once and then writes to one or two files.  But,  
> this (at least normal merging) is typically done concurrently with  
> adding documents, so the time consumed may not matter in the net  
> runtime of the overall indexing process.  When a flush happens during  
> a merge, seek time is likely most important.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Appropriate-disk-optimization-for-large-index--tp19009580p19040147.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Appropriate disk optimization for large index?

Reply via email to