So, my indexing is done in "rounds": I pull a batch of documents from
the database, index them, and flush them to disk.  I call flush()
manually because I need to ensure that what's on disk is consistent
with what I've pulled from the database.

On each round, then, I flush to disk.  I've set the buffer so that no
segments are flushed until I manually call flush(), so I incur I/O only
once per "round".
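
Concretely, a round looks something like this -- a sketch against the
2.3 IndexWriter API, where fetchBatchFromDatabase() is just a stand-in
for my database pull:

    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // done once at startup: disable both auto-flush triggers, so
    // nothing hits the disk between my explicit flush() calls
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
    writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);

    // then, on each round:
    List<Document> batch = fetchBatchFromDatabase();  // stand-in
    for (Document doc : batch) {
        writer.addDocument(doc);  // buffered in RAM only
    }
    writer.flush();               // the round's single burst of I/O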

Thanks for your help,
Matt


Otis Gospodnetic wrote:
> 
> Matt,
> 
> One important bit that you didn't mention is your maxBufferedDocs
> setting.  If it's too low, you will see lots of IO.  Increasing it
> means less IO, but a larger JVM heap.  Is your disk IO caused by
> searches or by indexing only?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: mattspitz <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Saturday, August 16, 2008 4:07:52 AM
>> Subject: Appropriate disk optimization for large index?
>> 
>> 
>> Hi!  I'm using Lucene 2.3.2 to store a relatively large index of HTML
>> documents.  I'm storing ~150 million documents, taking up 150 GB of
>> space.
>> 
>> I index the HTML text, but I only store primary key information that
>> allows me to retrieve it later.  Thus, my document size is small, but
>> obviously, I need to store the index as well, and I imagine that's
>> what takes up almost all of the space.
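
(For reference, each document is built roughly like this -- the field
names and variables are made up:)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // only the primary key is stored, for later retrieval
    doc.add(new Field("pk", primaryKey,
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    // the HTML body is indexed for search, but never stored
    doc.add(new Field("html", htmlText,
                      Field.Store.NO, Field.Index.TOKENIZED));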
>> 
>> Since users search this HTML only within their own user index, I
>> create multiple indexes (~2000) such that a given user only has to
>> search one of the 2000 indexes to reach their documents.  I also have
>> queries that span all 2000 indexes.
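
(With the stock searchers, that looks roughly like the following -- the
path scheme and indexIdFor() are made up:)

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    // a user's query only has to open that user's own index...
    IndexSearcher single =
        new IndexSearcher("/indexes/" + indexIdFor(userId));

    // ...while a global query fans out across all 2000 indexes
    Searchable[] all = new Searchable[2000];
    for (int i = 0; i < all.length; i++) {
        all[i] = new IndexSearcher("/indexes/" + i);
    }
    MultiSearcher global = new MultiSearcher(all);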
>> 
>> So, I have 2000 indexes full of small documents, but a relatively
>> large amount of index data overall.
>> 
>> My question is what sort of disk to buy.  Using "dstat", I've
>> determined that the disk is clearly the bottleneck: nearly all the
>> time I spend indexing "chunks" of documents and committing them to
>> disk is spent waiting on I/O operations.  I spawn multiple threads to
>> drive the various index writers so as to minimize I/O wait time, but
>> the disk always ends up being the problem.
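
(The threading is nothing fancy -- roughly one task per writer/batch
pair on a fixed pool, where writers and batches are my own parallel
lists:)

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int i = 0; i < writers.size(); i++) {
        final IndexWriter w = writers.get(i);
        final List<Document> batch = batches.get(i);
        pool.submit(new Runnable() {
            public void run() {
                try {
                    for (Document d : batch) {
                        w.addDocument(d);
                    }
                    w.flush();  // one I/O burst per index
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
    pool.shutdown();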
>> 
>> Currently, I've got 7200rpm SATA drives (RAID 0), but I've also got
>> 15k SAS drives (RAID 0 as well) on hand.
>> 
>> Specifically, what is Lucene's disk access pattern when indexing
>> documents, merging segments, and eventually optimizing them, given
>> what I've mentioned about document count and document size?
>> 
>> Am I better off with a drive that has a faster seek time, or do I
>> need to optimize for sustained throughput?  How does the way Lucene
>> lays out its indexes on disk affect this?
>> 
>> If it helps, my merge factor is 50, and because I run out of file
>> descriptors otherwise, I use the compound file format.
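
(Those are just the standard writer settings:)

    writer.setMergeFactor(50);        // merge less often; more live segments
    writer.setUseCompoundFile(true);  // one .cfs per segment, far fewer
                                      // open file descriptors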
>> 
>> Thanks for your help,
>> Matt
