I should add that in my situation, the number of documents that fit in RAM is...er...problematic to determine. My current project is composed of books, and I chose to index each book as a single document.
Unfortunately, answering the question "how big is a book?" doesn't help much; they range from 2 pages to over 7,000 pages. So setting the various indexing parameters, especially maxBufferedDocs, is a hard balance between efficiency and memory. Will I happen to get a string of 100 large books? If so, I need to set the limit to a small number, which will not be terribly efficient for the "usual" case.

That said, I don't much care about efficiency in this case. I can't generate the index quickly (20,000+ books), and the factors I've chosen let me generate it between the time I leave work and the time I get back in the morning, so I don't really need much more tweaking.

But this illustrates why I referred to picking factors as a "guess". With a heterogeneous index where the documents vary widely in size, picking parameters isn't straightforward. My current parameters may not work if I index the documents in a different order than I am currently. I just don't know. They may not even work on the next set of data, since much of the data is OCR, and for many books it's pretty trashy and/or incomplete (imagine the OCR output of a genealogy book that consists entirely of a stylized tree with the names written by hand along the branches in many orientations!). We're promised much better OCR data in the next set of books we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes() has been added. It seems to me that I can now create a reasonably generalized way to index heterogeneous documents with "good enough" efficiency. I'm imagining keeping a few simple statistics, like the size of the incoming document and the change in index size that results from indexing it. This should allow me to figure out a reasonable factor for predicting how much the *next* addition will increase the index, and to flush RAM based upon that prediction, with (probably) quite a large safety margin; a rough sketch of the idea follows Mike's message below. I don't really care about squeezing out every last bit of efficiency in this case. What I *do* care about is that the indexing run completes, and this new capability seems to let me ensure that without penalizing the bulk of my indexing because of a few edge cases.

Anyway, thanks for adding this capability, which I'll probably use in the pretty near future. And thanks, Michael, for your explanation of what these factors really do. It may have been documented before, but this time it's finally sticking in my aging brain...

Erick

On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
"Erick Erickson" <[EMAIL PROTECTED]> wrote: > I haven't used it yet, but I've seen several references to > IndexWriter.ramSizeInBytes() and using it to control when the writer > flushes the RAM. This seems like a more deterministic way of > making things efficient than trying various combinations of > maxBufferedDocs , MergeFactor, etc, all of which are guesses > at best. I agree this is the most efficient way to flush. The one caveat is this Jira issue: http://issues.apache.org/jira/browse/LUCENE-845 which can cause over-merging if you make maxBufferedDocs too large. I think the rule of thumb to avoid this issue is 1) set maxBufferedDocs to be no more than 10X the "typical" number of docs you will flush, and then 2) flush by RAM usage. So for example if when you flush by RAM you typically flush "around" 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since it's far above 200-300 (so it won't trigger a flush when you didn't want it to) but it's also well below 10X your range of docs (so it won't tickle the above bug). Mike --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]