Mark Harwood wrote:


> I've been building a large index (hundreds of millions) of mainly structured data, consisting of several fields with mostly unique values. I've been hitting out-of-memory errors when doing periodic commits/closes, which I suspect is down to the sheer number of terms.
>
> I set IndexWriter.setTermIndexInterval to 8 times the default of 128 (an interval of 1024), which delayed the onset of the issue but still failed.

I think that setting won't change how much RAM is used when writing.
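
FWIW, termIndexInterval only changes how densely the term dictionary index (the .tii file) is sampled: readers load 1/interval of all terms into RAM, so a larger value mostly saves memory on the *reader* side, not in the writer's indexing buffer. A minimal sketch, assuming the Lucene 2.x/3.0-style setter API (the path is just a placeholder):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TermIndexIntervalSketch {
      public static void main(String[] args) throws Exception {
        // A larger termIndexInterval shrinks the in-memory term index that
        // readers load (at some cost in term-seek speed); it does not bound
        // the RAM the writer uses while buffering documents.
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),  // hypothetical path
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setTermIndexInterval(1024);  // default is 128
        writer.close();
      }
    }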

> I'd like to get a little more scientific about what to set here, rather than simply experimenting with settings and hoping it doesn't fail again.

> Does anyone have a decent model worked out for how much memory is consumed at peak? I'm guessing the contributing factors are:
>
> * Number of fields
> * Number of unique terms per field
> * Number of segments?

The net number of unique terms (across all fields) is a big driver, but so are the net number of term occurrences and the number of docs. Lots of tiny docs take more RAM than fewer large docs when the total number of occurrences is equal.
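
Rather than modelling this from first principles, you can also just watch the writer's buffer while you index. A rough sketch, reusing the writer from above (ramSizeInBytes() is the 2.x/3.0 accessor for buffered-docs RAM; 'docs' is a hypothetical iterable of Documents):

    // Print the RAM consumed by buffered docs/deletes every 100K docs, to
    // see how doc count and unique terms translate into actual memory use.
    int count = 0;
    for (Document doc : docs) {
      writer.addDocument(doc);
      if (++count % 100000 == 0) {
        System.out.println("buffered RAM: "
            + (writer.ramSizeInBytes() / (1024 * 1024)) + " MB");
      }
    }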

But... how come setting IW's RAM buffer doesn't prevent the OOMs? IW should simply flush once it has used that much RAM.
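
To be explicit, something like this should cap indexing RAM (2.3+ API; the 256 MB figure is just illustrative):

    // Flush when buffered docs reach ~256 MB instead of at a doc count;
    // DISABLE_AUTO_FLUSH turns off the doc-count trigger entirely.
    writer.setRAMBufferSizeMB(256.0);
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);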

I don't think number of segments is a factor.

Though mergeFactor is, since during merging the SegmentMerger holds a SegmentReader open, plus an int[] doc-ID map (if there are any deletes), for each segment being merged. Do you have a large merge taking place when you hit the OOMs?
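
If it is a merge, one knob to try is lowering mergeFactor so that fewer segments (and thus fewer SegmentReaders and delete maps) are held open per merge; the value below is just illustrative:

    // Merge fewer segments at a time; the default mergeFactor is 10.
    // Lower values reduce peak merge RAM at the cost of more merges.
    writer.setMergeFactor(4);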

Mike
