Is that 100MB for a single Lucene document? And is it 100MB in a single field? Is that field analyzed text? How complex is the analyzer? For example, does it produce ngrams or do anything else that is token- or memory-intensive? Posting the analyzer might help us see what the issue is.

Try indexing only one document at a time. It may be that GC is triggered by activity on one stream while the parallel streams are still trying to index during the collection.

Alternatively, try running with a much smaller heap, since a large heap means GC will take longer.

You might consider a strategy where only one large document can be processed at a time - have other threads pause if a large document is currently being processed or maybe allow only a few large documents to be processed at the same time.
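As a rough illustration of the "only a few large documents at the same time" variant, here is a minimal sketch using a plain java.util.concurrent.Semaphore. The class name LargeDocGate, the size threshold, and the Supplier-based task are all hypothetical, not anything from Lucene. Small documents pass straight through; making every other thread pause while a large document is in flight would need something more like a read/write lock.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps how many "large" documents are indexed at once. */
final class LargeDocGate {
    private final long largeThresholdBytes; // assumed cutoff, tune for your data
    private final Semaphore largePermits;

    LargeDocGate(long largeThresholdBytes, int maxConcurrentLarge) {
        this.largeThresholdBytes = largeThresholdBytes;
        // Fair semaphore so waiting large documents are admitted in arrival order.
        this.largePermits = new Semaphore(maxConcurrentLarge, true);
    }

    boolean isLarge(long sizeBytes) {
        return sizeBytes >= largeThresholdBytes;
    }

    /**
     * Runs the indexing task. Small documents run immediately; large ones
     * wait for a permit first. Real Lucene calls throw IOException, so an
     * actual task would need to wrap or rethrow it.
     */
    <T> T run(long sizeBytes, Supplier<T> indexTask) {
        if (!isLarge(sizeBytes)) {
            return indexTask.get();
        }
        largePermits.acquireUninterruptibly();
        try {
            return indexTask.get();
        } finally {
            largePermits.release(); // always return the permit, even on failure
        }
    }

    int availableLargePermits() {
        return largePermits.availablePermits();
    }
}
```

Each indexing thread would call something like gate.run(docSizeInBytes, task) around its addDocument call, with the permit count set to 1 for strictly one large document at a time.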

What is your average document size? I mean, are the large documents a rarity, so that the above strategy would be reasonable, or do you need to process large numbers of large documents?

-- Jack Krupansky

-----Original Message-----
From: ryanb
Sent: Tuesday, November 25, 2014 7:39 PM
To: java-user@lucene.apache.org
Subject: OutOfMemoryError indexing large documents

Hello,

We use vanilla Lucene 4.9.0 on a 64-bit Linux OS. We sometimes need to index
large documents (100+ MB), but this results in extremely high memory usage,
to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20
documents to be indexed simultaneously, but the text to be analyzed and
indexed is streamed, not loaded into memory all at once.

Any suggestions for how to troubleshoot or ideas about the problem are
greatly appreciated!

Some details about our setup (let me know what other information will help):
- Use MMapDirectory wrapped in an NRTCachingDirectory
- RamBufferSize 64MB
- No compound files
- We commit every 20 seconds

Thanks,
Ryan



--
View this message in context: http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

