Is that 100MB for a single Lucene document? And is that 100MB for a single
field? Is that field analyzed text? How complex is the analyzer? For example,
does it produce ngrams or do anything else that is token- or memory-intensive?
Posting the analyzer might help us see what the issue is.
Try indexing only one document at a time - maybe GC is being triggered by
activity on one stream while the parallel streams are still trying to
index.
Alternatively, try running with a much smaller heap, since a large heap
means GC will take longer.
You might consider a strategy where only one large document can be processed
at a time - have other threads pause if a large document is currently being
processed or maybe allow only a few large documents to be processed at the
same time.
What is your average document size? I mean, are the large documents a rarity,
so that the above strategy would be reasonable, or do you need to process
large numbers of large documents?
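
As a rough illustration of the "only a few large documents at a time"
variant, here is a minimal sketch using a plain java.util.concurrent
Semaphore. The class name LargeDocGate, the size threshold, and the permit
count are all made up for the example and would need tuning; small documents
are not gated at all here, so this does not implement the stricter variant
where every other thread pauses. The Runnable stands in for whatever code
calls IndexWriter.addDocument in your application.

```java
import java.util.concurrent.Semaphore;

/**
 * Sketch of a gate that limits how many "large" documents are indexed
 * concurrently. The name, threshold, and permit count are illustrative
 * assumptions, not values taken from this thread.
 */
class LargeDocGate {
    private final Semaphore largeDocPermits;
    private final long largeDocThresholdBytes;

    LargeDocGate(int maxConcurrentLargeDocs, long largeDocThresholdBytes) {
        // Fair semaphore: waiting threads acquire permits in arrival order.
        this.largeDocPermits = new Semaphore(maxConcurrentLargeDocs, true);
        this.largeDocThresholdBytes = largeDocThresholdBytes;
    }

    /** A document counts as large when it reaches the configured threshold. */
    boolean isLarge(long docSizeBytes) {
        return docSizeBytes >= largeDocThresholdBytes;
    }

    /**
     * Runs indexAction, first taking a permit if the document is large.
     * Small documents run immediately without waiting.
     */
    void index(long docSizeBytes, Runnable indexAction) throws InterruptedException {
        if (!isLarge(docSizeBytes)) {
            indexAction.run();
            return;
        }
        largeDocPermits.acquire(); // blocks while too many large docs are in flight
        try {
            indexAction.run();
        } finally {
            largeDocPermits.release(); // always return the permit, even on failure
        }
    }

    /** Number of large-document slots currently free. */
    int availableLargeSlots() {
        return largeDocPermits.availablePermits();
    }
}
```

Each of the 20 indexing threads would call gate.index(size, action) on one
shared gate. Setting the permit count to 1 gives the "one large document at
a time" behavior.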
-- Jack Krupansky
-----Original Message-----
From: ryanb
Sent: Tuesday, November 25, 2014 7:39 PM
To: java-user@lucene.apache.org
Subject: OutOfMemoryError indexing large documents
Hello,
We use vanilla Lucene 4.9.0 in a 64-bit Linux OS. We sometimes need to index
large documents (100+ MB), but this results in extremely high memory usage,
to the point of OutOfMemoryError even with 17GB of heap. We allow up to 20
documents to be indexed simultaneously, but the text to be analyzed and
indexed is streamed, not loaded into memory all at once.
Any suggestions for how to troubleshoot or ideas about the problem are
greatly appreciated!
Some details about our setup (let me know what other information will help):
- Use MMapDirectory wrapped in an NRTCachingDirectory
- RamBufferSize 64MB
- No compound files
- We commit every 20 seconds
Thanks,
Ryan
--
View this message in context:
http://lucene.472066.n3.nabble.com/OutOfMemoryError-indexing-large-documents-tp4170983.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org