Hi,
I'm trying to upgrade POI from 5.2.0 to 5.2.1 for use with
Apache Tika 2.3.0, but I now see memory problems when processing
DOCX files with embedded images. This looks like a severe bug in POI
5.2.1 to me:
POI 5.2.1 changed XWPFPictureData#getChecksum to call
IOUtils.toByteArrayWithMaxLength with a default max length of 100MB
(XWPFPictureData#DEFAULT_MAX_IMAGE_SIZE). The implementation of that
method allocates a byte array of that size up front by instantiating an
UnsynchronizedByteArrayOutputStream with that max value.
The effect is that 100MB of heap memory is allocated, even if the
embedded image is quite small (less than 1MB in my case).
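To illustrate what I would expect instead, here is a minimal sketch of a bounded read that enforces the max length without allocating it up front. This is not POI's code: the class BoundedRead, the method readWithLimit and the 8KB chunk size are made up for illustration, and it only uses the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI.
final class BoundedRead {

    // Reads the whole stream but fails once more than maxLength bytes arrive.
    // The buffer starts small and grows with the data actually read.
    static byte[] readWithLimit(InputStream in, int maxLength) throws IOException {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must not be negative");
        }
        // Initial capacity is capped at 8KB (an assumed value), never maxLength itself.
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.min(maxLength, 8192));
        byte[] chunk = new byte[8192];
        long total = 0; // long, so the running count cannot overflow
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxLength) {
                throw new IOException("Stream exceeds max length of " + maxLength + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

With something like this, a 1MB image would need roughly the image size plus buffer growth overhead, while the 100MB limit is still enforced.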
Here's an exception stack trace where the code is called from Apache Tika:
Caused by: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:249)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
... 9 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.io.IOUtils.byteArray(IOUtils.java:338)
at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:205)
at org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:191)
at org.apache.poi.xwpf.usermodel.XWPFPictureData.getChecksum(XWPFPictureData.java:168)
at org.apache.poi.xwpf.usermodel.XWPFDocument.registerPackagePictureData(XWPFDocument.java:1460)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:264)
at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:169)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:145)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:63)
at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:224)
... 12 common frames omitted
IOUtils.toByteArrayWithMaxLength is also used in other places in the
code, so the problem might affect other calls as well.
Maybe the checksum could even be computed in a streaming fashion,
without loading the whole data into a byte array? There's already a
method for that:
org.apache.poi.util.IOUtils#calculateChecksum(java.io.InputStream).
That method wasn't used for this in earlier versions of POI either,
though, so it may be a separate topic and not a necessary change.
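For reference, this is roughly what I mean by a streaming checksum. It is only a sketch using the JDK's java.util.zip.CRC32, under the assumption that a CRC32 value is what is wanted here; the class StreamingChecksum and its method are hypothetical and not POI's implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper, not part of POI.
final class StreamingChecksum {

    // Feeds the stream to CRC32 chunk by chunk, so memory use stays
    // at the size of the chunk buffer regardless of the image size.
    static long checksum(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[8192]; // assumed chunk size
        int n;
        while ((n = in.read(chunk)) != -1) {
            crc.update(chunk, 0, n);
        }
        return crc.getValue();
    }
}
```

The result should equal the CRC32 of the complete byte array, so it could replace a byte-array-based checksum without changing the values.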
Thanks in advance for having a look!
Kind Regards,
Andreas