Hi,

I'm trying to upgrade POI from 5.2.0 to 5.2.1 for use with Apache Tika 2.3.0, but I now see memory problems when processing DOCX files with embedded images. This looks like a severe bug in POI 5.2.1 to me:

POI 5.2.1 changed XWPFPictureData#getChecksum to call IOUtils.toByteArrayWithMaxLength with a default max length of 100MB (XWPFPictureData#DEFAULT_MAX_IMAGE_SIZE). The implementation of that method allocates a byte array of that size by instantiating an UnsynchronizedByteArrayOutputStream with that max value.

The effect is that 100MB of heap memory is allocated, even if the embedded image is quite small (less than 1MB in my case).
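
To illustrate the alternative, here is a minimal sketch of a bounded read that enforces the max length while reading instead of using it as the initial buffer size. This is not POI's actual code: the class BoundedRead, the method readWithLimit and the 8192-byte chunk size are hypothetical names and values chosen for the example, and it uses only JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of POI.
class BoundedRead {

    /**
     * Reads the stream fully and fails if it holds more than maxLength bytes.
     * The limit is checked while reading, it is never used to size the buffer.
     */
    static byte[] readWithLimit(InputStream in, int maxLength) throws IOException {
        // Start with a small buffer; it grows only as data actually arrives.
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.min(maxLength, 8192));
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxLength) {
                throw new IOException("Stream exceeds max length of " + maxLength + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

With this approach, heap usage is proportional to the actual image size, and the 100MB value only acts as an upper bound.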

Here's an exception stack trace where the code is called from Apache Tika:

Caused by: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:249)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
        ... 9 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.commons.io.IOUtils.byteArray(IOUtils.java:338)
        at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
        at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:205)
        at org.apache.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:191)
        at org.apache.poi.xwpf.usermodel.XWPFPictureData.getChecksum(XWPFPictureData.java:168)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.registerPackagePictureData(XWPFDocument.java:1460)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:264)
        at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:169)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:145)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:63)
        at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:224)
        ... 12 common frames omitted

IOUtils.toByteArrayWithMaxLength is also used in other places in the code, so the problem might affect other callers as well.

Maybe the checksum could even be computed in a streaming fashion, without loading the whole data into a byte array? There is already a method for that: org.apache.poi.util.IOUtils#calculateChecksum(java.io.InputStream). However, earlier versions of POI did not use that method here either, so this may be a separate topic and not necessary to change now.
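
As a rough sketch of the streaming idea, the following computes a checksum chunk by chunk with a fixed-size buffer, so memory use does not depend on the image size. It assumes a CRC32 checksum and uses only JDK classes; the class StreamingChecksum and the method crc32Of are hypothetical names for this example and are not POI's implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; not part of POI.
class StreamingChecksum {

    /** Computes a CRC32 over the stream using a fixed 8 KB buffer. */
    static long crc32Of(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            // Feed only the bytes actually read in this iteration.
            crc.update(chunk, 0, n);
        }
        return crc.getValue();
    }
}
```

A caller would pass the picture part's input stream and close it afterwards, for example with try-with-resources.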

Thanks in advance for having a look!

Kind Regards,
Andreas


