[ https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662855#comment-16662855 ]
Tim Allison commented on TIKA-2765:
-----------------------------------

I think I agree with Nick on where to fix it. We may need to ask for a later exception from commons-compress? Perhaps stream copy what we can that’s intact and then open that per usual?

With more time on my hands, I’d want to modify the sax docx and sax pptx parsers to “guess” the opc parts if the rels or other major parts were in the truncated section.

> Regression extracting text from corrupted docx files
> ----------------------------------------------------
>
>                 Key: TIKA-2765
>                 URL: https://issues.apache.org/jira/browse/TIKA-2765
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.19.1
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>         Attachments: DX IMPORTADORA E EXPORTADORA LTDA.docx
>
>
> Tika-1.19.1 throws the following exception with some corrupt docx files (MS Word complains but fixes them) previously handled without problems by tika-1.18. Stacktrace below:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@79efa1ad
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
>     at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
>     at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
>     at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>     at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>     at javax.swing.TransferHandler.importData(Unknown Source)
>     at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
>     at java.awt.dnd.DropTarget.drop(Unknown Source)
>     at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
>     at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
>     at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
>     at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
>     at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
>     at java.awt.Component.dispatchEventImpl(Unknown Source)
>     at java.awt.Container.dispatchEventImpl(Unknown Source)
>     at java.awt.Component.dispatchEvent(Unknown Source)
>     at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
>     at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
>     at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
>     at java.awt.Container.dispatchEventImpl(Unknown Source)
>     at java.awt.Window.dispatchEventImpl(Unknown Source)
>     at java.awt.Component.dispatchEvent(Unknown Source)
>     at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
>     at java.awt.EventQueue.access$500(Unknown Source)
>     at java.awt.EventQueue$3.run(Unknown Source)
>     at java.awt.EventQueue$3.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
>     at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
>     at java.awt.EventQueue$4.run(Unknown Source)
>     at java.awt.EventQueue$4.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
>     at java.awt.EventQueue.dispatchEvent(Unknown Source)
>     at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
>     at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
>     at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
>     at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>     at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>     at java.awt.EventDispatchThread.run(Unknown Source)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
>     at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
>     at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
>     at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
>     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
>     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 43 more
> Caused by: java.io.EOFException
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
>     at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
>     at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
>     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
>     at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
>     ... 51 more{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
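
For illustration only, the "stream copy what we can that’s intact" idea from the comment could look roughly like the sketch below. It is not the fix adopted in Tika, POI, or commons-compress. It uses the JDK's `java.util.zip` classes as a stand-in for commons-compress' `ZipArchiveInputStream`, and the class name `TruncatedZipSalvager` and method `salvage` are hypothetical. Each entry is buffered fully before being written, so an entry that is cut off partway is dropped rather than copied in a damaged state.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: copies the readable entries of a truncated zip into a fresh zip. */
public class TruncatedZipSalvager {

    /**
     * Streams through the input and returns a new, well-formed zip that holds
     * every entry that could be read completely before the data ran out.
     */
    public static byte[] salvage(InputStream truncated) throws IOException {
        ByteArrayOutputStream salvaged = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(truncated);
             ZipOutputStream out = new ZipOutputStream(salvaged)) {
            byte[] buf = new byte[8192];
            try {
                ZipEntry entry;
                while ((entry = in.getNextEntry()) != null) {
                    // Buffer the whole entry first; a truncated entry throws
                    // before anything is written to the output zip.
                    ByteArrayOutputStream body = new ByteArrayOutputStream();
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        body.write(buf, 0, n);
                    }
                    out.putNextEntry(new ZipEntry(entry.getName()));
                    out.write(body.toByteArray());
                    out.closeEntry();
                }
            } catch (EOFException | ZipException e) {
                // The input ended or became unreadable mid-entry:
                // keep the entries copied so far and stop.
            }
        }
        return salvaged.toByteArray();
    }
}
```

The salvaged bytes might then be passed to `OPCPackage.open(InputStream)` as usual, although opening would presumably still fail if required parts such as the relationships were in the lost region, which is the case the second paragraph of the comment is concerned with.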