[ 
https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662855#comment-16662855
 ] 

Tim Allison commented on TIKA-2765:
-----------------------------------

I think I agree with Nick on where to fix it. We may need to ask for a later 
exception from commons-compress? Perhaps stream-copy whatever is still intact 
and then open that as usual?
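
To make the "stream copy what's intact" idea concrete, here is a rough sketch. It is only an illustration under assumptions: it uses the JDK's java.util.zip classes as a stand-in for commons-compress, and the class and method names (ZipSalvage, salvage) are hypothetical, not anything in Tika or POI. It reads entries one at a time, keeps each entry that can be read to its end, and stops at the first read failure, so a truncated tail is dropped.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika or POI.
final class ZipSalvage {

    /**
     * Copies every entry that can be read completely from a possibly
     * truncated zip stream into a new, well-formed zip held in memory.
     * Copying stops at the first entry that cannot be read in full.
     */
    static byte[] salvage(InputStream damaged) throws IOException {
        ByteArrayOutputStream repaired = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(damaged);
             ZipOutputStream out = new ZipOutputStream(repaired)) {
            byte[] buf = new byte[8192];
            while (true) {
                String name;
                // Buffer the whole entry first so that a partially
                // readable entry never reaches the output.
                ByteArrayOutputStream entryBytes = new ByteArrayOutputStream();
                try {
                    ZipEntry entry = in.getNextEntry();
                    if (entry == null) {
                        break; // no further entries
                    }
                    name = entry.getName();
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        entryBytes.write(buf, 0, n);
                    }
                } catch (IOException e) {
                    // Covers EOFException and ZipException:
                    // treat everything from here on as lost.
                    break;
                }
                out.putNextEntry(new ZipEntry(name));
                out.write(entryBytes.toByteArray());
                out.closeEntry();
            }
        }
        // The output stream is closed here, so the central directory is written.
        return repaired.toByteArray();
    }
}
```

The salvaged bytes could then be wrapped in a ByteArrayInputStream and passed to OPCPackage.open(InputStream) as usual. Whether POI can do anything useful with the result still depends on which parts survived the truncation.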

With more time on my hands, I’d want to modify the SAX docx and SAX pptx 
parsers to “guess” the OPC parts if the rels or other major parts were in the 
truncated section.
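
As a very small sketch of what such "guessing" could look like: if the relationship parts are gone, fall back to the part names that Word and PowerPoint conventionally write. This is an assumption-laden illustration, not the parsers' actual logic; OpcPartGuesser and guessMainPart are hypothetical names, and the OPC format does not require these part names, so a document may legitimately use others.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tika or POI.
final class OpcPartGuesser {

    // Conventional main-part names; an assumption, since the real
    // location is normally given by the package relationships.
    private static final List<String> CONVENTIONAL_MAIN_PARTS =
            Arrays.asList("word/document.xml", "ppt/presentation.xml");

    /** Returns the first conventional main part found among the zip entry names. */
    static Optional<String> guessMainPart(Collection<String> entryNames) {
        for (String candidate : CONVENTIONAL_MAIN_PARTS) {
            if (entryNames.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller would collect the entry names that survived, ask for a guess, and only fall back to this when the normal relationship lookup fails.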


> Regression extracting text from corrupted docx files
> ----------------------------------------------------
>
>                 Key: TIKA-2765
>                 URL: https://issues.apache.org/jira/browse/TIKA-2765
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.19.1
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>         Attachments: DX IMPORTADORA  E  EXPORTADORA  LTDA.docx
>
>
> Tika-1.19.1 throws the following exception with some corrupt docx files (MS 
> Word complains but fixes them) previously handled without problems by 
> tika-1.18. Stack trace below:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser@79efa1ad
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
> at 
> org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
> at 
> org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> at javax.swing.TransferHandler.importData(Unknown Source)
> at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
> at java.awt.dnd.DropTarget.drop(Unknown Source)
> at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
> at 
> sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown
>  Source)
> at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown 
> Source)
> at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
> at java.awt.Component.dispatchEventImpl(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
> at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
> at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Window.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
> at java.awt.EventQueue.access$500(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at 
> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
>  Source)
> at 
> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
>  Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at 
> java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown
>  Source)
> at java.awt.EventQueue.dispatchEvent(Unknown Source)
> at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.run(Unknown Source)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: 
> Could not open the specified zip entry source stream
> at 
> org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
> at 
> org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
> at 
> org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
> at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 43 more
> Caused by: java.io.EOFException
> at 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
> at 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
> at 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
> at 
> org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
> at 
> org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
> at 
> org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
> at 
> org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
> ... 51 more{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
