https://bz.apache.org/bugzilla/show_bug.cgi?id=57843
Bug ID: 57843
Summary: RuntimeException on extracting text from Word 97-2004 Document
Product: POI
Version: 3.12-dev
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
Assignee: [email protected]
Reporter: [email protected]
Created attachment 32674
--> https://bz.apache.org/bugzilla/attachment.cgi?id=32674&action=edit
failing document
I'm trying to parse this document via Tika. It appears to be a pretty vanilla
Word 97-era .doc, and it opens correctly in Word for Mac 2011.
The document is attached. It is already publicly posted, and I grant any rights
I have in it to the ASF. I should note that it is part of a publicly posted dump
of emails sent to and from former Florida Gov. Jeb Bush, so I don't hold
copyright over it.
This is the POI version of https://issues.apache.org/jira/browse/TIKA-1608
The stack trace looks like this:
$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more
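For anyone triaging this, the trace bottoms out in System.arraycopy called from
PAPFormattedDiskPage.getGrpprl, which suggests that an offset or length read from
the file points past the end of the page buffer. The following is only a rough
sketch of that failure mode, not POI's actual code: the class name
GrpprlCopySketch, both method names, the 512-byte page size, and the example
offset and length are all assumptions made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is NOT the POI implementation. */
final class GrpprlCopySketch {

    /** Trusts an offset and length taken from the file, as a naive parser might. */
    static byte[] copyUnchecked(byte[] page, int offset, int length) {
        byte[] out = new byte[length];
        // Throws an IndexOutOfBoundsException subtype if offset + length > page.length.
        System.arraycopy(page, offset, out, 0, length);
        return out;
    }

    /** Validates the range first; returns an empty array if it leaves the page. */
    static byte[] copyChecked(byte[] page, int offset, int length) {
        if (offset < 0 || length < 0 || offset > page.length - length) {
            return new byte[0];
        }
        return Arrays.copyOfRange(page, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] page = new byte[512]; // assumed page size for this sketch
        int offset = 500;            // made-up value standing in for bad file data
        int length = 20;             // 500 + 20 runs 8 bytes past the end

        try {
            copyUnchecked(page, offset, length);
            System.out.println("unchecked copy succeeded");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked copy failed: " + e.getClass().getName());
        }
        System.out.println("checked copy returned "
                + copyChecked(page, offset, length).length + " bytes");
    }
}
```

Whether returning an empty array, skipping the paragraph, or throwing a more
descriptive exception is the right behaviour for HWPF is a separate question;
the sketch only shows why an unvalidated range surfaces as the exception above.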