Hello,

I'm digging into a possibly corrupt MS Word (.doc) document, under
https://issues.apache.org/jira/browse/TIKA-1072

POI is throwing an exception inside Ole10Native.java:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
        at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
        at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
        at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

I don't understand the \u0001Ole10Native entry format, so I wanted to
ask you all whether 1) this entry looks corrupt (i.e. a bad document), or
2) POI might be mis-parsing the bytes.

Here's a hex dump of the 40 bytes:

00000000  24 00 00 00 02 00 01 01  00 0a 01 12 83 46 02 86  |$............F..|
00000010  3d 12 83 49 12 83 6c 12  83 42 12 82 73 12 82 69  |=..I..l..B..s..i|
00000020  12 82 6e 02 84 71 00 00                           |..n..q..|
00000028
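
For anyone who wants to poke at the bytes, here is a minimal sketch that
decodes the dump by hand. It assumes a layout that I have only guessed at
from the stack trace, not taken from the POI source or a format spec: a
4-byte little-endian size, a 2-byte flags field, two NUL-terminated
strings, then another 2-byte field. The class and method names are made
up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole10NativeDumpCheck {
    // The 40 bytes from the hex dump above.
    static final int[] DUMP = {
        0x24, 0x00, 0x00, 0x00, 0x02, 0x00, 0x01, 0x01,
        0x00, 0x0a, 0x01, 0x12, 0x83, 0x46, 0x02, 0x86,
        0x3d, 0x12, 0x83, 0x49, 0x12, 0x83, 0x6c, 0x12,
        0x83, 0x42, 0x12, 0x82, 0x73, 0x12, 0x82, 0x69,
        0x12, 0x82, 0x6e, 0x02, 0x84, 0x71, 0x00, 0x00
    };

    static byte[] dumpBytes() {
        byte[] data = new byte[DUMP.length];
        for (int i = 0; i < DUMP.length; i++) {
            data[i] = (byte) DUMP[i];
        }
        return data;
    }

    // Reads the first four bytes as a little-endian int.
    static int declaredSize(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // Offset of the first 0x00 byte at or after 'from', or -1 if none.
    static int indexOfNul(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = dumpBytes();
        System.out.println("declared size     = " + declaredSize(data));
        System.out.println("bytes after size  = " + (data.length - 4));

        // ASSUMED layout: flags at offset 4, first string starts at offset 6.
        int firstNul = indexOfNul(data, 6);
        int secondNul = indexOfNul(data, firstNul + 1);
        int nextField = secondNul + 1;
        System.out.println("first string ends  at offset " + firstNul);
        System.out.println("second string ends at offset " + secondNul);
        System.out.println("2-byte read at offset " + nextField
                + " fits: " + (nextField + 2 <= data.length));
    }
}
```

With the dump above, the declared size (0x24 = 36) equals the 36 bytes
that follow the size field, so the entry itself does not look truncated.
Under the assumed layout, the second NUL falls at offset 38, which leaves
a single byte for the next 2-byte read; that read would touch index 40,
the same index reported in the exception.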

Thanks,

Mike McCandless

http://blog.mikemccandless.com
