Hello,

I'm digging into a possibly corrupt MS Word (.doc) document, under https://issues.apache.org/jira/browse/TIKA-1072

POI is throwing an exception inside Ole10Native.java:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

I don't understand the \u0001Ole10Native entry format, so I wanted to ask you all whether 1) this looks corrupt (i.e. a bad document), or 2) it's possible POI is mis-parsing the bytes.

Here's a hex dump of the 40 bytes:

00000000  24 00 00 00 02 00 01 01  00 0a 01 12 83 46 02 86  |$............F..|
00000010  3d 12 83 49 12 83 6c 12  83 42 12 82 73 12 82 69  |=..I..l..B..s..i|
00000020  12 82 6e 02 84 71 00 00                           |..n..q..|
00000028

Thanks,

Mike McCandless
http://blog.mikemccandless.com
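
P.S. In case it helps anyone reproduce this without the original file, here is a minimal sketch that loads the 40 bytes from the dump and checks one thing only: whether the first four bytes, read as a little-endian int, match the number of bytes that follow. Treating those four bytes as a length prefix is my assumption about the entry layout, not something I have confirmed. The class name Ole10NativeLengthCheck and its helper methods are made up for this example and are not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole10NativeLengthCheck {

    // The 40 bytes copied from the hex dump in this message.
    static final byte[] DUMP = hex(
        "24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86 " +
        "3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69 " +
        "12 82 6e 02 84 71 00 00");

    // Hypothetical helper: turn a space-separated hex string into bytes.
    static byte[] hex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // Assumption: bytes 0..3 hold a little-endian length of the data that follows.
    static int declaredLength(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // Number of bytes actually present after the 4-byte prefix.
    static int remainingAfterPrefix(byte[] data) {
        return data.length - 4;
    }

    public static void main(String[] args) {
        System.out.println("declared=" + declaredLength(DUMP)
            + " remaining=" + remainingAfterPrefix(DUMP));
        // prints "declared=36 remaining=36"
    }
}
```

If that reading of the prefix is right, the stream is not truncated (0x24 = 36, and 4 + 36 = 40), which would suggest the failure comes from how the 36 bytes after the prefix are interpreted rather than from missing data.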