Michael McCandless created TIKA-987:
---------------------------------------
             Summary: Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
                 Key: TIKA-987
                 URL: https://issues.apache.org/jira/browse/TIKA-987
             Project: Tika
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 1.3

I have two Word docs, both containing the same drawing, but one has text added.

In one case (picture.doc) the extraction is correct: it contains only an embedded image.wmf, and when I view the image it's correct.

In the second case (picture_3.doc) the picture is extracted as "image" (no extension) and is 0 bytes, and there is an invalid character (mapped to the Unicode replacement char) inserted before the image:

{noformat}
<title/>
</head>
<body><p>�<img src="embedded:image1" alt="image1"/></p>
<p/>
<p/>
<p>vehicle </p>
{noformat}

(The text "vehicle" is extracted correctly, though.)

I dug a bit, and with the 2nd doc there is an embedded {SHAPE * MERGEFORMAT} field, which we invoke WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts the 0-byte no-extension image as well as the invalid character. With the first doc there is no field (at least not one that's handled by handleSpecialCharacterRuns...).

Otherwise I'm not sure how to fix this... it could be that something is going wrong in how POI parses the Pictures from PictureSource.
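
For anyone comparing the output of the two documents, here is a minimal sketch of a check for the two symptoms that are visible in the XHTML itself: the Unicode replacement character and an embedded image reference whose name has no extension. It is not part of Tika; the function name `find_symptoms` and the "no dot in the name" heuristic are assumptions made for illustration, and the 0-byte size of the extracted image cannot be seen from the XHTML, so it is not checked here.

```python
import re

# U+FFFD is the Unicode replacement character that shows up before the image.
REPLACEMENT_CHAR = "\ufffd"

# Matches <img ... src="embedded:NAME" ...> and captures NAME.
IMG_SRC = re.compile(r'<img\s[^>]*\bsrc="embedded:([^"]+)"')


def find_symptoms(xhtml: str) -> dict:
    """Report replacement chars and embedded image names lacking an extension.

    Hypothetical helper: treating "no dot in the name" as "no extension"
    is a heuristic assumed for this sketch.
    """
    names = IMG_SRC.findall(xhtml)
    return {
        "replacement_chars": xhtml.count(REPLACEMENT_CHAR),
        "images_without_extension": [n for n in names if "." not in n],
    }


# The body fragment quoted in this report for picture_3.doc.
sample = (
    '<body><p>\ufffd<img src="embedded:image1" alt="image1"/></p>\n'
    "<p/>\n<p/>\n<p>vehicle </p>"
)
report = find_symptoms(sample)
print(report)
```

Running the same check on the XHTML produced for picture.doc should make it easy to confirm that neither symptom appears there.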