[ https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419342#comment-13419342 ]
Michael McCandless commented on TIKA-956:
-----------------------------------------

Alas I don't see how to determine where an embedded document is actually inserted into the main document... but I know very little about the POI APIs. Does anyone have any hints/pointers here?

> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-956
>                 URL: https://issues.apache.org/jira/browse/TIKA-956
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Michael McCandless
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...? Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?
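
To make the ordering problem in the issue description concrete, here is a minimal, self-contained Java sketch. It does not use the Tika or POI APIs at all: `InlineEmbeddedSketch`, `Item`, `extractDeferred`, and `extractInline` are hypothetical names invented for illustration, and the two-item body is a toy loosely modeled on the description, not the real contents of the test file. It only contrasts a separate end-of-document pass over embedded docs with recursing at the point where each one is encountered. It also assumes the position of the embedded doc in the body is already known, which is exactly the open question in the comment.

```java
import java.util.ArrayList;
import java.util.List;

public class InlineEmbeddedSketch {

    // Hypothetical model of one body item: either plain text or the text of an embedded doc.
    public static final class Item {
        final String content;
        final boolean embedded;

        private Item(String content, boolean embedded) {
            this.content = content;
            this.embedded = embedded;
        }

        public static Item text(String content) {
            return new Item(content, false);
        }

        public static Item embedded(String content) {
            return new Item(content, true);
        }
    }

    // Deferred strategy: emit body text first, then visit embedded docs in a separate pass.
    public static List<String> extractDeferred(List<Item> body) {
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Item item : body) {
            if (item.embedded) {
                pending.add(item.content); // remembered, emitted only after the main text
            } else {
                out.add(item.content);
            }
        }
        out.addAll(pending);
        return out;
    }

    // Inline strategy: emit each embedded doc at the point where it is encountered.
    public static List<String> extractInline(List<Item> body) {
        List<String> out = new ArrayList<>();
        for (Item item : body) {
            out.add(item.content);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy body: an embedded PDF followed by trailing text.
        List<Item> body = List.of(
                Item.embedded("Wer wjelrwoierj..."),
                Item.text("Bye Bye"));

        System.out.println(extractDeferred(body)); // [Bye Bye, Wer wjelrwoierj...]
        System.out.println(extractInline(body));   // [Wer wjelrwoierj..., Bye Bye]
    }
}
```

The deferred variant reproduces the reported symptom (embedded text always lands at the end), while the inline variant matches the order seen when opening the document in Word.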