[ https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-956:
------------------------------------

    Attachment: TIKA-956.patch

New patch w/ Jukka's suggestion... I think it's ready.

> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-956
>                 URL: https://issues.apache.org/jira/browse/TIKA-956
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-956.patch, TIKA-956.patch, TIKA-956.patch
>
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...? Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?
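
For illustration only, the ordering difference described in the issue can be sketched with a toy document model. This is a minimal sketch, not the code in WordExtractor.java or in the attached patch: the class EmbeddedOrderSketch, its Node type, and the sample strings are all hypothetical, and no Tika or POI API is used. It contrasts a deferred pass (embedded docs visited at the end) with recursing at the point where the embedded doc is first encountered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types are Tika or POI classes. */
public class EmbeddedOrderSketch {

    /** A body element: either a run of text or an embedded document. */
    public static final class Node {
        final String text;          // non-null for plain text
        final List<Node> embedded;  // non-null for an embedded document

        private Node(String text, List<Node> embedded) {
            this.text = text;
            this.embedded = embedded;
        }

        public static Node text(String s) {
            return new Node(s, null);
        }

        public static Node embedded(Node... children) {
            return new Node(null, Arrays.asList(children));
        }
    }

    /** Deferred: emit all body text first, then visit embedded docs at the end. */
    public static List<String> extractDeferred(List<Node> body) {
        List<String> out = new ArrayList<>();
        List<Node> pending = new ArrayList<>();
        for (Node n : body) {
            if (n.embedded != null) {
                pending.add(n);       // remember it for the later pass
            } else {
                out.add(n.text);
            }
        }
        for (Node n : pending) {
            out.addAll(extractDeferred(n.embedded));
        }
        return out;
    }

    /** Inline: recurse into an embedded doc at the point it is encountered. */
    public static List<String> extractInline(List<Node> body) {
        List<String> out = new ArrayList<>();
        for (Node n : body) {
            if (n.embedded != null) {
                out.addAll(extractInline(n.embedded));
            } else {
                out.add(n.text);
            }
        }
        return out;
    }

    /** Assumed layout: the embedded PDF sits before the "Bye Bye" text. */
    public static List<Node> sampleBody() {
        return Arrays.asList(
            Node.embedded(Node.text("embedded PDF text")),
            Node.text("Bye Bye"));
    }

    public static void main(String[] args) {
        System.out.println(extractDeferred(sampleBody())); // [Bye Bye, embedded PDF text]
        System.out.println(extractInline(sampleBody()));   // [embedded PDF text, Bye Bye]
    }
}
```

In this toy model the deferred pass reproduces the reported symptom (embedded text always lands at the end), while the inline traversal keeps document order.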