[ https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-956:
------------------------------------

    Attachment: TIKA-956.patch

Patch w/ new test case, fixing Tika to leave a <div embedded="_NNNNNNN"> at the place where the embedded document actually occurs. I'm not sure if that's the best tag to produce...

> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-956
>                 URL: https://issues.apache.org/jira/browse/TIKA-956
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-956.patch, TIKA-956.patch
>
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...?  Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?
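
For illustration only, the ordering difference described in this issue can be sketched without any Tika or POI classes. Everything below is hypothetical: `EmbeddedOrderSketch`, `Item`, `appendAtEnd` and `inline` are invented names, the document body is modelled as a plain list, and placing the extracted text inside the `<div embedded="...">` element is an assumption (the patch may emit only a marker at that position). It is not the code in WordExtractor.java or in the attached patch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedOrderSketch {

    /** Hypothetical body item: either a text run or a reference to an embedded object. */
    public static final class Item {
        final String text;       // set for a text run, null otherwise
        final String embeddedId; // set for an embedded reference, null otherwise

        private Item(String text, String embeddedId) {
            this.text = text;
            this.embeddedId = embeddedId;
        }

        public static Item text(String t) { return new Item(t, null); }
        public static Item embedded(String id) { return new Item(null, id); }
    }

    /** Behaviour the issue describes: all body text first, embedded docs in a final pass. */
    public static String appendAtEnd(List<Item> body, Map<String, String> embeddedText) {
        StringBuilder out = new StringBuilder();
        for (Item item : body) {
            if (item.text != null) {
                out.append("<p>").append(item.text).append("</p>");
            }
        }
        for (String text : embeddedText.values()) {
            out.append("<div>").append(text).append("</div>");
        }
        return out.toString();
    }

    /** Alternative: emit the embedded content at the position where its reference occurs. */
    public static String inline(List<Item> body, Map<String, String> embeddedText) {
        StringBuilder out = new StringBuilder();
        for (Item item : body) {
            if (item.text != null) {
                out.append("<p>").append(item.text).append("</p>");
            } else {
                // Assumption: the extracted text is nested inside the marker div.
                out.append("<div embedded=\"").append(item.embeddedId).append("\">")
                   .append(embeddedText.get(item.embeddedId))
                   .append("</div>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Item> body = Arrays.asList(
            Item.text("Intro"), Item.embedded("_1"), Item.text("Bye Bye"));
        Map<String, String> embeddedText = new LinkedHashMap<>();
        embeddedText.put("_1", "embedded PDF text");

        // prints <p>Intro</p><p>Bye Bye</p><div>embedded PDF text</div>
        System.out.println(appendAtEnd(body, embeddedText));
        // prints <p>Intro</p><div embedded="_1">embedded PDF text</div><p>Bye Bye</p>
        System.out.println(inline(body, embeddedText));
    }
}
```

In the first output the "Bye Bye" run precedes the embedded text, which is the ordering reported above; in the second the embedded text appears where its reference sits in the body.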