[ https://issues.apache.org/jira/browse/TIKA-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-956:
------------------------------------
    Attachment: TIKA-956.patch

A wee bit of progress here, after much sleuthing around POI's sources, with a patch to POI, not to Tika.

With the attached patch, if you run org.apache.poi.hwpf.converter.WordToTextConverter on the attached word doc, you get this text output:

    Here is the pdf file: $$ embedded: _1402837031 $$ Bye Bye

The output shows where the embedded file appears in the text ...

I did this by adding a processEmbedded call to the "Embedded Object" case in processField (currently it just passes the "name" = separator.getPicOffset() as the argument).

This is obviously not committable ... but it at least demonstrates it's possible to show the location in the text where the embedded object occurs.

The challenge now is how to do something similar in Tika.

> Embedded docs in Word doc are not inlined (text is always added to the end)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-956
>                 URL: https://issues.apache.org/jira/browse/TIKA-956
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Michael McCandless
>         Attachments: TIKA-956.patch
>
>
> You can see this with the recently added testWORD_embedded_pdf.doc
> (for TIKA-948): the "Bye Bye" text comes before the "Wer
> wjelrwoierj..." text from the embedded PDF, opposite of what you see
> when you open the doc with Word.
> Yet, the thumbnail images do seem to be extracted at the right place
> (inlined).
> This is because WordExtractor.java has a separate pass at the end to
> visit the embedded docs.
> Would it be possible to recurse into an embedded doc at the point when
> it's first encountered instead...? Or maybe somehow correlate the
> images with their corresponding attachment (right now they are just
> named image1, image2, ...)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
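
The ordering difference discussed in the comment above can be illustrated with a small, self-contained sketch. This is not POI or Tika code: the class InlineEmbeddedSketch, the Run type, and the methods appendAtEnd, inline, and marker are all hypothetical names invented for illustration. Only the marker format mirrors the output quoted in the comment. The sketch assumes a document can be modelled as an ordered list of text runs and embedded-object references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in POI or Tika.
final class InlineEmbeddedSketch {

    // A run is either plain text or a reference to an embedded object.
    static final class Run {
        final String text;       // non-null for text runs
        final String embeddedId; // non-null for embedded-object runs

        private Run(String text, String embeddedId) {
            this.text = text;
            this.embeddedId = embeddedId;
        }

        static Run text(String s) { return new Run(s, null); }
        static Run embedded(String id) { return new Run(null, id); }
    }

    // Marker format copied from the output quoted in the comment.
    static String marker(String id) {
        return "$$ embedded: " + id + " $$";
    }

    // Behaviour described in the issue: embedded objects are visited
    // in a separate pass, so their output lands after all body text.
    static String appendAtEnd(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        List<String> deferred = new ArrayList<>();
        for (Run r : runs) {
            if (r.embeddedId != null) {
                deferred.add(r.embeddedId);
            } else {
                out.append(r.text).append('\n');
            }
        }
        for (String id : deferred) {
            out.append(marker(id)).append('\n');
        }
        return out.toString();
    }

    // Behaviour demonstrated by the patch: emit the marker at the point
    // where the embedded object is encountered in the text.
    static String inline(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run r : runs) {
            String line = (r.embeddedId != null) ? marker(r.embeddedId) : r.text;
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> doc = Arrays.asList(
                Run.text("Here is the pdf file:"),
                Run.embedded("_1402837031"),
                Run.text("Bye Bye"));
        System.out.print(inline(doc));
    }
}
```

In this model the only difference between the two strategies is when the embedded reference is handled. Doing the same in Tika would presumably mean replacing the marker with a recursive parse of the embedded document at that point, which is the open question raised in the comment.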