[ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075305#comment-14075305 ]
Hudson commented on TIKA-1374:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #116 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/116/])
TIKA-1374: Try to extract OS-specific embedded files within PDFs (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613501)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java

> Need to add code to look for OS-specific keys for embedded files within PDFs
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-1374
>                 URL: https://issues.apache.org/jira/browse/TIKA-1374
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.6
>
>
> Embedded files in PDFs can be found by the general, all-purpose key we
> currently use via PDFBox: "F". However, embedded documents can also be
> stored under OS-specific keys: "DOS", "Mac" and "Unix".
> [~lehmi] confirmed on the PDFBox users
> [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
> that we might be missing embedded documents if we're not trying the
> OS-specific keys as well. As Andreas points out, according to the spec the
> OS-specific keys shouldn't be used any more, but I think we should support
> extraction for them.
> My proposal is to pull all documents that are available by any of the four
> keys (well, via getEmbeddedFile<OS>() in PDFBox). This has the downside of
> potentially extracting near-duplicate documents, but I'd prefer to err
> on the side of extracting everything.
> The code fix is trivial, and I'll try to commit it today. However, it will
> take me a bit of time to generate a test file that stores files under the
> OS-specific keys. So, if anyone has an ASF-friendly file available or wants
> to take on the task of generating one, please do.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
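
The proposal quoted above (try the generic key and each OS-specific key, and keep everything that is found) can be sketched roughly as follows. This is an illustration only, not the code committed in r1613501: the class `EmbeddedFileKeys`, the method `collectAll`, and the use of a plain `Map` as a stand-in for a PDF file specification are all hypothetical, chosen so the example runs without PDFBox on the classpath. In real code the lookups would go through PDFBox's getEmbeddedFile<OS>() accessors mentioned in the issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for a PDF file specification,
// mapping a key ("F", "DOS", "Mac", "Unix") to the embedded file's bytes.
class EmbeddedFileKeys {

    // Generic key first, then the three OS-specific keys named in the issue.
    static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    // Returns every embedded file found under any of the four keys.
    // Duplicates are deliberately NOT filtered out, matching the stated
    // preference to err on the side of extracting everything.
    static List<byte[]> collectAll(Map<String, byte[]> fileSpec) {
        List<byte[]> found = new ArrayList<byte[]>();
        for (String key : KEYS) {
            byte[] data = fileSpec.get(key);
            if (data != null) {
                found.add(data);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, byte[]> spec = new HashMap<String, byte[]>();
        spec.put("F", new byte[] {1, 2, 3});
        spec.put("Unix", new byte[] {1, 2, 3}); // same content under a second key
        System.out.println(collectAll(spec).size() + " embedded file(s) found");
    }
}
```

Because nothing is de-duplicated, a document stored identically under "F" and "Unix" is returned twice, which is the trade-off the issue description accepts.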