[jira] [Commented] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

Hudson (JIRA) Sat, 26 Jul 2014 01:31:15 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075305#comment-14075305 ]


Hudson commented on TIKA-1374:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #116 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/116/])
TIKA-1374: Try to extract OS-specific embedded files within PDFs (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613501)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java


> Need to add code to look for OS-specific keys for embedded files within PDFs
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-1374
>                 URL: https://issues.apache.org/jira/browse/TIKA-1374
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.6
>
>
> Embedded files in PDFs can be found under the general, all-purpose key that 
> we currently use via PDFBox: "F". However, embedded documents can also be 
> stored under OS-specific keys: "DOS", "Mac" and "Unix".
> [~lehmi] confirmed on the PDFBox users 
> [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
> that we might be missing embedded documents if we don't try the OS-specific 
> keys as well. As Andreas points out, according to the spec the OS-specific 
> keys shouldn't be used any more, but I think we should support extraction 
> for them.
> My proposal is to pull all documents that are available under any of the four 
> keys (well, via getEmbeddedFile<OS>() in PDFBox). This has the downside of 
> potentially extracting essentially duplicate documents, but I'd prefer to err 
> on the side of extracting everything.
> The code fix is trivial, and I'll try to commit it today. However, it will 
> take me a bit of time to generate a test file that stores files under the 
> OS-specific keys. So, if anyone has an ASF-friendly file available or wants 
> to take on the task of generating one, please do.
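
The proposal quoted above (try all four keys and keep whatever is found, even if that yields duplicates) can be sketched roughly as follows. This is a minimal illustration only, not the committed change to PDF2XHTML.java: the class name OsSpecificKeys, the method collectEmbeddedFiles, and the use of a plain Map as a stand-in for a PDF file specification are all assumptions made so the example runs without PDFBox on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for a PDF file specification,
// mapping each key ("F", "DOS", "Mac", "Unix") to an embedded file's bytes.
public class OsSpecificKeys {

    // The general key plus the three OS-specific keys named in the issue.
    public static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    // Returns every embedded file stored under any of the four keys,
    // in KEYS order. Duplicates are deliberately kept, following the
    // "err on the side of extracting everything" proposal.
    public static List<byte[]> collectEmbeddedFiles(Map<String, byte[]> fileSpec) {
        List<byte[]> found = new ArrayList<>();
        for (String key : KEYS) {
            byte[] data = fileSpec.get(key);
            if (data != null) {
                found.add(data);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, byte[]> spec = new LinkedHashMap<>();
        spec.put("F", "general copy".getBytes(StandardCharsets.UTF_8));
        spec.put("Unix", "unix copy".getBytes(StandardCharsets.UTF_8));
        // Both entries are returned; a lookup under "F" alone would miss the second.
        System.out.println(collectEmbeddedFiles(spec).size());
    }
}
```

In the real parser the per-key lookups would presumably go through the getEmbeddedFile<OS>() accessors that the issue mentions, with a null check on each result; de-duplication is intentionally left out here, matching the trade-off described in the issue.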



--
This message was sent by Atlassian JIRA
(v6.2#6252)
