[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

Tim Allison (JIRA) Mon, 15 Sep 2014 05:36:46 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133841#comment-14133841
 ]


Tim Allison commented on TIKA-1396:
-----------------------------------

Now that we are using PDFBox 1.8.6, we might consider changing the default 
behavior for PDFs to extract inline images.  In TIKA-1294, we set the 
default to "don't extract inline images" because of performance issues 
with PDFBox 1.8.5.

The upside: Tika will behave as expected, extracting embedded images from PDFs 
just as it extracts attachments from other document types.  Users who expect 
that behavior won't be surprised.

The downside: PDF images might not be useful to, or wanted by, some users.  In 
at least one PDF in govdocs1, each row in a table is its own image.  My takeaway 
from this one rare example is that crazy things can happen with images in PDFs. 
Even with the modifications made in PDFBox 1.8.6, extraction can still take 
considerable time and resources.  Enterprise-scale users will be surprised by 
the sudden performance degradation if we make this change.

One response: basic users should get what they expect by default.  
Enterprise-scale users should know what they're doing and be able to turn this 
off if they don't want it.
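
As a rough sketch of that "on by default, with an opt-out" idea: the class and 
method names below (PdfImageSettings, defaults, withExtractInlineImages) are 
hypothetical stand-ins for illustration only, not Tika's actual configuration 
API.  Basic users take the default; enterprise-scale users pass an explicit 
override.

```java
/** Hypothetical per-parse setting; a stand-in for illustration, not Tika's API. */
final class PdfImageSettings {
    // Assumed default under the proposal: extract inline images unless told otherwise.
    static final boolean DEFAULT_EXTRACT_INLINE_IMAGES = true;

    // null means "no explicit choice was made, fall back to the default".
    private final Boolean override;

    private PdfImageSettings(Boolean override) {
        this.override = override;
    }

    /** What a basic user gets without configuring anything. */
    static PdfImageSettings defaults() {
        return new PdfImageSettings(null);
    }

    /** What a user who cares about throughput would pass explicitly. */
    static PdfImageSettings withExtractInlineImages(boolean value) {
        return new PdfImageSettings(value);
    }

    /** Resolves the effective setting: explicit override first, then the default. */
    boolean extractInlineImages() {
        return override != null ? override : DEFAULT_EXTRACT_INLINE_IMAGES;
    }

    public static void main(String[] args) {
        System.out.println("default: " + defaults().extractInlineImages());
        System.out.println("opt-out: " + withExtractInlineImages(false).extractInlineImages());
    }
}
```

Keeping "no choice made" distinct from an explicit true/false means the default 
can change between releases without silently overriding users who already 
chose a value.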

Thoughts?

> Embedded images in PDF documents
> --------------------------------
>
>                 Key: TIKA-1396
>                 URL: https://issues.apache.org/jira/browse/TIKA-1396
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>            Reporter: Damiano
>            Priority: Critical
>             Fix For: 1.6
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? Because if I do the same operation with a DOC 
> document, Tika finds the image correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
