[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

Tim Allison (JIRA) Wed, 24 Sep 2014 04:45:53 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146230#comment-14146230
 ]


Tim Allison commented on TIKA-1396:
-----------------------------------

Ah, OK.  Yes, please open another issue.  I should also add meta tags to the 
RTFParser while I'm at it.  Is the model I should follow the one used by the 
Microsoft parsers?

{noformat}
AttributesImpl attributes = new AttributesImpl();
attributes.addAttribute("", "class", "class", "CDATA", "embedded");
attributes.addAttribute("", "id", "id", "CDATA", id);
xhtml.startElement("div", attributes);
xhtml.endElement("div");
{noformat}
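
To see roughly what that pattern puts into the output, here is a minimal standalone sketch.  It is not the parser code: the class name, the helper names and the "image1.png" id are made up for illustration, and a plain JDK TransformerHandler stands in for the xhtml handler from the snippet above, so it uses the standard three-name SAX startElement/endElement calls.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class EmbeddedMarkerSketch {

    // Emits an empty <div> carrying class="embedded" and the given id,
    // i.e. a placeholder marking where an embedded resource belongs.
    static void writeEmbeddedMarker(ContentHandler handler, String id) throws SAXException {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "class", "class", "CDATA", "embedded");
        attributes.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement("", "div", "div", attributes);
        handler.endElement("", "div", "div");
    }

    // Serializes the marker to a string using only JDK classes
    // (an identity TransformerHandler acting as the SAX sink).
    static String render(String id) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        handler.startDocument();
        writeEmbeddedMarker(handler, id);
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // "image1.png" is a hypothetical id for an embedded image.
        System.out.println(render("image1.png"));
    }
}
```

A consumer can then match the div by its class and use the id to pair the placeholder with the extracted file.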

For the PDFParser, inline images are extracted at the "bottom" of each page 
rather than at their actual coordinates, and regular attachments are 
extracted at the end of the document.  Will this wreck your processing?

> Embedded images in PDF documents
> --------------------------------
>
>                 Key: TIKA-1396
>                 URL: https://issues.apache.org/jira/browse/TIKA-1396
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>            Reporter: Damiano
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC 
> document, Tika finds the image correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
