[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Tim Allison (JIRA) Fri, 09 Jan 2015 07:42:00 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271201#comment-14271201 ]


Tim Allison commented on TIKA-1445:
-----------------------------------

No major problems found via a quick-and-dirty govdocs1 eval.

Let's roll!


Better:
Fewer pdf exceptions, better pdf text extraction (thank you, [~tilman]!)

"fixed exceptions": 2426 xls, 895 ppt, 158 pdf, 17 pps and 5 doc 

Note: "fixed exceptions" for xls are driven entirely by [~gagravarr]'s addition 
of parsing for xls .4.  Thank you, Nick!!!

More attachments for 27 pdf and 1 doc

More metadata values for all comparable file pairs (pairs with no exceptions 
and an equal number of attachments)

Areas for investigation:
"new exceptions" 27 xls
173 exceptions for newly added parsing of vnd.ms.excel.sheet.3
Fewer attachments for 19 ppt, 6 doc and 1 rtf
Permanent hangs/OOM: the counts differ from run to run because of 
multi-threading, but we went from 4 to 3.
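
For readers unfamiliar with the categories above: "fixed exceptions" and "new 
exceptions" come from comparing the same file across the two runs. Below is a 
minimal sketch of that bucketing, assuming each run is reduced to a map from 
file name to "did parsing throw?". The class, method and file names are 
hypothetical and are not taken from the actual eval code.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Tika.
class ExceptionDiff {

    /** Outcome of comparing one file across the two runs. */
    enum Change { FIXED_EXCEPTION, NEW_EXCEPTION, UNCHANGED }

    // threwBefore / threwAfter: true if parsing the file threw in that run.
    static Change classify(boolean threwBefore, boolean threwAfter) {
        if (threwBefore && !threwAfter) return Change.FIXED_EXCEPTION;
        if (!threwBefore && threwAfter) return Change.NEW_EXCEPTION;
        return Change.UNCHANGED;
    }

    // Counts files per change type. Only files present in both runs are
    // compared; files missing from the second run are skipped.
    static Map<Change, Integer> tally(Map<String, Boolean> before,
                                      Map<String, Boolean> after) {
        Map<Change, Integer> counts = new EnumMap<>(Change.class);
        for (Map.Entry<String, Boolean> e : before.entrySet()) {
            Boolean threwAfter = after.get(e.getKey());
            if (threwAfter == null) continue;
            counts.merge(classify(e.getValue(), threwAfter), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up file names and outcomes, purely to exercise the code.
        Map<String, Boolean> before = new LinkedHashMap<>();
        before.put("a.xls", true);
        before.put("b.pdf", false);
        before.put("c.ppt", false);

        Map<String, Boolean> after = new LinkedHashMap<>();
        after.put("a.xls", false);
        after.put("b.pdf", true);
        after.put("c.ppt", false);

        System.out.println(tally(before, after));
    }
}
```

Grouping the resulting counts by file extension would give per-type figures 
of the kind listed above.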


I'll follow up with investigation of these issues and open appropriate tickets 
next week.


> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Blocker
>             Fix For: 1.7
>
>         Attachments: 000003.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
