[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271201#comment-14271201 ]
Tim Allison commented on TIKA-1445:
-----------------------------------

No major problems found via quick and dirty govdocs1 eval. Let's roll!

Better:
Fewer pdf exceptions, better pdf text extraction (thank you, [~tilman]!)
"fixed exceptions": 2426 xls, 895 ppt, 158 pdf, 17 pps and 5 doc
Note: "fixed exceptions" for xls are driven entirely by [~gagravarr]'s addition of parsing for xls .4. Thank you, Nick!!!
More attachments for 27 pdf and 1 doc
More metadata values for all comparable file pairs (no exceptions, = number of attachments)

Areas for investigation:
"new exceptions" 27 xls
173 exceptions for newly added parsing of vnd.ms.excel.sheet.3
Fewer attachments for 19 ppt, 6 doc and 1 rtf
Permanent hangs/oom. These numbers differ by run because of multi-threading, but we went from 4 to 3.

I'll follow up with investigation of these issues and open appropriate tickets next week.

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Blocker
>             Fix For: 1.7
>
>         Attachments: 000003.doc, TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch,
> TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch,
> TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)