[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588
 ] 

Tim Allison edited comment on TIKA-1419 at 9/23/14 1:23 AM:
------------------------------------------------------------

I just finished the run on 50,000 random PDFs from govdocs1.  With the move to 
PDFBox 1.8.7, we've gone from 53 exceptions down to 32.  In manually reviewing 
the handful of docs with a token overlap < 0.80, I found quite a few 
improvements.  It also looks like there may be some regressions in character 
mapping in several of the files.  I'll submit issues for these over on PDFBox.  
Unless there are objections, I'll bump Tika to PDFBox 1.8.7.

Unfortunately, the individual file links don't seem to be working today on the 
govdocs1 site.

In the attached CSV file, I've included those files that had exceptions in 
either 1.8.6 or 1.8.7, or that have < 99% token overlap between the two 
versions of PDFBox.
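
The comment does not define "token overlap", so the following is only a rough 
sketch of one plausible reading: a Dice-style overlap over whitespace-separated 
tokens, with counts.  The function names, the tokenization, and the formula 
are all assumptions made for illustration; they are not taken from the actual 
comparison code used for this run.  Only the 0.99 cutoff and the "exception in 
either version" rule come from the description above.

```python
from collections import Counter


def token_overlap(text_a: str, text_b: str) -> float:
    """Dice-style overlap of whitespace tokens, in [0, 1].

    Assumption: tokens are whitespace-separated and counted with
    multiplicity. The real comparison may tokenize differently.
    """
    a = Counter(text_a.split())
    b = Counter(text_b.split())
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Two empty extractions are treated as identical.
        return 1.0
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((a & b).values())
    return 2.0 * shared / total


def should_report(overlap: float, had_exception: bool,
                  threshold: float = 0.99) -> bool:
    """Hypothetical filter mirroring the CSV inclusion rule described above."""
    return had_exception or overlap < threshold
```

Under these assumptions, a file whose text extracted by 1.8.6 and by 1.8.7 
shares three of four tokens scores 0.75 and would be listed, while identical 
output with no exception in either version would not.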


> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
