[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588 ]
Tim Allison edited comment on TIKA-1419 at 9/23/14 1:23 AM:
------------------------------------------------------------

I just finished the run on 50,000 random PDFs from govdocs1. With the move to PDFBox 1.8.7, we've gone from 53 exceptions down to 32. In manually reviewing the handful of docs with a token overlap < 0.80, there are quite a few improvements. It also looks like there may be some regressions in character mapping in several of the files. I'll submit issues for these over on PDFBox.

Unless there are objections, I'll bump Tika to PDFBox 1.8.7.

Unfortunately, the individual file links don't seem to be working today on the govdocs1 site. In the attached CSV file, I've included those files that had exceptions in either 1.8.6 or 1.8.7 or have < 99% token overlap between the two versions of PDFBox.

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major
> regressions are found.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
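
For readers unfamiliar with the token-overlap comparison mentioned in the comment, here is a minimal sketch of how such a check might work. The comment does not say how overlap was computed, so the definition used here (shared tokens counted as a multiset, divided by the larger token count) is an assumption, as are the tokenization rule and the function names `tokenize`, `token_overlap`, and `needs_review`. This is not the script used for the govdocs1 run.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, then split on runs of non-word characters.
    return re.findall(r"\w+", text.lower())


def token_overlap(text_a, text_b):
    """Shared tokens (as a multiset) divided by the larger token count.

    This definition is an assumption; the original comment does not give one.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    if total == 0:
        return 1.0  # both extractions are empty: treat them as identical
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    return shared / total


def needs_review(text_a, text_b, failed_a=False, failed_b=False, threshold=0.99):
    # Mirrors the filter described for the attached CSV: keep a file if either
    # version threw an exception or the overlap falls below the threshold.
    return failed_a or failed_b or token_overlap(text_a, text_b) < threshold
```

With this definition, two extractions that differ in one token out of four score 0.75 and would be flagged both by the 99% CSV filter and by the stricter 0.80 cutoff used for manual review.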