[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1419:
----------------------------------
    Attachment: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx

Here's an Excel file. In the new column on the right, I noted which files
improved after the three related PDFBox issues above were solved. I mostly
tested the files that had fewer tokens. I also tested a few that had more
tokens; there the results are inconclusive. Some have improved, and some had
more tokens due to a regression that has since been solved.

Would it be possible, next time, to test with the same set of files, and to
compare 1.8.8 against 1.8.6 rather than against 1.8.7? The reason is that if
there is an unknown regression in 1.8.7 that isn't solved in 1.8.8, then 1.8.8
would look as if it had the same quality as before, when it actually does not.
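
To illustrate the point about the baseline, here is a minimal sketch in
Python. The file names and token counts are invented for illustration and are
not taken from the attached spreadsheet, and files_with_fewer_tokens is a
hypothetical helper. A drop in token count is used only as a rough stand-in
for a regression.

```python
# Hypothetical per-file token counts; names and numbers are made up
# for illustration and do not come from the attached spreadsheet.
tokens = {
    "1.8.6": {"a.pdf": 1000, "b.pdf": 500, "c.pdf": 750},
    "1.8.7": {"a.pdf": 1000, "b.pdf": 120, "c.pdf": 750},  # b.pdf regressed
    "1.8.8": {"a.pdf": 1000, "b.pdf": 120, "c.pdf": 800},  # b.pdf still regressed
}


def files_with_fewer_tokens(baseline, candidate):
    """Return the files whose token count dropped from baseline to candidate."""
    return sorted(
        name
        for name, count in baseline.items()
        if candidate.get(name, 0) < count
    )


# Comparing against the previous release hides the unsolved regression.
print(files_with_fewer_tokens(tokens["1.8.7"], tokens["1.8.8"]))  # []

# Comparing against the older, fixed baseline exposes it.
print(files_with_fewer_tokens(tokens["1.8.6"], tokens["1.8.8"]))  # ['b.pdf']
```

With these made-up numbers, 1.8.8 looks clean against 1.8.7, but the same run
against 1.8.6 still flags b.pdf.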

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, 
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
