[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152855#comment-14152855 ]
Tilman Hausherr commented on TIKA-1419:
---------------------------------------

Comparing PDFBox's trunk against 1.8.x periodically would make sense, of course. There's a comment in PDFBOX-2377, "the current trunk extracts nothing but rubbish from 705042.pdf", so this makes me wonder what else has been "lost" in the trunk.

Re checking 1.8.8 v. 1.8.6 - if it isn't too much work, as soon as you have the time, even if there isn't a new release planned now.

The regression you found is very embarrassing, and it is the first time I realize that a wrong decision in the recognition of inline images (detecting whether "EI" is within an image or is the end of the image) results in cut-off text extraction.

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv,
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx
>
>
> Will run against govdocs1 early next week and then upgrade if no major
> regressions are found.