[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225867#comment-14225867
 ] 

Tilman Hausherr commented on TIKA-1442:
---------------------------------------

Ok, will do.
About the seq vs. nonSeq test: this will take some more time to understand. 
I've already opened PDFBOX-2523 for the problem that comes up most often.
However, I've also seen files where the non-sequential parser yields one more 
page, e.g. 535691.pdf, 352706.pdf and 212019.pdf.
About "testing the full 250k": Hmm... not now. Most, if not all, of the 
differences will probably be similar to the ones found in the current subset, 
of the kind I already have, e.g. pages with garbage text extraction where the 
garbage differs.

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.8
>
>         Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



