[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
------------------------------
    Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

This file compares the classic parser against the NonSequential parser, both
in PDFBox 1.8.8-SNAPSHOT-b145.  I've included only the files that showed any
diffs in extracted content, attachments or metadata.

With the NonSeq parser, one exception seen with the classic parser goes away,
but a few handfuls of new exceptions appear.

Text extraction looks mixed, with some files better and some worse.  Note,
though, that only 94 files out of 50,000 PDFs (under 0.2%) show exceptions
or any amount of difference.
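
For anyone who wants to tally a comparison like this themselves, here is a
minimal sketch in Python.  It is not the tooling used to build the attached
spreadsheet.  The function name diff_summary, the dict-of-filename-to-text
layout, and the convention that None marks "the parser threw an exception"
are all assumptions made for illustration.

```python
def diff_summary(classic, nonseq):
    """Compare two {filename: extracted_text_or_None} mappings.

    Assumption of this sketch: None means the parser threw an exception
    on that file; any string (even empty) means it produced output.
    """
    fixed_exceptions = []  # failed with classic, succeeded with NonSeq
    new_exceptions = []    # succeeded with classic, failed with NonSeq
    changed_text = []      # both succeeded, but the extracted text differs

    # Only files present in both runs can be compared.
    for name in sorted(classic.keys() & nonseq.keys()):
        a, b = classic[name], nonseq[name]
        if a is None and b is not None:
            fixed_exceptions.append(name)
        elif a is not None and b is None:
            new_exceptions.append(name)
        elif a != b:
            changed_text.append(name)

    return {
        "fixed_exceptions": fixed_exceptions,
        "new_exceptions": new_exceptions,
        "changed_text": changed_text,
    }


if __name__ == "__main__":
    # Hypothetical data, not taken from the actual test corpus.
    classic = {"a.pdf": "hello", "b.pdf": None, "c.pdf": "text"}
    nonseq = {"a.pdf": "hello!", "b.pdf": "recovered", "c.pdf": None}
    print(diff_summary(classic, nonseq))
```

Files on which both parsers fail, or both produce identical text, fall into
none of the three lists, which matches the idea of reporting only files with
some difference.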

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.8
>
>         Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
