[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045705#comment-14045705
 ] 

Timo Boehme commented on TIKA-1300:
-----------------------------------

In general the NonSequentialParser (NSP) conforms to the PDF specification much
more closely than the classic parser, because it uses the xref table to find
each object. This is especially important for changed/updated PDF documents,
which may not be part of the tested collection (?). Here the classic parser
might simply use the wrong content. The drawback of the NSP is therefore that
it depends on a correct xref table. There is still a plan to add an
xref-rebuild feature for cases where the table is broken. With this, the NSP
should also be able to handle the PDFs that currently are 'somehow' handled
only by the classic parser.
On the other hand, there are lots of correct (!) PDF documents where the classic 
parser fails but the NSP is fine. They contain some trash within the data 
that is not touched by any object referenced by the xref table, but the classic 
parser reads the trash and throws an exception.

In summary: the NSP is much better on correct PDFs; if the xref table is broken, 
the classic parser might be able to parse 'something' (it need not be correct) 
where the NSP currently stops parsing.
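
To make the trade-off concrete, here is a rough toy model of the two lookup 
strategies. It is not PDFBox code and does not use real PDF syntax: the record 
layout, the function names, and the rule that the sequential reader keeps the 
first definition it sees are all assumptions made purely for illustration.

```python
# Toy model only: NOT PDFBox code and NOT real PDF syntax.
# A "file" is a list of records (kind, object_id, data).
# The "xref table" maps object_id -> index of the valid record.


def parse_sequential(records):
    """Classic-style sketch: read every record front to back, ignoring xref."""
    objects = {}
    for kind, obj_id, data in records:
        if kind != "obj":
            # Unreferenced trash is still read, so it breaks the parse.
            raise ValueError("unparseable data: %r" % (data,))
        # Assumption of this toy: the first definition seen is kept.
        objects.setdefault(obj_id, data)
    return objects


def parse_via_xref(records, xref):
    """NSP-style sketch: only visit the records the xref table points at."""
    objects = {}
    for obj_id, index in xref.items():
        if not 0 <= index < len(records):
            raise ValueError("broken xref entry for object %d" % obj_id)
        kind, found_id, data = records[index]
        if kind != "obj" or found_id != obj_id:
            raise ValueError("broken xref entry for object %d" % obj_id)
        objects[obj_id] = data
    return objects


# Updated document: object 1 was rewritten and the new version appended.
updated = [("obj", 1, "old text"), ("obj", 2, "font"), ("obj", 1, "new text")]
updated_xref = {1: 2, 2: 1}

# Correct document with unreferenced trash between two objects.
with_trash = [("obj", 1, "text"), ("junk", None, "%%garbage"), ("obj", 2, "font")]
with_trash_xref = {1: 0, 2: 2}

# Document whose xref table points outside the file.
broken_xref = {1: 5}
```

In this model, parse_via_xref(updated, updated_xref) picks up "new text" while 
parse_sequential(updated) keeps "old text"; parse_sequential(with_trash) raises 
although the xref lookup succeeds; and parse_via_xref(updated, broken_xref) 
raises while the sequential pass still returns 'something'.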

(remark: I haven't had a deeper look into your result file yet)

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
