[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045705#comment-14045705 ]
Timo Boehme commented on TIKA-1300:
-----------------------------------

In general the NSP (NonSequentialParser) conforms much more closely to the specification than the classic parser, because it uses the xref table to find each object. This is especially important for changed/updated PDF documents, which may not be part of the tested collection (?). For those, the classic parser might simply use the wrong content.

The drawback of the NSP is therefore that it depends on a correct xref table. There is still a plan to add an xref-rebuild feature for the case where the table is broken. With that, the NSP should also be able to handle the PDFs that are currently handled 'somehow' only by the classic parser.

On the other hand, there are lots of correct (!) PDF documents where the classic parser fails but the NSP is fine. These files contain some trash within the data that is not touched by any object referenced from the xref table, but the classic parser reads the trash and throws an exception.

In summary: the NSP is much better on correct PDFs; in case of a broken xref table the classic parser might be able to parse 'something' (which need not be correct), whereas the NSP currently stops parsing.

(Remark: I haven't had a deeper look into your result file yet.)

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
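
The trade-off described in the comment can be illustrated with a toy model. The sketch below is not PDFBox code and does not parse real PDF syntax; the file layout, the function names (`build_toy_file`, `parse_via_xref`, `parse_sequentially`) and the strictness of the sequential reader are all assumptions made for illustration. It only shows why lookup through an offset table ignores unreferenced trash and superseded objects, and why the same lookup stops as soon as the offsets are wrong.

```python
import re

# Toy object header, e.g. b"1 0 obj\n". Real PDF syntax is far richer.
OBJ_HEADER = re.compile(rb"(\d+) 0 obj\n")
OBJ_END = b"\nendobj\n"


def build_toy_file():
    """Return (data, xref); xref maps object number -> byte offset."""
    parts = [
        (1, b"1 0 obj\nold title\nendobj\n"),  # superseded by an update
        (2, b"2 0 obj\nbody\nendobj\n"),
        (None, b"%%trash bytes%%\n"),           # referenced by nothing
        (1, b"1 0 obj\nnew title\nendobj\n"),  # current version of object 1
    ]
    data, xref = b"", {}
    for num, chunk in parts:
        if num is not None:
            xref[num] = len(data)  # a later copy replaces the earlier offset
        data += chunk
    return data, xref


def read_object_at(data, offset):
    """Parse one object starting exactly at offset, else raise ValueError."""
    header = OBJ_HEADER.match(data, offset)
    if header is None:
        raise ValueError(f"no object header at offset {offset}")
    end = data.find(OBJ_END, header.end())
    if end < 0:
        raise ValueError(f"unterminated object at offset {offset}")
    return int(header.group(1)), data[header.end():end], end + len(OBJ_END)


def parse_via_xref(data, xref):
    """Xref-driven lookup: only bytes the table points at are ever read."""
    objects = {}
    for num, offset in xref.items():
        found, content, _ = read_object_at(data, offset)
        if found != num:
            raise ValueError(f"xref entry {num} points at object {found}")
        objects[num] = content
    return objects


def parse_sequentially(data):
    """Front-to-back scan (assumed strict): every byte must belong to an object."""
    pos, seen = 0, []
    while pos < len(data):
        num, content, pos = read_object_at(data, pos)
        seen.append((num, content))
    return seen
```

In this model, `parse_via_xref(*build_toy_file())` returns only the current objects, while `parse_sequentially` raises on the trash bytes. With the trash removed, the sequential scan succeeds but also returns the superseded copy of object 1, so the caller has to guess which version is valid. Shifting every xref offset by a few bytes makes `parse_via_xref` fail immediately, which corresponds to the broken-xref case where an xref-rebuild step would be needed.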