[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ]
Tilman Hausherr commented on TIKA-1300:
---------------------------------------

My impression was that the NSP (NonSequentialParser) had better results for good PDF files. I'm surprised that the old parser has fewer problems - but then, the first two files in the list had incorrect xref tables. The old parser just reads through the stuff even if the xref table is crap. I wonder whether both parsers should work as a team, i.e. try the first one and, if there is an exception, try the second one.

Anyway, when I'm bored, I'll have a look at the files in the list.

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?
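
The "team" idea from the comment (try one parser and, if it throws, try the other) could look roughly like the sketch below. This is only an illustration under stated assumptions: `DocParser`, `FallbackParse` and `parseWithFallback` are hypothetical names invented here, not Tika or PDFBox types, and the real parsers would have to be wrapped to fit this shape.

```java
/**
 * Minimal sketch of a two-parser fallback.
 * All names here are hypothetical; nothing below is a Tika or PDFBox API.
 */
final class FallbackParse {

    /** Hypothetical stand-in for "something that parses a document". */
    public interface DocParser<T> {
        T parse(byte[] data) throws Exception;
    }

    /**
     * Runs the primary parser; if it throws, runs the fallback on the same bytes.
     * If both fail, the fallback's exception is rethrown with the primary's
     * exception attached as suppressed, so neither failure is lost.
     */
    public static <T> T parseWithFallback(byte[] data,
                                          DocParser<T> primary,
                                          DocParser<T> fallback) throws Exception {
        try {
            return primary.parse(data);
        } catch (Exception first) {
            try {
                return fallback.parse(data);
            } catch (Exception second) {
                second.addSuppressed(first);
                throw second;
            }
        }
    }

    private FallbackParse() {
        // static helper only
    }
}
```

The sketch takes a `byte[]` rather than a stream because a stream that the first parser has partly consumed cannot simply be handed to the second one; the input has to be re-readable in some form.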