[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1300:
------------------------------
    Attachment: tika_1_6_ClassicsVsNonSeq.zip

The attached shows the results of running Tika 1.6 trunk with PDFBox 1.8.6 on a random selection of 10,000 govdocs1 pdfs. We used the default (do not extract images) setting. On one run, we used the default classic parser, and on the other we used the new (and future classic) NonSequential Parser (NSP).

Both parsers shared 11 exceptions. The NSP had 24 exceptions that the classic parser did not have, and the classic parser had no exceptions that the NSP did not also have.

The contents of the extracted text (at least by unigram token counts), number of attachments and number of metadata features were nearly identical. There were only two files where the number of tokens varied and that was very, very slightly.

The difference in speed was not operationally noticeable:
median per file: 96 millis for classic
median 93 millis for NSP

average per file: 264 millis for classic
average per file: 269 millis for NSP

Given that there were more exceptions with the NSP (admittedly a very small number), I'm hesitant to change the default parser within Tika to NSP...unless there are benefits that I'm not taking into consideration.

This corpus clearly has limitations. Any thoughts or other benchmarks we should consider?

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?
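
For anyone wanting to repeat the exception comparison from the comment above (shared, NSP-only, classic-only) on another corpus, it comes down to set operations over the files that failed in each run. The following is only a minimal sketch: the function name `compare_exceptions` and the sample file names are hypothetical and are not taken from the attached zip, whose layout is not described here.

```python
def compare_exceptions(classic_failed, nsp_failed):
    """Split failing files into shared, classic-only and NSP-only groups.

    Both arguments are iterables of file names that threw an exception
    in the respective run (hypothetical input format).
    """
    classic_failed = set(classic_failed)
    nsp_failed = set(nsp_failed)
    return {
        "shared": classic_failed & nsp_failed,        # failed in both runs
        "classic_only": classic_failed - nsp_failed,  # failed only with the classic parser
        "nsp_only": nsp_failed - classic_failed,      # failed only with the NSP
    }


# Hypothetical file names for illustration only.
classic = {"a.pdf", "b.pdf"}
nsp = {"a.pdf", "b.pdf", "c.pdf"}
result = compare_exceptions(classic, nsp)
for group, files in result.items():
    print(group, len(files), sorted(files))
```

With real per-run failure lists as input, the three group sizes would correspond to the 11 / 0 / 24 counts reported above.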