[jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

Tim Allison (JIRA) Thu, 26 Jun 2014 09:53:42 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison updated TIKA-1300:
------------------------------

    Attachment: tika_1_6_ClassicsVsNonSeq.zip

The attached zip shows the results of running Tika 1.6 trunk with PDFBox 1.8.6 
on a random selection of 10,000 govdocs1 PDFs.  We used the default setting (do 
not extract images).  

In one run we used the default classic parser, and in the other we used the 
new (and future classic) NonSequentialParser (NSP).

Both parsers shared 11 exceptions.  The NSP had 24 exceptions that the classic 
parser did not have, and the classic parser had no exceptions that the NSP did 
not also have.
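
A minimal sketch of how the shared / NSP-only / classic-only split above can be 
computed, assuming each run has been reduced to the set of file names that threw 
an exception.  The class name, method names, and file names here are 
hypothetical and are not taken from the attachment.

```java
import java.util.Set;
import java.util.TreeSet;

final class ExceptionOverlap {

    /** Files that threw an exception under both parsers. */
    static Set<String> shared(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    /** Files that threw an exception under {@code a} but not under {@code b}. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        Set<String> classic = Set.of("000001.pdf", "000002.pdf");
        Set<String> nsp = Set.of("000001.pdf", "000002.pdf", "000003.pdf");

        System.out.println("shared:       " + shared(classic, nsp));
        System.out.println("NSP only:     " + onlyIn(nsp, classic));
        System.out.println("classic only: " + onlyIn(classic, nsp));
    }
}
```

With the counts reported above, the three sets would have sizes 11, 24, and 0, 
so the classic parser failed on 11 files in total and the NSP on 35.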

The extracted text (at least as measured by unigram token counts), the number 
of attachments, and the number of metadata features were nearly identical.  The 
token counts differed in only two files, and in both cases only very slightly.
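
For reference, a rough sketch of what a unigram token count comparison can look 
like.  This assumes lowercasing and whitespace tokenization, which may not be 
what the attached comparison used; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class UnigramCounts {

    /** Counts lowercased, whitespace-delimited tokens (an assumed tokenization). */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** True if both extractions yield the same token -> count map. */
    static boolean sameUnigrams(String textA, String textB) {
        return count(textA).equals(count(textB));
    }
}
```

Because only counts are compared, two extractions that emit the same words in a 
different order look identical under this measure, hence the "at least" caveat 
above.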

The difference in speed was not operationally noticeable:
median per file:  96 ms for classic
median per file:  93 ms for NSP
average per file: 264 ms for classic
average per file: 269 ms for NSP
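
The averages above are well over twice the medians, which is what one would 
expect if a small number of slow files dominate the total time.  A minimal 
sketch of the two statistics, assuming per-file timings are available as an 
array of milliseconds (the class name and sample values are hypothetical):

```java
import java.util.Arrays;

final class TimingSummary {

    /** Median of the per-file timings; the input array is not modified. */
    static double median(long[] millis) {
        if (millis.length == 0) {
            throw new IllegalArgumentException("no timings");
        }
        long[] sorted = millis.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Arithmetic mean of the per-file timings. */
    static double mean(long[] millis) {
        if (millis.length == 0) {
            throw new IllegalArgumentException("no timings");
        }
        return Arrays.stream(millis).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Hypothetical timings: four typical files and one slow outlier.
        long[] millis = {90, 95, 100, 105, 900};
        System.out.println("median: " + median(millis)); // 100.0
        System.out.println("mean:   " + mean(millis));   // 258.0
    }
}
```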

Given that there were more exceptions with the NSP (admittedly a very small 
number), I'm hesitant to change the default parser within Tika to NSP, unless 
there are benefits that I'm not taking into consideration.

This corpus clearly has limitations.

Any thoughts or other benchmarks we should consider?

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
