[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045718#comment-14045718 ]
Maruan Sahyoun commented on TIKA-1300:
--------------------------------------

Thanks for running this test. The good part is that for the vast majority of files there are no exceptions.

To get some additional information, I did a very quick test with 101819.pdf, 200939.pdf, 491579.pdf, 556251.pdf, 710399.pdf, 148186.pdf, 231828.pdf, 527566.pdf, 630287.pdf, 167424.pdf, 277375.pdf, 545359.pdf and 702923.pdf, all of which threw exceptions in the NSP (NonSequentialParser).

Results:
# 556251.pdf couldn't be opened by Adobe Reader.
# 167424.pdf, 231828.pdf, 527566.pdf, 630287.pdf, 702923.pdf, 710399.pdf, 101819.pdf and 491579.pdf needed to be repaired by Adobe Reader.
# For 101819.pdf and 491579.pdf, Adobe Reader additionally showed a dialog box about missing content/issues when opening them.
# 141816.pdf, 200939.pdf, 277375.pdf and 545359.pdf had no complaints in Adobe Reader.
# 141816.pdf is encrypted, with password security applied that does not permit text extraction. The exception in the NSP is caused by no password having been supplied. Text extraction is also disabled in Adobe Reader.

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?

--
This message was sent by Atlassian JIRA
(v6.2#6252)