[ https://issues.apache.org/jira/browse/TIKA-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914329#comment-17914329 ]
Tim Allison commented on TIKA-4369:
-----------------------------------

LOL... I had forgotten about that one. Shall we close this as "not a problem"?

> Pages extracted twice
> ---------------------
>
>                 Key: TIKA-4369
>                 URL: https://issues.apache.org/jira/browse/TIKA-4369
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, tika-app
>    Affects Versions: 1.27, 2.9.2, 3.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt
>
>
> Parts of pages 1 and 2 are extracted twice when I run tika-app with default
> settings. This isn't new, it also happens with 1.27. The duplicate part
> starts with "Improving Generic Drug Review Performance", after the content of
> page 4. It doesn't happen with PDFBox extractText.
> I did some research for a few hours but didn't find anything. Before I start
> digging deeper (e.g. in the PDFBox stripper), I wonder if there's something
> obvious that I missed?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)