Tilman Hausherr created TIKA-4369:
-------------------------------------

             Summary: Pages extracted twice
                 Key: TIKA-4369
                 URL: https://issues.apache.org/jira/browse/TIKA-4369
             Project: Tika
          Issue Type: Bug
          Components: parser, tika-app
    Affects Versions: 3.0.0, 2.9.2, 1.27
            Reporter: Tilman Hausherr
         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt

Parts of pages 1 and 2 are extracted twice when I run tika-app with default settings. This isn't new; it also happens with 1.27. The duplicated part starts with "Improving Generic Drug Review Performance" and appears after the content of page 4. It doesn't happen with PDFBox extractText.

I spent a few hours researching this but didn't find anything. Before I start digging deeper (e.g. in the PDFBox stripper), I wonder if there's something obvious that I missed?
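
One way to narrow down which passages are duplicated is to list the lines that occur more than once in the extracted text. The following is only a minimal sketch of such a check, not part of Tika or PDFBox: the function name `find_repeated_lines` and the `min_length` threshold are hypothetical, and it assumes the duplicated content comes out as identical lines once whitespace is normalized.

```python
from collections import defaultdict


def find_repeated_lines(text, min_length=20):
    """Return {line: [1-based line numbers]} for lines that occur more than once.

    Lines shorter than min_length (after whitespace normalization) are ignored
    so that blank lines and short labels do not flood the result.
    """
    positions = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        # Collapse runs of whitespace so spacing differences do not hide a repeat.
        normalized = " ".join(line.split())
        if len(normalized) >= min_length:
            positions[normalized].append(number)
    # Keep only the lines that were seen at more than one position.
    return {line: where for line, where in positions.items() if len(where) > 1}
```

Running this over the contents of the attached result.txt and over the PDFBox extractText output should show whether the repeated lines appear only in the Tika result, and at which line the second copy begins.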