[ 
https://issues.apache.org/jira/browse/TIKA-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17914329#comment-17914329
 ] 

Tim Allison commented on TIKA-4369:
-----------------------------------

LOL... I had forgotten about that one. Shall we close this as "not a problem"?

> Pages extracted twice
> ---------------------
>
>                 Key: TIKA-4369
>                 URL: https://issues.apache.org/jira/browse/TIKA-4369
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, tika-app
>    Affects Versions: 1.27, 2.9.2, 3.0.0
>            Reporter: Tilman Hausherr
>            Priority: Major
>         Attachments: PDFBOX-4417-001031.pdf, result.htm, result.json, result.txt
>
>
> Parts of pages 1 and 2 are extracted twice when I run tika-app with default
> settings. This isn't new; it also happens with 1.27. The duplicated part
> starts with "Improving Generic Drug Review Performance" and appears after the
> content of page 4. It doesn't happen with PDFBox extractText.
> I did some research for a few hours but didn't find anything. Before I start 
> digging deeper (e.g. in the PDFBox stripper), I wonder if there's something 
> obvious that I missed?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
