[ https://issues.apache.org/jira/browse/TIKA-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330824#comment-17330824 ]
David Pilato commented on TIKA-3364:
------------------------------------

So I tried this:

{code:java}
PDFParser pdfParser = new PDFParser();
DefaultParser defaultParser;
pdfParser.setExtractAnnotationText(false);

if (!fs.getOcr().isEnabled()) {
    logger.debug("OCR is disabled. Even though it's detected, it must be disabled explicitly");
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(TesseractOCRParser.class));
} else {
    logger.debug("OCR is activated.");
    if (ExternalParser.check("tesseract")) {
        logger.debug("OCR strategy for PDF documents is [{}] and tesseract was found.", fs.getOcr().getPdfStrategy());
        pdfParser.setOcrStrategy(fs.getOcr().getPdfStrategy());
    } else {
        logger.debug("But Tesseract is not installed so we won't run OCR.");
        pdfParser.setOcrStrategy("no_ocr");
    }
    defaultParser = new DefaultParser(
            MediaTypeRegistry.getDefaultRegistry(),
            new ServiceLoader(),
            Collections.singletonList(PDFParser.class));
}
parser = new AutoDetectParser(defaultParser, pdfParser);
{code}

And it seems to be producing the same effect. I'm probably missing something.

When I run it with this configuration, the extracted text is actually:

{code:txt}
\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

So the text is extracted 3 times.
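For anyone who wants to check that count without eyeballing the escaped string, here is a minimal sketch in plain Java (no Tika dependency). The class name {{OccurrenceCount}} and the helper {{count}} are hypothetical names made up for this illustration; they are not part of Tika or FSCrawler.

```java
final class OccurrenceCount {

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int from = 0;
        // indexOf returns -1 once no further match exists
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // The extracted text reported above, with the escapes turned into real characters
        String extracted = "\nDummy PDF file\n\nDummy PDF file\n\n\n\tDummy PDF file\n\n";
        System.out.println(count(extracted, "Dummy PDF file")); // prints 3
    }
}
```

The same helper could be dropped into a unit test that asserts the phrase appears only once after parsing, assuming that is the behaviour one wants.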
When I disable OCR with {{pdfParser.setOcrStrategy("no_ocr")}}, I'm getting:

{code:txt}
\nDummy PDF file\n\n\n\tDummy PDF file\n\n
{code}

> PDF Content is extracted twice
> ------------------------------
>
> Key: TIKA-3364
> URL: https://issues.apache.org/jira/browse/TIKA-3364
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.26
> Reporter: David Pilato
> Priority: Major
> Attachments: Screenshot from 2021-04-23 10-15-22.png, issue-1097.pdf, tika-bookmarks-config.xml
>
>
> Hi
> Coming from [this issue in FSCrawler project|https://github.com/dadoonet/fscrawler/issues/1097], I can see that the text from the PDF document is extracted more than once although PDFBox seems to extract it only once.
> I attached the PDF.
> When I run:
> {code:sh}
> wget https://downloads.apache.org/pdfbox/2.0.23/pdfbox-app-2.0.23.jar
> java -jar pdfbox-app-2.0.23.jar ExtractText -console issue-1097.pdf
> {code}
> I'm getting:
> {code:sh}
> Dummy PDF file
> {code}
> But with Tika:
> {code:sh}
> wget https://downloads.apache.org/tika/tika-app-1.26.jar
> java -jar tika-app-1.26.jar
> {code}
> I'm getting:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="pdf:PDFVersion" content="1.4"/>
> <meta name="xmp:CreatorTool" content="Writer"/>
> <meta name="pdf:hasXFA" content="false"/>
> <meta name="access_permission:modify_annotations" content="true"/>
> <meta name="access_permission:can_print_degraded" content="true"/>
> <meta name="dc:creator" content="Evangelos Vlachogiannis"/>
> <meta name="dcterms:created" content="2007-02-23T15:56:37Z"/>
> <meta name="dc:format" content="application/pdf; version=1.4"/>
> <meta name="pdf:docinfo:creator_tool" content="Writer"/>
> <meta name="access_permission:fill_in_form" content="true"/>
> <meta name="pdf:encrypted" content="false"/>
> <meta name="Content-Length" content="13264"/>
> <meta name="X-TIKA:digest:MD5" content="2942bfabb3d05332b66eb128e0842cff"/>
> <meta name="pdf:hasMarkedContent" content="false"/>
> <meta name="Content-Type" content="application/pdf"/>
> <meta name="pdf:docinfo:creator" content="Evangelos Vlachogiannis"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> <meta name="creator" content="Evangelos Vlachogiannis"/>
> <meta name="meta:author" content="Evangelos Vlachogiannis"/>
> <meta name="meta:creation-date" content="2007-02-23T15:56:37Z"/>
> <meta name="created" content="2007-02-23T15:56:37Z"/>
> <meta name="X-TIKA:digest:SHA256" content="3df79d34abbca99308e79cb94461c1893582604d68329a41fd4bec1885e6adb4"/>
> <meta name="access_permission:extract_for_accessibility" content="true"/>
> <meta name="access_permission:assemble_document" content="true"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="Creation-Date" content="2007-02-23T15:56:37Z"/>
> <meta name="resourceName" content="issue-1097.pdf"/>
> <meta name="pdf:hasXMP" content="false"/>
> <meta name="access_permission:extract_content" content="true"/>
> <meta name="access_permission:can_print" content="true"/>
> <meta name="Author" content="Evangelos Vlachogiannis"/>
> <meta name="producer" content="OpenOffice.org 2.1"/>
> <meta name="access_permission:can_modify" content="true"/>
> <meta name="pdf:docinfo:producer" content="OpenOffice.org 2.1"/>
> <meta name="pdf:docinfo:created" content="2007-02-23T15:56:37Z"/>
> <title/>
> </head>
> <body><div class="page"><p/>
> <p>Dummy PDF file</p>
> <p/>
> </div>
> <ul> <li>Dummy PDF file</li>
> </ul>
> </body></html>
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)