[ https://issues.apache.org/jira/browse/TIKA-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913307#comment-17913307 ]
Hudson commented on TIKA-4363:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_2x-jdk11 #585 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch_2x-jdk11/585/])
TIKA-4363: refactor (tilman: [https://github.com/apache/tika/commit/636f57b40ad610f5dfbc8dce203a0b251ccff56d])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFMarkedContent2XHTML.java


> Duplicate text when OCR and extractMarkedContent (PDFParserConfig) enabled
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4363
>                 URL: https://issues.apache.org/jira/browse/TIKA-4363
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Alexey Pismenskiy
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: MarkedPdfDuplicateTextWithTesseract.pdf, tika-conf-override.xml
>
>
> Extracting a marked PDF while OCR (Tesseract) and extractMarkedContent are enabled, and the ocrStrategy defaults to "auto" within the PDFParser, causes duplicate text extraction.
> Attached are an example configuration and a marked PDF file that reproduce the issue with the following test:
> {{@Test}}
> {{public void testPDFDuplicate() throws Exception {}}
> {{    String tikaConfigFileName = "/test-documents/tika-conf-override.xml";}}
> {{    TikaConfig tikaConfig = new TikaConfig(getClass().getResourceAsStream(tikaConfigFileName));}}
> {{    Tika tika = new Tika(tikaConfig);}}
> {{    String issueFile = "/test-documents/MarkedPdfDuplicateTextWithTesseract.pdf";}}
> {{    URL resource = getClass().getResource(issueFile);}}
> {{    assert resource != null;}}
> {{    try (InputStream issueStream = resource.openStream()) {}}
> {{        String issueContent = tika.parseToString(issueStream);}}
> {{        System.out.println(issueContent);}}
> {{        assertTrue(issueContent.contains("aabb6ba1-34ab-4af2"));}}
> {{        assertEquals(1, StringUtils.countMatches(issueContent, "aabb6ba1-34ab-4af2"), "Does not contain the expected number of occurrences");}}
> {{    }}}
> {{}}}
>
> PDFParser.java:214
> * This is where it checks for the extractMarkedContent flag and goes into the PDFMarkedContent2XHTML class.
>
> AbstractPDF2XHTML.java:791 - 806
> * In this code, totalCharsPerPage was never updated by PDFMarkedContent2XHTML, so the conditions for performing OCR on the PDF are met even though text has already been extracted.
> One thing to note: if we turn off extractMarkedContent, then it goes into PDF2XHTML on PDFParser.java:219 and the variable totalCharsPerPage gets updated properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
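
The counter behaviour described in the issue can be modelled in a few lines of Java. This is a toy sketch, not Tika code: the class AutoOcrSketch, the Mode enum, and the threshold MIN_CHARS_TO_SKIP_OCR are all hypothetical names and values chosen for illustration, and the "OCR" step simply re-emits the page text. It only shows why an extraction path that skips the per-page character count makes an "auto" strategy run OCR on a page whose text was already extracted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an "auto" OCR decision driven by a per-page character count. Not Tika code. */
class AutoOcrSketch {

    /** Hypothetical extraction paths: plain text, marked content without the count, marked content with it. */
    enum Mode { PLAIN, MARKED_BUGGY, MARKED_FIXED }

    /** Hypothetical threshold: pages with fewer extracted characters are treated as image-only. */
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /** Plays the role that the issue attributes to totalCharsPerPage. */
    private int totalCharsPerPage = 0;

    /** Emits the page text; only the non-buggy paths record how many characters they emitted. */
    private String extract(String pageText, Mode mode) {
        if (mode != Mode.MARKED_BUGGY) {
            totalCharsPerPage += pageText.length();
        }
        return pageText;
    }

    /** "auto" decision: run OCR only when the page appears to contain almost no text. */
    private boolean shouldOcr() {
        return totalCharsPerPage < MIN_CHARS_TO_SKIP_OCR;
    }

    /** Returns every piece of text emitted for one page, including the stand-in OCR output. */
    static List<String> processPage(String pageText, Mode mode) {
        AutoOcrSketch page = new AutoOcrSketch();
        List<String> emitted = new ArrayList<>();
        emitted.add(page.extract(pageText, mode));
        if (page.shouldOcr()) {
            // Stand-in for OCR recognising the same visible text a second time.
            emitted.add(pageText);
        }
        return emitted;
    }

    public static void main(String[] args) {
        String text = "aabb6ba1-34ab-4af2";
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": emitted " + processPage(text, mode).size() + " time(s)");
        }
    }
}
```

In this model only the MARKED_BUGGY mode emits the text twice, which matches the duplicate seen by the reporter's assertEquals(1, ...) check; counting characters in the marked-content path removes the second copy.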