[ https://issues.apache.org/jira/browse/TIKA-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985893#comment-17985893 ]
Tilman Hausherr commented on TIKA-4443:
---------------------------------------

Here's a bigger screenshot that shows more of the problem and another problem. Note the WKANCHOR items at two places, once at the correct place and once at the wrong place. Note also the Info dictionary that exists twice. Once at the correct place, and once in the structure tree.

What I think happened is that a merge went wrong. There is a cover page and the "MASTERTEC" data sheet. I hope that this merge wasn't created by PDFBox, we had some similar problems in the early 3.0 versions, although I can't remember whether it happened with the merge utility itself (but it happened with some operations that included combining PDF files).

!screenshot-2.png!

> ClassCastException while extracting the text of a PDF
> -----------------------------------------------------
>
>                 Key: TIKA-4443
>                 URL: https://issues.apache.org/jira/browse/TIKA-4443
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0
>            Reporter: Olivier Ceulemans
>            Priority: Minor
>         Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf, screenshot-1.png, screenshot-2.png
>
>
> A ClassCastException occurs when trying to extract the text of the attached PDF file with tika 3.2.0, 3.1.0, 3.0.0. I did not try previous versions.
> A simple way to reproduce the issue is to use the org.apache.tika.example.SimpleTextExtractor class of the tika-example library, part of the distribution.
> I also tried to use plain pdfbox without tika and the text can be extracted. That makes me assume that this could be a real issue rather than a corrupted PDF.
> Here is the stack trace:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at org.apache.tika.Tika.parseToString(Tika.java:525)
>     at org.apache.tika.Tika.parseToString(Tika.java:495)
>     at org.apache.tika.Tika.parseToString(Tika.java:594)
>     at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
> Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
>     at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 6 more
>
> And here is the file that causes the issue:
> [^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
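
The `Caused by` frame in the quoted stack trace suggests that a value in the embedded-files name tree is an array where a dictionary was expected, and that the conversion casts it without checking its type. The following is only a simplified, standard-library model of that mismatch, not PDFBox or Tika code: `NameTreeSketch`, `blindValues`, and `safeValues` are hypothetical names, with `Map` standing in for a dictionary and `List` for an array. It contrasts a blind cast with a type check that skips the malformed entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a name tree: each name maps to an untyped value.
// A Map models a dictionary, a List models an array.
final class NameTreeSketch {

    // Blind cast: fails with ClassCastException as soon as a value is not a Map.
    static List<Map<?, ?>> blindValues(Map<String, Object> names) {
        List<Map<?, ?>> out = new ArrayList<>();
        for (Object value : names.values()) {
            out.add((Map<?, ?>) value);
        }
        return out;
    }

    // Defensive variant: keeps well-formed entries and records the names
    // of malformed ones instead of aborting the whole extraction.
    static List<Map<?, ?>> safeValues(Map<String, Object> names, List<String> skipped) {
        List<Map<?, ?>> out = new ArrayList<>();
        for (Map.Entry<String, Object> entry : names.entrySet()) {
            if (entry.getValue() instanceof Map) {
                out.add((Map<?, ?>) entry.getValue());
            } else {
                skipped.add(entry.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> names = new LinkedHashMap<>();
        names.put("datasheet", Map.of("F", "datasheet.pdf")); // well-formed entry
        names.put("broken", List.of("unexpected"));           // array where a dictionary belongs

        List<String> skipped = new ArrayList<>();
        System.out.println("valid entries: " + safeValues(names, skipped).size());
        System.out.println("skipped names: " + skipped);

        try {
            blindValues(names);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getMessage());
        }
    }
}
```

Whether skipping such entries is the right behaviour for Tika or PDFBox is a design question for the maintainers; the sketch only illustrates why one malformed value can abort text extraction for the whole file.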