Olivier Ceulemans created TIKA-4443:
---------------------------------------

             Summary: ClassCastException while extracting the text of a PDF
                 Key: TIKA-4443
                 URL: https://issues.apache.org/jira/browse/TIKA-4443
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.2.0, 3.1.0, 3.0.0
            Reporter: Olivier Ceulemans
         Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf

A ClassCastException occurs when trying to extract the text of the attached PDF file with tika 3.2.0, 3.1.0, 3.0.0. I did not try previous versions.

A simple way to reproduce the issue is to use the org.apache.tika.example.SimpleTextExtractor class of the tika-example library, part of the distribution.

I also tried to use plain pdfbox without tika and the text can be extracted. That makes me assume that this could be a real issue rather than a corrupted PDF.

Here is the stack trace:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.Tika.parseToString(Tika.java:525)
    at org.apache.tika.Tika.parseToString(Tika.java:495)
    at org.apache.tika.Tika.parseToString(Tika.java:594)
    at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
    at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
    at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
    at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 6 more

And here is the file that causes the issue:

[^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
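
For illustration only, the failure mode in the "Caused by" frame can be sketched in plain Java without Tika or PDFBox on the classpath. This is not the actual Tika or PDFBox code: the class NameTreeSketch and its methods are hypothetical, a Map stands in for a COSDictionary and a List stands in for a COSArray. It only shows how an unchecked cast of a name tree value fails when the value is an array, and how a type check could skip such an entry, assuming that skipping is an acceptable behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types: Map plays the role of a COSDictionary,
// List plays the role of a COSArray. Not the real PDFBox classes.
class NameTreeSketch {

    // Mirrors the failing pattern: cast the value without checking its type.
    // Throws ClassCastException when the value is a List instead of a Map.
    static Map<?, ?> uncheckedConvert(Object value) {
        return (Map<?, ?>) value;
    }

    // Defensive variant: keep only entries whose value is dictionary-like
    // and silently skip everything else (an assumption, not Tika's behaviour).
    static List<String> collectFileSpecs(Map<String, Object> names) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Object> entry : names.entrySet()) {
            if (entry.getValue() instanceof Map) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Object> names = new LinkedHashMap<>();
        names.put("a.txt", new LinkedHashMap<String, Object>()); // well-formed entry
        names.put("broken", new ArrayList<Object>());            // array where a dictionary is expected

        System.out.println("Entries kept by the type check: " + collectFileSpecs(names));

        try {
            uncheckedConvert(names.get("broken"));
        } catch (ClassCastException e) {
            System.out.println("Unchecked cast failed: " + e.getMessage());
        }
    }
}
```

Whether such an entry should be skipped, logged, or interpreted differently is a decision for the Tika and PDFBox maintainers; the sketch only makes the type mismatch from the stack trace concrete.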