[ https://issues.apache.org/jira/browse/TIKA-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved TIKA-4443.
-----------------------------------
    Fix Version/s: 4.0.0
                   3.2.2
         Assignee: Tilman Hausherr
       Resolution: Fixed

It's fixed now in PDFBox. You'll benefit only after the next PDFBox version has been released, and a Tika release that includes it. Alternatively, you'd need to take a snapshot build of PDFBox and combine that with Tika.

> ClassCastException while extracting the text of a PDF
> -----------------------------------------------------
>
>                 Key: TIKA-4443
>                 URL: https://issues.apache.org/jira/browse/TIKA-4443
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0
>            Reporter: Olivier Ceulemans
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 4.0.0, 3.2.2
>
>         Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf,
> screenshot-1.png, screenshot-2.png
>
>
> A ClassCastException occurs when trying to extract the text of the attached
> PDF file with Tika 3.2.0, 3.1.0 and 3.0.0. I did not try previous versions.
> A simple way to reproduce the issue is to use the
> org.apache.tika.example.SimpleTextExtractor class of the tika-example
> library, part of the distribution.
> I also tried plain PDFBox without Tika, and the text can be extracted.
> That makes me assume that this could be a real issue rather than a corrupted
> PDF.
> Here is the stack trace:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at org.apache.tika.Tika.parseToString(Tika.java:525)
>     at org.apache.tika.Tika.parseToString(Tika.java:495)
>     at org.apache.tika.Tika.parseToString(Tika.java:594)
>     at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
> Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
>     at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 6 more
>
> And here is the file that causes the issue:
> [^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
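
For readers following the trace: the failure is an unchecked cast of an embedded-files name tree value that turned out to be an array rather than a dictionary. The sketch below only illustrates that failure pattern and one defensive alternative. It does not use PDFBox: `CosValue`, `CosDict`, `CosArr` and `NameTreeCastSketch` are hypothetical stand-ins, and nothing here is meant to describe how the actual fix in PDFBox was implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the object types named in the stack trace.
interface CosValue {}

final class CosDict implements CosValue {
    final String fileName;
    CosDict(String fileName) { this.fileName = fileName; }
}

final class CosArr implements CosValue {
    final List<CosValue> items = new ArrayList<>();
}

final class NameTreeCastSketch {
    // Failing pattern: every value in the name tree is assumed to be a dictionary.
    static List<String> uncheckedNames(Map<String, CosValue> names) {
        List<String> out = new ArrayList<>();
        for (CosValue value : names.values()) {
            out.add(((CosDict) value).fileName); // ClassCastException if value is a CosArr
        }
        return out;
    }

    // Defensive pattern: check the runtime type and skip unexpected entries.
    static List<String> checkedNames(Map<String, CosValue> names) {
        List<String> out = new ArrayList<>();
        for (CosValue value : names.values()) {
            if (value instanceof CosDict) {
                out.add(((CosDict) value).fileName);
            }
        }
        return out;
    }

    // A tiny name tree with one well-formed entry and one unexpected array entry.
    static Map<String, CosValue> sampleTree() {
        Map<String, CosValue> names = new LinkedHashMap<>();
        names.put("ok", new CosDict("a.txt"));
        names.put("odd", new CosArr());
        return names;
    }

    public static void main(String[] args) {
        Map<String, CosValue> tree = sampleTree();
        System.out.println("checked: " + checkedNames(tree));
        try {
            uncheckedNames(tree);
        } catch (ClassCastException e) {
            System.out.println("unchecked variant failed with ClassCastException");
        }
    }
}
```

Skipping the unexpected entry is only one possible policy; a real parser might instead log it or try to interpret the array, which this sketch does not attempt.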