Olivier Ceulemans created TIKA-4443:
---------------------------------------

             Summary: ClassCastException while extracting the text of a PDF
                 Key: TIKA-4443
                 URL: https://issues.apache.org/jira/browse/TIKA-4443
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 3.2.0, 3.1.0, 3.0.0
            Reporter: Olivier Ceulemans
         Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf

A ClassCastException occurs when extracting the text of the attached PDF 
file with Tika 3.2.0, 3.1.0, and 3.0.0. I did not try earlier versions.

A simple way to reproduce the issue is to run the 
org.apache.tika.example.SimpleTextExtractor class from the tika-example library, 
which is part of the distribution, against the attached file.

I also tried plain PDFBox without Tika, and the text can be extracted. 
That makes me assume this could be a real issue rather than a corrupted 
PDF.

Here is the stack trace:
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.Tika.parseToString(Tika.java:525)
    at org.apache.tika.Tika.parseToString(Tika.java:495)
    at org.apache.tika.Tika.parseToString(Tika.java:594)
    at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
    at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
    at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
    at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 6 more
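
To illustrate the failure mode the trace points to, here is a minimal, 
library-free sketch. It is not PDFBox or Tika code: CosValue, CosArrayStub and 
CosDictionaryStub are hypothetical stand-ins for the PDFBox COS types, and both 
convert methods are assumptions made for illustration only. The sketch shows how 
an unconditional downcast fails when a name tree entry holds an array where a 
dictionary is expected, and how an instanceof check would avoid the exception.

```java
public class NameTreeCastSketch {

    // Hypothetical stand-ins for the PDFBox COS types named in the trace.
    interface CosValue {}
    static final class CosArrayStub implements CosValue {}
    static final class CosDictionaryStub implements CosValue {}

    // Assumed shape of the failing pattern: a downcast with no type check.
    static CosDictionaryStub uncheckedConvert(CosValue value) {
        return (CosDictionaryStub) value; // ClassCastException if value is an array
    }

    // Defensive variant (an assumption, not the actual fix): skip unexpected types.
    static CosDictionaryStub checkedConvert(CosValue value) {
        return value instanceof CosDictionaryStub ? (CosDictionaryStub) value : null;
    }

    // Reports whether the unchecked conversion fails with a ClassCastException.
    static boolean throwsClassCast(CosValue value) {
        try {
            uncheckedConvert(value);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        CosValue entry = new CosArrayStub(); // entry of the unexpected type
        System.out.println("unchecked cast throws: " + throwsClassCast(entry));
        System.out.println("checked convert skips entry: " + (checkedConvert(entry) == null));
    }
}
```

Whether such a check belongs in Tika's embedded-file extraction or in PDFBox 
itself is left open here.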

 

And here is the file that causes the issue:

[^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
