[ 
https://issues.apache.org/jira/browse/TIKA-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved TIKA-4443.
-----------------------------------
    Fix Version/s: 4.0.0
                   3.2.2
         Assignee: Tilman Hausherr
       Resolution: Fixed

It's fixed now in PDFBox. You'll benefit only once the next PDFBox version has 
been released and a Tika release picks it up. Alternatively, you can take a 
snapshot build of PDFBox and combine it with Tika.
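
For readers who want to see the failure mode in isolation: the trace quoted 
below shows a name tree value that was expected to be a dictionary but was 
actually an array, and an unchecked cast threw. The following is only a 
minimal sketch of that pattern and of a type check that avoids it. It is not 
the actual PDFBox change. Plain java.util.Map and java.util.List stand in for 
COSDictionary and COSArray, and the class NameTreeGuardSketch, the FileSpec 
wrapper, and both convert methods are hypothetical names invented for this 
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeGuardSketch {

    /** Hypothetical stand-in for a typed wrapper around a dictionary entry. */
    public static final class FileSpec {
        public final Map<String, Object> dict;

        FileSpec(Map<String, Object> dict) {
            this.dict = dict;
        }
    }

    /**
     * Unchecked variant: assumes every value is a dictionary (here: a Map).
     * Passing an array-like value (here: a List) throws ClassCastException,
     * which is the same kind of failure the trace reports.
     */
    public static FileSpec convertUnchecked(Object base) {
        @SuppressWarnings("unchecked")
        Map<String, Object> dict = (Map<String, Object>) base;
        return new FileSpec(dict);
    }

    /**
     * Checked variant: tests the runtime type first and returns null for
     * anything that is not a dictionary, so the caller can skip the entry.
     * Returning null is an assumption made for this sketch only.
     */
    public static FileSpec convertChecked(Object base) {
        if (base instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> dict = (Map<String, Object>) base;
            return new FileSpec(dict);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> wellFormed = new LinkedHashMap<>();
        wellFormed.put("F", "attachment.txt");
        List<Object> unexpected = new ArrayList<>(); // plays the role of the array

        System.out.println("checked, dictionary: " + (convertChecked(wellFormed) != null));
        System.out.println("checked, array skipped: " + (convertChecked(unexpected) == null));

        try {
            convertUnchecked(unexpected);
        } catch (ClassCastException e) {
            System.out.println("unchecked, array: ClassCastException");
        }
    }
}
```

Whether a malformed entry should be skipped, logged, or reported is a design 
decision for the library; the sketch only shows that the type check turns a 
crash into a recoverable case.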

> ClassCastException while extracting the text of a PDF
> -----------------------------------------------------
>
>                 Key: TIKA-4443
>                 URL: https://issues.apache.org/jira/browse/TIKA-4443
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.0.0, 3.1.0, 3.2.0
>            Reporter: Olivier Ceulemans
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 4.0.0, 3.2.2
>
>         Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf, 
> screenshot-1.png, screenshot-2.png
>
>
> A ClassCastException occurs when extracting the text of the attached PDF 
> file with Tika 3.2.0, 3.1.0, and 3.0.0. I did not try earlier versions.
> A simple way to reproduce the issue is to use the 
> org.apache.tika.example.SimpleTextExtractor class of the tika-example 
> library, which is part of the distribution.
> I also tried plain PDFBox without Tika, and the text can be extracted. 
> That makes me assume this is a real issue rather than a corrupted PDF.
> Here is the stack trace:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at org.apache.tika.Tika.parseToString(Tika.java:525)
>     at org.apache.tika.Tika.parseToString(Tika.java:495)
>     at org.apache.tika.Tika.parseToString(Tika.java:594)
>     at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
> Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
>     at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 6 more
>
> And here is the file that causes the issue:
> [^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)