Yurii created TIKA-3847:
---------------------------

             Summary: NullPointerException when processing pdf document(Allow proceed on RuntimeException)
                 Key: TIKA-3847
                 URL: https://issues.apache.org/jira/browse/TIKA-3847
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.4.1
            Reporter: Yurii


I have a PDF document with some corrupted pages (they throw errors even in PDF 
readers such as Adobe Acrobat). However, only a few of its 370 pages fail.

The issue is that the first corrupted page is the 15th, and processing of the 
whole document fails after that.

A NullPointerException is thrown when getting the fonts of the corrupted page.
What I propose is to allow (by config, default false) handling any runtime 
exception, as we do for IntermediaryIOExceptions.
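
To make the proposal concrete, here is a minimal, self-contained sketch of the 
behaviour I have in mind. It does not use the real Tika or PDFBox classes: 
PageLoopSketch, processPages, the extractor argument and the 
catchRuntimeExceptions flag are all hypothetical names chosen for illustration, 
and the extractor merely stands in for the per-page text extraction call. The 
actual patch may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PageLoopSketch {

    /** Hypothetical result holder: extracted text plus a note for every skipped page. */
    public static final class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<String> skippedPages = new ArrayList<>();
    }

    /**
     * Processes pages 1..pageCount in order.
     *
     * @param extractor              stand-in for the per-page extraction call (assumption)
     * @param catchRuntimeExceptions hypothetical opt-in flag, false by default in the proposal
     */
    public static Result processPages(int pageCount,
                                      IntFunction<String> extractor,
                                      boolean catchRuntimeExceptions) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.text.append(extractor.apply(page)).append('\n');
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    // Flag off: rethrow, so the whole document fails as it does today.
                    throw e;
                }
                // Flag on: record the failure and continue with the next page.
                result.skippedPages.add("page " + page + ": " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a 370-page document whose 15th page is corrupted.
        IntFunction<String> extractor = page -> {
            if (page == 15) {
                throw new NullPointerException("simulated corrupted font");
            }
            return "text of page " + page;
        };
        Result r = processPages(370, extractor, true);
        System.out.println("skipped pages: " + r.skippedPages);
    }
}
```

With the flag off the first failing page still aborts the run, so existing 
behaviour is unchanged unless the user opts in. With the flag on, the failure 
is recorded per page rather than silently dropped, so callers can still see 
which pages were skipped.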

Unfortunately, due to an NDA I can't share the document for debugging purposes, 
but here is the stack trace:
{code:java}
java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
    at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
    at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
    at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
{code}
I will attach a proposed patch soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
