Yurii created TIKA-3847:
---------------------------
Summary: NullPointerException when processing PDF document (allow
proceeding on RuntimeException)
Key: TIKA-3847
URL: https://issues.apache.org/jira/browse/TIKA-3847
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.1
Reporter: Yurii
I have a PDF document with some corrupted pages (they produce errors even in
PDF readers such as Adobe Acrobat). However, only a few of its 370 pages
fail.
The problem is that the first corrupted page is the 15th, and processing of
the whole document fails at that point.
A NullPointerException is thrown while getting the fonts of the corrupted page.
What I propose is to allow (via config, default false) handling any
RuntimeException, as we already do for IntermediaryIOExceptions.
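To make the proposal concrete, here is a minimal sketch of the intended behavior. It is not the patch and does not use Tika or PDFBox classes: the class TolerantPageLoop, the Result holder, and the catchRuntimeExceptions flag are hypothetical names chosen only for illustration, and the page extractor is simulated so that page 15 fails the way the corrupted page does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical sketch; none of these names are Tika or PDFBox API. */
public class TolerantPageLoop {

    /** Text of the pages that were extracted, plus a note for each failed page. */
    public static final class Result {
        public final List<String> pageText = new ArrayList<>();
        public final List<String> failures = new ArrayList<>();
    }

    /**
     * Processes pages 1..pageCount with the given extractor.
     * If catchRuntimeExceptions is false (the proposed default), the first
     * RuntimeException is rethrown and the whole run fails.
     * If it is true, the failure is recorded and the loop moves on.
     */
    public static Result processPages(int pageCount,
                                      IntFunction<String> extractPage,
                                      boolean catchRuntimeExceptions) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.pageText.add(extractPage.apply(page));
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    throw e; // same outcome as today: the document fails
                }
                // Record the failure and continue with the next page.
                result.failures.add("page " + page + ": " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated extractor: page 15 throws, as in the report above.
        IntFunction<String> extractor = page -> {
            if (page == 15) {
                throw new NullPointerException("simulated font failure");
            }
            return "text of page " + page;
        };
        Result r = processPages(370, extractor, true);
        System.out.println(r.pageText.size() + " pages extracted, "
                + r.failures.size() + " failed");
    }
}
```

With the flag off, the caller sees the same exception as now, so existing behavior would not change unless the option is enabled.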
Unfortunately, due to an NDA I can't share the document for debugging
purposes, but here is the stack trace:
{code:java}
java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
{code}
I will attach a proposed patch soon.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)