Constantine Dokolas created PDFBOX-6182:
-------------------------------------------
Summary: Ignore DataFormatException in malformed page content
streams in 2.x, consistent with newer PDFBox handling
Key: PDFBOX-6182
URL: https://issues.apache.org/jira/browse/PDFBOX-6182
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.36
Reporter: Constantine Dokolas
Attachments: P117-redacted-Adobe.pdf
The attached redacted sample PDF reproduces a PDFBox 2.x failure while parsing
page content streams. Processing page 1 aborts with
{{java.util.zip.DataFormatException: invalid stored block lengths}}. This
is the same zlib symptom seen in PDFBOX-3526, and it is related to the more
lenient malformed-stream handling that is present in the codebase around the
PDFBOX-5675/3.0.3 timeframe.
In this case, the bad stream should be ignored instead of causing the whole
page parse to fail. Newer PDFBox code already follows that direction: PDPage
skips malformed content substreams, and
{{NonSeekableRandomAccessReadInputStream}} explicitly notes that if some data
could be read, an exception should not be thrown just because the stream ends
with a {{DataFormatException}}. I am requesting the same behavior in 2.x
for this malformed content-stream case.
A minimal reproduction is:
{code:java}
import java.io.File;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.pdmodel.PDDocument;

public class Repro
{
    // PDFStreamEngine is abstract, so an empty subclass is enough to drive parsing.
    private static class Engine extends PDFStreamEngine
    {
    }

    public static void main(String[] args) throws Exception
    {
        try (PDDocument doc = PDDocument.load(new File("P117-redacted-Adobe.pdf")))
        {
            PDFStreamEngine engine = new Engine();
            // Page index 0 is page 1; this call fails with the DataFormatException.
            engine.processPage(doc.getPage(0));
        }
    }
}
{code}
The expected result is that PDFBox keeps any already-decoded bytes, skips the
malformed content substream, and continues best-effort page processing instead
of failing the whole page with {{DataFormatException}}.
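
To illustrate the "keep already-decoded bytes" part of the request, here is a minimal sketch that uses only {{java.util.zip}}. It is not PDFBox code and not a proposed patch: the class {{LenientInflate}} and its methods {{inflateBestEffort}} and {{malformedStream}} are hypothetical names made up for this example. The helper builds a zlib stream whose first stored block is valid and whose second stored block has a bad LEN/NLEN pair, which is the condition zlib reports as "invalid stored block lengths". Bytes produced during the {{inflate}} call that throws may be lost, so the sketch only keeps output from the calls that completed before the failure.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical illustration only; this is not PDFBox code.
public class LenientInflate
{
    // Inflates zlib data and returns whatever was decoded before a DataFormatException.
    public static byte[] inflateBestEffort(byte[] compressed)
    {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try
        {
            while (!inflater.finished())
            {
                int n = inflater.inflate(buf);
                if (n == 0)
                {
                    break; // no progress (e.g. truncated input): stop instead of looping
                }
                out.write(buf, 0, n);
            }
        }
        catch (DataFormatException e)
        {
            // Malformed data: keep the bytes collected so far instead of rethrowing.
        }
        finally
        {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Builds a zlib stream: one valid stored block holding the payload (at most
    // 65535 bytes), followed by a stored block whose NLEN is not the complement of LEN.
    public static byte[] malformedStream(byte[] payload)
    {
        int len = payload.length;
        ByteArrayOutputStream s = new ByteArrayOutputStream();
        s.write(0x78);
        s.write(0x01);                 // zlib header
        s.write(0x00);                 // BFINAL=0, BTYPE=00 (stored)
        s.write(len & 0xFF);
        s.write((len >>> 8) & 0xFF);   // LEN, little-endian
        s.write(~len & 0xFF);
        s.write((~len >>> 8) & 0xFF);  // NLEN = one's complement of LEN
        s.write(payload, 0, len);
        s.write(0x01);                 // BFINAL=1, BTYPE=00 (stored)
        s.write(5);
        s.write(0);                    // LEN = 5
        s.write(0);
        s.write(0);                    // NLEN = 0, which is invalid
        return s.toByteArray();
    }

    public static void main(String[] args)
    {
        byte[] payload = new byte[3000];
        for (int i = 0; i < payload.length; i++)
        {
            payload[i] = (byte) (i % 251);
        }
        byte[] decoded = inflateBestEffort(malformedStream(payload));
        System.out.println("recovered " + decoded.length + " of " + payload.length + " bytes");
    }
}
```

With a strict decoder the exception would propagate and all 3000 payload bytes would be discarded. The sketch instead returns a prefix of the payload, which is the behavior I am asking for at the content-stream level.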