Constantine Dokolas created PDFBOX-6182:
-------------------------------------------

             Summary: Ignore DataFormatException in malformed page content streams in 2.x, consistent with newer PDFBox handling
                 Key: PDFBOX-6182
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6182
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.36
            Reporter: Constantine Dokolas
         Attachments: P117-redacted-Adobe.pdf

The attached redacted sample PDF reproduces a PDFBox 2.x failure while parsing 
page content streams. Processing page 1 aborts with 
{{java.util.zip.DataFormatException: invalid stored block lengths}}. This 
is the same zlib symptom seen in PDFBOX-3526. It is also related to the more 
lenient malformed-stream handling that was added later and is present in the 
codebase around the PDFBOX-5675/3.0.3 timeframe.

In this case, the bad stream should be ignored instead of causing the whole 
page parse to fail. Newer PDFBox code already follows that direction: PDPage 
skips malformed content substreams, and 
{{NonSeekableRandomAccessReadInputStream}} explicitly notes that if some data 
could be read, an exception should not be thrown just because the stream ends 
with a {{DataFormatException}}. I am requesting the same behavior in 2.x 
for this malformed content-stream case.
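
To illustrate the "keep what could be read" rule described above, here is a minimal JDK-only sketch. It is not PDFBox code: the class {{LenientInflate}} and the method {{inflateLeniently}} are hypothetical names chosen for this example, and the real filter code in PDFBox is structured differently.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of PDFBox.
public class LenientInflate
{
    /**
     * Inflates zlib data and returns whatever was decoded before the
     * stream turned out to be malformed or truncated. The exception is
     * rethrown only when no bytes at all could be decoded.
     */
    public static byte[] inflateLeniently(byte[] compressed) throws DataFormatException
    {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try
        {
            inflater.setInput(compressed);
            while (!inflater.finished())
            {
                int n = inflater.inflate(buffer);
                if (n == 0)
                {
                    // No progress and not finished: input is truncated.
                    break;
                }
                out.write(buffer, 0, n);
            }
        }
        catch (DataFormatException e)
        {
            if (out.size() == 0)
            {
                // Nothing usable was decoded, so report the failure.
                throw e;
            }
            // Otherwise keep the bytes decoded so far and stop quietly.
        }
        finally
        {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With this approach a stream that is damaged near its end still yields its leading operators, while a stream that is garbage from the first byte still fails.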

A minimal reproduction is:
{code:java}
import java.io.File;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.pdmodel.PDDocument;

public class Repro
{
    private static class Engine extends PDFStreamEngine
    {
    }

    public static void main(String[] args) throws Exception
    {
        try (PDDocument doc = PDDocument.load(new File("P117-redacted-Adobe.pdf")))
        {
            PDFStreamEngine engine = new Engine();
            engine.processPage(doc.getPage(0));
        }
    }
}
{code}
The expected result is that PDFBox keeps any already-decoded bytes, skips the 
malformed content substream, and continues best-effort page processing instead 
of failing the whole page with {{DataFormatException}}.
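
As a rough sketch of that expected behavior, the following JDK-only example decodes a list of Flate-compressed substreams, keeps whatever each one yields, and moves on when one is malformed. The class {{SkipBadSubstreams}} and the method {{concatDecodable}} are hypothetical names for this illustration; the byte arrays stand in for a page's content streams, and the newline separator is an assumption made so that tokens from adjacent substreams do not run together.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only; not part of PDFBox.
public class SkipBadSubstreams
{
    /**
     * Decodes each substream in order and concatenates the results.
     * A substream that fails to decode contributes only the bytes read
     * before the failure (possibly none); later substreams still run.
     */
    public static byte[] concatDecodable(List<byte[]> substreams)
    {
        ByteArrayOutputStream page = new ByteArrayOutputStream();
        for (byte[] raw : substreams)
        {
            ByteArrayOutputStream part = new ByteArrayOutputStream();
            try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(raw)))
            {
                byte[] buf = new byte[2048];
                int n;
                while ((n = in.read(buf)) != -1)
                {
                    part.write(buf, 0, n);
                }
            }
            catch (IOException e)
            {
                // Malformed or truncated substream: keep what was decoded.
            }
            if (part.size() > 0)
            {
                page.write(part.toByteArray(), 0, part.size());
                page.write('\n'); // separator between substreams (assumption)
            }
        }
        return page.toByteArray();
    }
}
```

{{InflaterInputStream}} reports corrupt input as a {{ZipException}} and truncated input as an {{EOFException}}, both of which are {{IOException}} subclasses, so a single catch covers both cases here.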



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
