[ 
https://issues.apache.org/jira/browse/PDFBOX-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068096#comment-18068096
 ] 

Michael Klink commented on PDFBOX-6182:
---------------------------------------

{quote}In this case, the bad stream should be ignored instead of causing the 
whole page parse to fail.{quote}

While ignoring such errors may be good for digging some data out of broken 
PDFs, it lessens the fidelity of the reported data. Thus, IMO such errors 
should not be ignored in general.

To serve both purposes, one may want to add a {{PDFStreamEngine}} property to 
control this behavior. It might be a simple {{boolean}} or something as complex 
as a callback with the exception as argument.
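
A rough sketch of how the two variants could fit together, offered only as an illustration: it is a standalone model and does not use or change any PDFBox class. {{LenientEngineSketch}}, {{StreamErrorHandler}}, {{ContentStream}}, {{setIgnoreStreamErrors}}, and {{setStreamErrorHandler}} are all hypothetical names, and the default is assumed to stay strict so that existing callers keep seeing the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;

// Standalone model of the idea; none of these types exist in PDFBox.
class LenientEngineSketch
{
    /** Hypothetical callback: return true to skip the bad stream, false to rethrow. */
    @FunctionalInterface
    interface StreamErrorHandler
    {
        boolean shouldIgnore(Exception e);
    }

    /** Hypothetical stand-in for one content substream of a page. */
    @FunctionalInterface
    interface ContentStream
    {
        String decode() throws DataFormatException;
    }

    // Assumed default: strict, i.e. every decoding error is propagated.
    private StreamErrorHandler handler = e -> false;

    /** Callback variant: the caller decides per exception. */
    void setStreamErrorHandler(StreamErrorHandler handler)
    {
        this.handler = handler;
    }

    /** Boolean variant, expressed through the same callback. */
    void setIgnoreStreamErrors(boolean ignore)
    {
        this.handler = e -> ignore;
    }

    /** Decodes the streams in order, skipping a failing one only if the handler allows it. */
    List<String> processStreams(List<ContentStream> streams) throws DataFormatException
    {
        List<String> decoded = new ArrayList<>();
        for (ContentStream stream : streams)
        {
            try
            {
                decoded.add(stream.decode());
            }
            catch (DataFormatException e)
            {
                if (!handler.shouldIgnore(e))
                {
                    throw e; // strict mode: fail the whole page
                }
                // lenient mode: drop this stream and continue with the next one
            }
        }
        return decoded;
    }

    public static void main(String[] args) throws DataFormatException
    {
        List<ContentStream> streams = new ArrayList<>();
        streams.add(() -> "BT /F1 12 Tf");
        streams.add(() -> { throw new DataFormatException("invalid stored block lengths"); });
        streams.add(() -> "ET");

        LenientEngineSketch engine = new LenientEngineSketch();
        engine.setStreamErrorHandler(e -> e instanceof DataFormatException);
        System.out.println(engine.processStreams(streams));
    }
}
```

With a callback, a caller that cares about fidelity can also record which streams were skipped instead of silently losing them; a plain {{boolean}} cannot do that.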

> Ignore DataFormatException in malformed page content streams in 2.x, 
> consistent with newer PDFBox handling
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6182
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6182
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.36
>            Reporter: Constantine Dokolas
>            Priority: Major
>         Attachments: P117-redacted-Adobe.pdf
>
>
> The attached redacted sample PDF reproduces a PDFBox 2.x failure while 
> parsing page content streams. Processing page 1 aborts with 
> {{java.util.zip.DataFormatException: invalid stored block lengths}}. This 
> is the same basic zlib symptom seen in PDFBOX-3526, and it is closely related 
> in spirit to the more lenient malformed-stream handling that was added later 
> and is present in the codebase around the PDFBOX-5675/3.0.3 timeframe.
> In this case, the bad stream should be ignored instead of causing the whole 
> page parse to fail. Newer PDFBox code already follows that direction: PDPage 
> skips malformed content substreams, and 
> {{NonSeekableRandomAccessReadInputStream}} explicitly notes that if some data 
> could be read, an exception should not be thrown just because the stream ends 
> with a {{DataFormatException}}. I am requesting the same behavior in 2.x 
> for this malformed content-stream case.
> A minimal reproduction is:
> {code:java}
> import java.io.File;
> import org.apache.pdfbox.contentstream.PDFStreamEngine;
> import org.apache.pdfbox.pdmodel.PDDocument;
> public class Repro
> {
>     private static class Engine extends PDFStreamEngine
>     {
>     }
>     public static void main(String[] args) throws Exception
>     {
>         try (PDDocument doc = PDDocument.load(new File("P117-redacted-Adobe.pdf")))
>         {
>             PDFStreamEngine engine = new Engine();
>             engine.processPage(doc.getPage(0));
>         }
>     }
> }
> {code}
> The expected result is that PDFBox keeps any already-decoded bytes, skips the 
> malformed content substream, and continues best-effort page processing 
> instead of failing the whole page with {{DataFormatException}}.
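
The "keep any already-decoded bytes" expectation quoted above can be illustrated with plain {{java.util.zip}}, independent of PDFBox. The following is a minimal sketch under that assumption, not the PDFBox implementation; {{BestEffortInflate}} and {{inflateLenient}} are hypothetical names. It only keeps what earlier {{inflate()}} calls already returned, so how much is salvaged from a given broken stream depends on where the error surfaces.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only; not part of PDFBox.
class BestEffortInflate
{
    /** Inflates zlib data and returns whatever was decoded before an error or truncation. */
    static byte[] inflateLenient(byte[] compressed)
    {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Small buffer: output is collected in small steps across many inflate() calls.
        byte[] buf = new byte[64];
        try
        {
            while (!inflater.finished())
            {
                int n = inflater.inflate(buf);
                if (n == 0)
                {
                    break; // no progress possible, e.g. the input is truncated
                }
                out.write(buf, 0, n);
            }
        }
        catch (DataFormatException e)
        {
            // Malformed data: keep the bytes written by earlier inflate() calls.
        }
        finally
        {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args)
    {
        byte[] original = "BT /F1 12 Tf (Hello) Tj ET\n".getBytes(StandardCharsets.US_ASCII);

        // Compress the sample with the standard Deflater.
        Deflater deflater = new Deflater();
        deflater.setInput(original);
        deflater.finish();
        byte[] tmp = new byte[256];
        int len = deflater.deflate(tmp);
        deflater.end();

        // Simulate a damaged stream by cutting off the last few bytes.
        byte[] truncated = Arrays.copyOf(tmp, len - 6);
        byte[] salvaged = inflateLenient(truncated);
        System.out.println("salvaged " + salvaged.length + " of " + original.length + " bytes");
    }
}
```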



