Hi James,

According to the error message, the stream is not properly terminated by the token 'endstream'. The PDF might be broken, but it could also be a valid one where the sequential parser you are using is processing junk data within the file.

I would recommend using the non-sequential parser via PDDocument.loadNonSeq() instead of PDDocument.load(). Since you are using PDFMergerUtility, which currently does not offer an option to choose the other parser, you could create your own class by copying PDFMergerUtility and replacing the relevant calls (the scratchFile parameter may be set to null, or to a memory- or file-based RandomAccess instance). You could also file a JIRA feature request for adding such an option to PDFMergerUtility, preferably with a patch :-)
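A minimal sketch of the idea, assuming PDFBox 1.8.x, where PDDocument.loadNonSeq(File, RandomAccess) and the public PDFMergerUtility.appendDocument(PDDocument, PDDocument) are available. Instead of copying the whole utility class, it loads every source with the non-sequential parser and appends the documents itself. The class name NonSeqMerge and the method merge are made up for illustration and are not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFMergerUtility;

// Hypothetical helper (not part of PDFBox): merges PDFs that were loaded
// with the non-sequential parser. Assumes the PDFBox 1.8.x API.
public class NonSeqMerge {

    public static void merge(List<File> sources, File target)
            throws IOException, COSVisitorException {
        if (sources.isEmpty()) {
            throw new IllegalArgumentException("at least one source PDF is required");
        }
        PDFMergerUtility merger = new PDFMergerUtility();
        List<PDDocument> opened = new ArrayList<PDDocument>();
        try {
            for (File source : sources) {
                // scratchFile == null: no scratch file is supplied
                opened.add(PDDocument.loadNonSeq(source, null));
            }
            // The first document becomes the destination, the rest are appended.
            PDDocument destination = opened.get(0);
            for (int i = 1; i < opened.size(); i++) {
                merger.appendDocument(destination, opened.get(i));
            }
            destination.save(target.getAbsolutePath());
        } finally {
            // Sources stay open until the destination has been saved,
            // because the merged document still refers to their streams.
            for (PDDocument doc : opened) {
                try {
                    doc.close();
                } catch (IOException ignored) {
                    // best effort: keep closing the remaining documents
                }
            }
        }
    }
}
```

It could be called as NonSeqMerge.merge(Arrays.asList(new File("a.pdf"), new File("b.pdf")), new File("merged.pdf")), where the file names are placeholders.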

If the error persists, then the PDF is broken and cannot be read by PDFBox (more repair mechanisms might be added in version 2.0).


Best,
Timo


On 05.03.2014 12:11, James Carter (JIRA) wrote:

     [ https://issues.apache.org/jira/browse/PDFBOX-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920743#comment-13920743 ]

James Carter commented on PDFBOX-1920:
--------------------------------------

Hi Timo, I'm the developer of the 3rd party solution that is using PDFBox in 
this case.  If I understand the thread correctly, 3rd party PDF applications 
are creating invalid PDFs that PDFBox attempts to 'repair'. I've tried 
increasing the pushBackSize property, but I am now encountering a different 
exception during the merge (I've included the code excerpt + exception below). 
Is this something that PDFBox could handle/repair, or do we need to handle this 
elsewhere (e.g. validate the PDFs users are uploading and tell them if one is 
invalid)?

System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "999000");
PDFMergerUtility mergePdf = new PDFMergerUtility();
FileOutputStream fos = new FileOutputStream("test.pdf");

mergePdf.addSource("docs/01. Heads of Terms (Signed).pdf");
mergePdf.setDestinationStream(fos);
mergePdf.mergeDocuments();


Exception in thread "main" java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@45cb0cdc
     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:196)
     at com.acme.MergePDF.runSmartService(MergePDF.java:52)
     at com.acme.MergePDF.main(MergePDF.java:68)

Buffer Error when trying to run node
------------------------------------

                 Key: PDFBOX-1920
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1920
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
            Reporter: Chris Hewkin
            Assignee: Timo Boehme
         Attachments: Application.zip


Description: Trying to merge PDF using the latest Merge PDF Node but getting the following error
There is a problem with task “Merge PDF” in the process “Create Application Pack”
Problem: An error occurred in executing an Activity Class.
Details: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 628696 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
Recommended Action: Examine the activity class to correct the error and then resume.
Priority of this problem: High Priority






--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
