Re: [jira] [Commented] (PDFBOX-1920) Buffer Error when trying to run node

Timo Boehme Wed, 05 Mar 2014 04:45:12 -0800

Hi James,

according to the error message the stream is not properly terminated by the token 'endstream'. While it might be a broken PDF, it could also be a valid one, but the sequential parser you are using might be processing junk data within the PDF.

I would recommend using the non-sequential parser with PDDocument.loadNonSeq() instead of PDDocument.load(). Since you are using PDFMergerUtility, which currently does not provide an option to choose the other parser, you could create your own class by copying PDFMergerUtility and replacing the relevant calls (the scratchFile parameter may be set to null or to a memory or file instance of RandomAccess). You could also file a JIRA feature request to add such an option to PDFMergerUtility, preferably with a patch :-)

If the error still exists, then the PDF is broken and cannot be read by PDFBox (some more healing mechanisms might be added in version 2.0).
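
For rejecting such files at upload time, a minimal sketch follows. It is not part of PDFBox: the names UploadCheck, PdfLoader and findUnreadable are hypothetical, and the sketch assumes only that the parser call throws an IOException for a file it cannot read. With PDFBox on the classpath the loader would wrap PDDocument.loadNonSeq(file, null) and close the returned document; the demo below uses a stand-in so that it compiles on its own.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of PDFBox: finds uploads that the parser rejects. */
class UploadCheck {

    /** Hypothetical hook for the call that parses one PDF and throws IOException on failure. */
    interface PdfLoader {
        void load(File file) throws IOException;
    }

    /** Returns each rejected file mapped to the parser's error message, in input order. */
    static Map<File, String> findUnreadable(List<File> uploads, PdfLoader loader) {
        Map<File, String> rejected = new LinkedHashMap<>();
        for (File upload : uploads) {
            try {
                loader.load(upload);
            } catch (IOException e) {
                // Keep going so the user gets a complete list of bad files.
                rejected.put(upload, String.valueOf(e.getMessage()));
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        List<File> uploads = new ArrayList<>();
        for (String path : args) {
            uploads.add(new File(path));
        }
        // Stand-in loader for this demo only: it merely checks that the file exists.
        // Assumption: the real loader would parse the file with PDFBox instead.
        PdfLoader loader = file -> {
            if (!file.isFile()) {
                throw new IOException("not a readable file");
            }
        };
        for (Map.Entry<File, String> entry : findUnreadable(uploads, loader).entrySet()) {
            System.out.println(entry.getKey() + " rejected: " + entry.getValue());
        }
    }
}
```

Checking every file before the merge starts means the user sees all unreadable uploads at once, rather than one exception per attempt.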



Best,
Timo


On 05.03.2014 12:11, James Carter (JIRA) wrote:


     [ https://issues.apache.org/jira/browse/PDFBOX-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920743#comment-13920743 ]

James Carter commented on PDFBOX-1920:
--------------------------------------

Hi Timo, I'm the developer of the 3rd party solution that is using PDFBox in 
this case.  If I understand the thread correctly, 3rd party PDF applications 
are creating invalid PDFs that PDFBox attempts to 'repair'. I've tried 
increasing the pushBackSize property, but I am now encountering a different 
exception during the merge (I've included the code excerpt + exception below). 
Is this something that PDFBox could handle/repair, or do we need to handle this 
elsewhere (e.g. validate the PDFs users are uploading and tell them if one is 
invalid)?

System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "999000");
PDFMergerUtility mergePdf = new PDFMergerUtility();
FileOutputStream fos = new FileOutputStream("test.pdf");

mergePdf.addSource("docs/01. Heads of Terms (Signed).pdf");
mergePdf.setDestinationStream(fos);
mergePdf.mergeDocuments();


Exception in thread "main" java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@45cb0cdc
     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
     at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:196)
     at com.acme.MergePDF.runSmartService(MergePDF.java:52)
     at com.acme.MergePDF.main(MergePDF.java:68)

Buffer Error when trying to run node
------------------------------------

                 Key: PDFBOX-1920
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1920
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
            Reporter: Chris Hewkin
            Assignee: Timo Boehme
         Attachments: Application.zip


Description: Trying to merge PDF using the latest Merge PDF Node but getting 
the following error
There is a problem with task “Merge PDF” in the process “Create Application 
Pack”
Problem: An error occurred in executing an Activity Class.
Details: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 
628696 bytes in order to reparse stream. Try increasing push back buffer using 
system property org.apache.pdfbox.baseParser.pushBackSize
Recommended Action: Examine the activity class to correct the error and then 
resume.
Priority of this problem: High Priority




--
This message was sent by Atlassian JIRA
(v6.2#6252)



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
