Hi James,
according to the error message, the stream is not properly terminated by
the token 'endstream'. The PDF might be broken, but it could also be a
valid one in which the sequential parser you are using is processing
junk data within the PDF.
I would recommend using the non-sequential parser via
PDDocument.loadNonSeq() instead of PDDocument.load(). Since you are
using PDFMergerUtility, which currently does not provide an option to
choose the other parser, you could create your own class by copying
PDFMergerUtility and replacing the relevant calls (the scratchFile parameter
may be set to null or to an in-memory or file-based instance of RandomAccess).
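As a rough illustration of that idea, here is a minimal sketch. It assumes PDFBox 1.8.x on the classpath and relies on the public PDFMergerUtility.appendDocument() method instead of copying the whole class. The class name NonSeqMerge and its merge() method are made up for this example and are not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFMergerUtility;

/** Hypothetical helper (not part of PDFBox): merges PDFs using the non-sequential parser. */
public class NonSeqMerge {

    public static void merge(List<File> sources, File target)
            throws IOException, COSVisitorException {
        if (sources.isEmpty()) {
            throw new IllegalArgumentException("no source files given");
        }
        PDFMergerUtility merger = new PDFMergerUtility();
        // Source documents must stay open until the destination has been saved.
        List<PDDocument> opened = new ArrayList<PDDocument>();
        try {
            // scratchFile is null here; a RandomAccess instance could be passed instead.
            PDDocument destination = PDDocument.loadNonSeq(sources.get(0), null);
            opened.add(destination);
            for (File file : sources.subList(1, sources.size())) {
                PDDocument source = PDDocument.loadNonSeq(file, null);
                opened.add(source);
                merger.appendDocument(destination, source);
            }
            destination.save(target.getAbsolutePath());
        } finally {
            for (PDDocument doc : opened) {
                doc.close();
            }
        }
    }
}
```

In your excerpt this would replace the addSource()/mergeDocuments() calls, for example NonSeqMerge.merge(Arrays.asList(new File("docs/01. Heads of Terms (Signed).pdf")), new File("test.pdf")). An IOException thrown from loadNonSeq() would then indicate a file that this parser cannot read either.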
You could file a JIRA feature request for adding such an option to
PDFMergerUtility - preferably with a patch :-)
If the error still occurs, then the PDF is broken and cannot be read by
PDFBox (some more healing mechanisms might be added in version 2.0).
Best,
Timo
On 05.03.2014 12:11, James Carter (JIRA) wrote:
[
https://issues.apache.org/jira/browse/PDFBOX-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920743#comment-13920743
]
James Carter commented on PDFBOX-1920:
--------------------------------------
Hi Timo, I'm the developer of the 3rd party solution that is using PDFBox in
this case. If I understand the thread correctly, 3rd party PDF applications
are creating invalid PDFs that PDFBox attempts to 'repair'. I've tried
increasing the pushBackSize property, but I'm now encountering a different
exception during the merge (I've included the code excerpt + exception below).
Is this something that PDFBox could handle/repair, or do we need to handle this
elsewhere (e.g. validate the PDFs users are uploading and tell them if they are
invalid)?
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "999000");
PDFMergerUtility mergePdf = new PDFMergerUtility();
FileOutputStream fos = new FileOutputStream("test.pdf");
mergePdf.addSource("docs/01. Heads of Terms (Signed).pdf");
mergePdf.setDestinationStream(fos);
mergePdf.mergeDocuments();
Exception in thread "main" java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@45cb0cdc
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
    at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:196)
    at com.acme.MergePDF.runSmartService(MergePDF.java:52)
    at com.acme.MergePDF.main(MergePDF.java:68)
Buffer Error when trying to run node
------------------------------------
Key: PDFBOX-1920
URL: https://issues.apache.org/jira/browse/PDFBOX-1920
Project: PDFBox
Issue Type: Bug
Components: Utilities
Reporter: Chris Hewkin
Assignee: Timo Boehme
Attachments: Application.zip
Description: Trying to merge PDF using the latest Merge PDF Node but getting
the following error
There is a problem with task “Merge PDF” in the process “Create Application
Pack”
Problem: An error occurred in executing an Activity Class.
Details: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back
628696 bytes in order to reparse stream. Try increasing push back buffer using
system property org.apache.pdfbox.baseParser.pushBackSize
Recommended Action: Examine the activity class to correct the error and then
resume.
Priority of this problem: High Priority
--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com