[jira] [Commented] (PDFBOX-5602) Consider adding support for PDF files Concatenation in addition to the full Merge

Zbigniew Minciel (Jira) Sat, 13 May 2023 08:47:54 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722379#comment-17722379 ]


Zbigniew Minciel commented on PDFBOX-5602:
------------------------------------------

You are right, I don't have a good understanding of PDF. "Index table" was the 
wrong term; I think I meant a table of contents generated from the structure 
tree.

I am looking for something close to concatenation that reduces the processing 
overhead and memory use, so that merging large PDF files completes 
successfully in the shortest possible time.

In my test, the total size of the input PDF files was about 2.5 GB, but Task 
Manager showed memory usage of 13 GB at some point. The merge ran for 6 hours 
before failing.

I was wondering whether the merge process can be optimized and a *Lite Merge* 
option supported by the official PDFBox.

I would appreciate your comments on the following:
 # Does PDFBox load the content of all the PDF files and keep it in memory 
until it can generate the target PDF?
 # PDFBox was using close to 13 GB of memory at some point. Is this expected?
 # The merge, had it succeeded, would have taken at least 6 hours to complete. 
Is this reasonable and to be expected?
 # You mention the {{appendDocument()}} method. Does it append the document to 
the target PDF file on disk or to the target PDF in memory?
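
On question 2, one thing that can be checked independently of PDFBox is the 
heap ceiling of the JVM itself. The stack trace in the issue description ends 
in {{OutOfMemoryError: Java heap space}}, so the limit that was hit was the 
Java heap, not physical RAM. Many JVMs default the maximum heap to roughly a 
quarter of physical memory when {{-Xmx}} is not given; whether that explains 
the 13 GB peak on a 48 GB machine is only an assumption on my part. The sketch 
below uses nothing but the standard library, and the class name {{HeapCheck}} 
is hypothetical:

```java
// Prints the heap limits of the running JVM. Standard library only.
final class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Upper bound the JVM will try to use for the heap (reflects -Xmx if set). */
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap  (MiB): " + maxHeapMiB());
        System.out.println("committed (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free      (MiB): " + toMiB(rt.freeMemory()));
    }
}
```

Running it once without options and once with a larger {{-Xmx}} value would 
show whether the heap ceiling, rather than the machine, was the constraint.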

I am using the PDFBox merge and it works very reliably, which I appreciate 
very much. It would be great if something like Lite Merge were supported by 
the official release. I suspect more users would be interested in such a 
feature.

Respectfully,

Zbigniew

 

 

> Consider adding support for PDF files Concatenation in addition to the full Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>
> I decided to evaluate the limits of pdfbox 3.0.0-alpha3 when merging a large 
> number of PDF files.
> I attempted to merge 7500 mails, each in a separate PDF file, on Windows. 
> Given the limit on the maximum size of the command-line arguments, I merged 
> subsets of files and ended up with 5 large PDF files, each around 
> 500-600 MB. I then tried to merge these 5 files, but the merge eventually 
> failed after running for more than 6 hours. See the error log at the bottom. 
> The machine has 48 GB of RAM. PDFBox memory usage fluctuated between 600 MB 
> and a peak of 13 GB.
> I am wondering whether PDFBox could support a Concatenation mode in addition 
> to the full Merge mode, with no need to create an index table, etc. Given my 
> total lack of understanding of how PDF works, I suppose it could work as 
> follows:
>  # Read the first file, process it, and append it to the target PDF file. 
> Delete the PDF data and related metadata for this file, except perhaps the 
> last page number.
>  # Read the second file and process it in the same way as in step 1.
>  # etc.
> If Concatenation is possible, it would greatly reduce the CPU and memory 
> overhead and the processing time.
> I admit that merging such a large number of PDF files is not typical, but 
> the issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>     at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>     at java.base/java.util.Hashtable.put(Hashtable.java:493)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>     at picocli.CommandLine.access$1300(CommandLine.java:145)
>     at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>     at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>     at picocli.CommandLine.execute(CommandLine.java:2078)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew
>  
>  
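
The quoted description mentions merging subsets of files to stay under the 
Windows command-line length limit. A minimal sketch of that batching step, 
using only the standard library, might look like the following. The class name 
{{BatchSplitter}}, the file names, and the batch size of 1500 are all 
hypothetical; the issue does not say how the subsets were chosen, only that 
7500 files ended up as 5 intermediate PDFs.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list of input paths into fixed-size batches so that each batch
// can be passed to a separate merge invocation.
final class BatchSplitter {

    /** Returns consecutive sublists of at most batchSize elements each. */
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 1; i <= 7500; i++) {
            files.add("mail-" + i + ".pdf"); // placeholder names
        }
        List<List<String>> batches = split(files, 1500);
        System.out.println(batches.size() + " batches");
    }
}
```

Each batch would then be handed to one merge run, and the intermediate outputs 
merged in a final pass, which is the step that failed in the report above.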



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

