[ https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723244#comment-17723244 ]
Zbigniew Minciel commented on PDFBOX-5602:
------------------------------------------

That explains the issue. I am not familiar with the PDF specification. I was hoping that PDF supports some sort of "include" mode in which each chapter's metadata is independent.

The test I was doing is extreme; it was just a way to learn a bit about pdfbox and the PDF spec, and maybe contribute something in the process. In reality I don't expect users of MBox Viewer to merge thousands of mails. Viewing and searching a huge PDF document is not the best experience anyway. Pdfbox does a good job merging, say, 100 mails, which is on the high end anyway.

Thanks for your time.

It sounds like if I wanted a pdfbox version without the structure tree, I would need to fork pdfbox and build a custom version myself -:((((. It looks like there are no licensing issues with doing that. I am not sure I would want to do that and keep up to date with the official pdfbox.

Respectfully,
Zbigniew

> Consider adding support for PDF files Concatenation in addition to the full
> Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>         Attachments: CapturePdfDebugger.PNG, Large527MbytesPDF.PNG,
> cpu-20-middle.PNG, cpu-hot-spots-3.0.0-SNAPSHOT-1.PNG,
> cpu-hot-spots-3.0.0-SNAPSHOT.PNG, cpu-hot-spots-3.0.0-alpha3-1.PNG,
> cpu-hot-spots-3.0.0-alpha3.PNG
>
>
> I decided to evaluate the limits of pdfbox 3.0.0-alpha3 when merging a large
> number of PDF files.
> I attempted to merge 7500 mails, each in a separate PDF file, on Windows.
> Given the limit on the maximum size of the command-line arguments, I merged
> subsets of files. I ended up with 5 large PDF files, each around
> 500-600 MBytes. I tried to merge these 5 files, but the merge eventually
> failed after running for more than 6 hours.
> See the error log at the bottom. I have a large amount of RAM (48 GBytes).
> PDFBox used up to 13 GB of memory; usage fluctuated between 600 MB and 13 GB.
> I am wondering whether PDFBox could support a Concatenation mode in addition
> to the full Merge mode. No need to create an index table, etc. I suppose it
> could work as follows, given my total lack of understanding of how PDF works:
> # Read the first file, process it and append it to the target PDF file.
> Delete the PDF data and related metadata for this file, except perhaps the
> last page number.
> # Read the second file and process it in the same fashion as in step 1.
> # etc.
> If Concatenation is possible, it would greatly reduce the CPU and memory
> overhead and reduce processing time.
> I admit that merging such a large number of PDF files is not typical, but the
> issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>         at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>         at java.base/java.util.Hashtable.put(Hashtable.java:493)
>         at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>         at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>         at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>         at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>         at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>         at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>         at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>         at picocli.CommandLine.access$1300(CommandLine.java:145)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>         at picocli.CommandLine.execute(CommandLine.java:2078)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
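
The batching step described in the quoted report (merging subsets of the 7500 files to stay under the command-line length limit) could look roughly like the sketch below. It is only an illustration, not what was actually run: the helper name `split_into_batches`, the file names, and the batch size of 1500 (picked so that 7500 files give 5 batches) are all assumptions, and no merge tool is invoked.

```python
from typing import List, Sequence


def split_into_batches(paths: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split paths into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(paths[i:i + batch_size]) for i in range(0, len(paths), batch_size)]


# Hypothetical input: 7500 per-mail PDF files (the names are made up).
mail_pdfs = [f"mail-{n:04d}.pdf" for n in range(1, 7501)]

# Assumed batch size; in practice it depends on path lengths and the OS limit.
batches = split_into_batches(mail_pdfs, 1500)

for index, batch in enumerate(batches, start=1):
    # Each batch would be merged into one intermediate PDF by a separate run.
    print(f"batch {index}: {len(batch)} files, {batch[0]} .. {batch[-1]}")
```

Batching only works around the argument-length limit. The intermediate files still have to be merged in a final run, which is the step that ran out of heap space in the report.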