[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428538#comment-16428538 ]
Tilman Hausherr edited comment on PDFBOX-4182 at 10/24/24 2:06 AM:
-------------------------------------------------------------------

There was a recent SO issue about this problem and how to work around it: [https://stackoverflow.com/questions/48643074/how-to-make-streamed-pdf-merging-without-memory-consumption]

Closing the files earlier can't be done because in some rare cases resources are not properly cloned, so they are still used for the destination. Sadly I can't find the issue... I think it was related to the structure tree. Opening the files later wouldn't have any effect, given that we can't close earlier.

An alternative would be to use sambox [https://github.com/torakiki/sambox], a PDFBox clone specialized in split and merge.
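For reference, a minimal sketch of that workaround, assuming nothing beyond the public 2.0 API: merge with MemoryUsageSetting.setupTempFileOnly() so intermediate data is buffered in temporary files rather than on the heap. The input and output file names are placeholders.

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class TempFileMergeExample
{
    public static void main(String[] args) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Placeholder source files; any number of sources can be added.
        merger.addSource(new File("part1.pdf"));
        merger.addSource(new File("part2.pdf"));
        merger.setDestinationFileName("merged.pdf");

        // Buffer all intermediate data in temp files instead of main memory.
        // A bounded mix is also possible, e.g.
        // MemoryUsageSetting.setupMixed(1024L * 1024L, maxStorageBytes).
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
{code}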
> Improve memory usage of PDFMergerUtility
> ----------------------------------------
>
>                 Key: PDFBOX-4182
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4182
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9
>            Reporter: Pas Filip
>            Priority: Major
>         Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, Suppliers.java, failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, merge-pdf-stats.xlsx, merge-utility.patch, oom-2gb-heap-after-refactoring-leak-suspect-1.png, oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - refactored-merge-utility-4gb-heap-2618-files-merged.png, successful -merge-utility-6gb-heap-2618-files-merged.png, successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, successful-merge-utility-8gb-heap-2618-files-merged.png, successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
> I have been running some tests trying to merge a large number (2618) of small PDF documents, between 100kb and 130kb each, into a single large PDF (288.433kb).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to consume the majority of the memory (see the MAT screenshot in the attachments).
> (I would attach the hprof so you can analyze it yourselves, but it's rather large.)
> Note that it seems impossible to generate a large PDF with a small memory footprint.
> I personally thought that using MemoryUsageSetting with temporary file only would allow me to generate arbitrarily large PDF files, but it doesn't seem to help.
> I've run mergeDocuments with these memory settings:
> * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)
> * MemoryUsageSetting.setupTempFileOnly()
> The refactored version completes with a *4GB* heap:
> with temp file only, 2618 documents complete in 1.760 min
> *vs.* an *8GB* heap:
> with temp file only, 2618 documents complete in 2.0 min
> Heaps of 6GB or less result in OOM. (I didn't try sizes between 6GB and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, and they are only closed after the merge is completed.
> Refactoring the code to close each document as it is used, instead of accumulating them all and closing them at the end, improves memory usage considerably (although, based on the MAT analysis, the problem doesn't seem to be eliminated completely). A simplified sketch of this approach follows after this quote.
> Another change I've implemented is to only create the InputStream when the file needs to be read, and to close it alongside the PDDocument.
> (Some InputStreams contain buffers, and depending on the buffer sizes and/or the stream type, accumulating all the streams is a potential memory hog.)
> These changes seem beneficial: I can process the same number of PDFs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase. (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
> * Suppliers
> * Supplier
> * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.)
> The attachments also include some VisualVM screenshots showing the memory usage of the original version and the refactored version, as well as some info produced by MAT after analyzing the heap.
> If you know of any other way to merge large sets of PDF files into a single large PDF without running into memory issues, I'd love to hear about it!
> I'd also suggest further improvements to memory usage in general, as PDFBox seems to consume a lot of memory overall.
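For illustration, a simplified sketch of the close-as-you-go refactoring described in the report above. This is not the attached PDFMergerUtilityUsingSupplier, just the underlying idea: load each source only when it is needed, append it to the destination with appendDocument, and close it immediately. Per the comment at the top, closing early can break the output in rare cases where resources are not fully cloned. Plain try/finally is used to match the reporter's Java 6 constraint; the class name is hypothetical.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

public final class CloseEarlyMergeSketch
{
    public static void merge(List<File> sources, File destination) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // The destination buffers its data in temp files rather than on the heap.
        PDDocument dest = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
        try
        {
            for (File file : sources)
            {
                // Open each source lazily and close it right away instead of
                // accumulating open PDDocuments (and InputStreams) in a list.
                PDDocument src = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
                try
                {
                    merger.appendDocument(dest, src);
                }
                finally
                {
                    // Caveat: in rare cases (e.g. the structure tree) resources
                    // are not fully cloned, so closing here can corrupt output.
                    src.close();
                }
            }
            dest.save(destination);
        }
        finally
        {
            dest.close();
        }
    }
}
{code}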