[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428538#comment-16428538 ]
Tilman Hausherr edited comment on PDFBOX-4182 at 10/24/24 2:06 AM:
-------------------------------------------------------------------

There was a recent SO issue about this problem and how to work around it: [https://stackoverflow.com/questions/48643074/how-to-make-streamed-pdf-merging-without-memory-consumption]

Closing the files earlier can't be done because in some rare cases resources are not properly cloned, so they are still used for the destination. Sadly I can't find the issue... I think it was related to the structure tree. Opening the files later wouldn't have any effect, given that we can't close earlier.

An alternative would be to use sambox [https://github.com/torakiki/sambox], a PDFBox clone specialized in split and merge.
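For reference, a minimal sketch of that workaround, assuming nothing beyond the public 2.0 API: merge with MemoryUsageSetting.setupTempFileOnly() so intermediate data is buffered in temporary files rather than on the heap. The input and output file names are placeholders.

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class TempFileMergeExample
{
    public static void main(String[] args) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Placeholder source files; any number of sources can be added.
        merger.addSource(new File("part1.pdf"));
        merger.addSource(new File("part2.pdf"));
        merger.setDestinationFileName("merged.pdf");

        // Buffer all intermediate data in temp files instead of main memory.
        // A bounded mix is also possible, e.g.
        // MemoryUsageSetting.setupMixed(1024L * 1024L, maxStorageBytes).
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
{code}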
> Improve memory usage of PDFMergerUtility
> ----------------------------------------
>
>                 Key: PDFBOX-4182
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4182
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9
>            Reporter: Pas Filip
>            Priority: Major
>         Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, Suppliers.java, failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, merge-pdf-stats.xlsx, merge-utility.patch, oom-2gb-heap-after-refactoring-leak-suspect-1.png, oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - refactored-merge-utility-4gb-heap-2618-files-merged.png, successful -merge-utility-6gb-heap-2618-files-merged.png, successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, successful-merge-utility-8gb-heap-2618-files-merged.png, successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
> I have been running some tests trying to merge a large number (2618) of small PDF documents, between 100kb and 130kb each, into a single large PDF (288.433kb).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to consume the majority of the memory (see the MAT screenshot in the attachments).
> (I would attach the hprof so you can analyze it yourselves, but it's rather large.)
> Note that it seems impossible to generate a large PDF with a small memory footprint.
> I personally thought that using MemoryUsageSetting with temporary file only would allow me to generate arbitrarily large PDF files, but it doesn't seem to help.
> I've run mergeDocuments with these memory settings:
> * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)
> * MemoryUsageSetting.setupTempFileOnly()
> The refactored version completes with a *4GB* heap:
> with temp file only, 2618 documents complete in 1.760 min
> *vs.* an *8GB* heap:
> with temp file only, 2618 documents complete in 2.0 min
> Heaps of 6GB or less result in OOM. (I didn't try sizes between 6GB and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, and they are only closed after the merge is completed.
> Refactoring the code to close each document as it is used, instead of accumulating them all and closing them at the end, improves memory usage considerably (although, based on the MAT analysis, the problem doesn't seem to be eliminated completely). A simplified sketch of this approach follows after this quote.
> Another change I've implemented is to only create the InputStream when the file needs to be read, and to close it alongside the PDDocument.
> (Some InputStreams contain buffers, and depending on the buffer sizes and/or the stream type, accumulating all the streams is a potential memory hog.)
> These changes seem beneficial: I can process the same number of PDFs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase. (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
> * Suppliers
> * Supplier
> * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.)
> The attachments also include some VisualVM screenshots showing the memory usage of the original version and the refactored version, as well as some info produced by MAT after analyzing the heap.
> If you know of any other way to merge large sets of PDF files into a single large PDF without running into memory issues, I'd love to hear about it!
> I'd also suggest further improvements to memory usage in general, as PDFBox seems to consume a lot of memory overall.
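For illustration, a simplified sketch of the close-as-you-go refactoring described in the report above. This is not the attached PDFMergerUtilityUsingSupplier, just the underlying idea: load each source only when it is needed, append it to the destination with appendDocument, and close it immediately. Per the comment at the top, closing early can break the output in rare cases where resources are not fully cloned. Plain try/finally is used to match the reporter's Java 6 constraint; the class name is hypothetical.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

public final class CloseEarlyMergeSketch
{
    public static void merge(List<File> sources, File destination) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // The destination buffers its data in temp files rather than on the heap.
        PDDocument dest = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
        try
        {
            for (File file : sources)
            {
                // Open each source lazily and close it right away instead of
                // accumulating open PDDocuments (and InputStreams) in a list.
                PDDocument src = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
                try
                {
                    merger.appendDocument(dest, src);
                }
                finally
                {
                    // Caveat: in rare cases (e.g. the structure tree) resources
                    // are not fully cloned, so closing here can corrupt output.
                    src.close();
                }
            }
            dest.save(destination);
        }
        finally
        {
            dest.close();
        }
    }
}
{code}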