Dear PDFBox users,

I am using PDFBox to place overlays on lots of different input files. In 
general, this works very well and reliably. Thanks to everyone who worked on 
that!

However, there is one class of particularly awful input files, like the one at 
https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.

That is a beast of more than 50 MB and 2000+ pages, full of complex tables 
with lots of cells.

When I try to put a single-page PDF as an overlay on it with PDFBox 2.0.27, I 
have to start the JVM with e.g. 8 GB of heap memory, and it maxes out a CPU 
core on my machine for about 6 minutes. The maximum resident set size as 
reported by `time` is in the range of 2.4 GB. The result file is about four 
times the size of the input file.

With a snapshot build of 3.0.0, the max RSS does not seem to go above 1 GB, 
but processing had not finished after 15 minutes, at which point I aborted it. 
I tried 3.0.0 because of the remarks at 
https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage. However, the 
overlay will probably end up traversing all pages anyway, so the reduced 
memory usage described there may not make a big difference in this case.

My questions are:

- Is there anything I can do to make processing of such files faster or more 
efficient?
- What might be causing the increase in output file size, and can I do 
anything about it?

Thanks!
-mp.
