On 08.02.23 at 15:59, Matthias Pigulla wrote:
Dear PDFBox users,

I am using PDFBox to place overlays on lots of different input files. In 
general, this works very well and reliably – thanks to everyone who worked on 
that!

However, there is one class of particularly awful input files, like the one at 
https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.

That’s a beast of more than 50 MB and 2000+ pages, full of complex tables with 
lots of cells.
Are you adding that overlay to every single page of that PDF? What is the purpose of that overlay? Maybe a rubber stamp is a better approach?


When I try to put a single-page PDF as overlay on it with PDFBox 2.0.27, I have 
to start the JVM with e.g. 8 GB of heap memory, and it maxes out a CPU core on 
my machine for about 6 minutes. The maximum resident set size as reported by 
`time` is in the range of 2.4 GB. The result file is about four times the size 
of the input file.

With a snapshot build of 3.0.0, the max RSS seems not to go above 1 GB, but 
processing is not finished within 15 minutes (when I aborted). Regarding 3.0.0, 
I have seen the remarks at 
https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage, so I thought 
it might be worth a try. Probably the overlay will end up traversing all pages 
anyway, so that may not make a big difference.

If you are adding the overlay to all pages, the parser more or less has to dig through the whole PDF.

My questions are:

- Is there anything I can do to make processing of such files faster or more 
efficient?
A rubber stamp may be a better approach than an overlay. With regard to 3.0.0, have a look at the kind of input source you are using: there are several implementations of org.apache.pdfbox.io.RandomAccessRead. The migration guide may give you some additional hints about how to use the input source.
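
To make the input source hint a bit more concrete, here is a JDK-only sketch of the three access strategies that, as far as I can tell, the RandomAccessRead implementations in 3.0 roughly correspond to: everything on the heap, buffered reads from the file, and a memory-mapped file. It deliberately does not call any PDFBox API; the class name InputSourceSketch and its methods are made up for illustration, so treat it as an analogy for the trade-offs rather than as what PDFBox does internally.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class, not part of PDFBox.
public class InputSourceSketch {

    /** Strategy 1: load the whole file onto the Java heap, then index into it. */
    public static byte readFromHeap(Path file, long offset) throws IOException {
        byte[] all = Files.readAllBytes(file); // heap use grows with file size
        return all[(int) offset];
    }

    /** Strategy 2: seek inside the file on disk; heap use stays small. */
    public static byte readFromFile(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.readByte();
        }
    }

    /** Strategy 3: memory-map the file; the OS pages data in on demand, outside the heap. */
    public static byte readFromMapping(Path file, long offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return map.get((int) offset);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small throwaway file standing in for a large PDF.
        Path tmp = Files.createTempFile("input-source-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {10, 20, 30, 40, 50});

        // All three strategies read the same byte; they differ in where the data lives.
        System.out.println(readFromHeap(tmp, 3));
        System.out.println(readFromFile(tmp, 3));
        System.out.println(readFromMapping(tmp, 3));
    }
}
```

For a 50 MB input, the first strategy costs at least 50 MB of heap before any parsing starts, while the other two trade heap for disk or page-cache access. Which one is fastest for a given file is something I would measure rather than assume.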

- What may be the reasons for the increase in output file size and can I do 
anything about it?
I guess your input files use compressed object streams. 2.0.x doesn't support the creation of such streams, so the objects they contain end up uncompressed in the output once the overlay has been added. 3.0.0 creates compressed object streams by default, so the result size should be similar to the input size.
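
To give a feeling for why uncompressed objects can inflate a file that much, here is a small JDK-only sketch. It builds text that resembles many small PDF dictionary objects and compares its raw size with its size after Deflate, the compression typically applied to object streams. The object content, the class name ObjectStreamSizeSketch and its methods are made up for illustration; the sketch does not use PDFBox, and the ratio it prints says nothing precise about your particular file.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration class, not part of PDFBox.
public class ObjectStreamSizeSketch {

    /** Builds text resembling many small PDF dictionary objects (made-up content). */
    public static byte[] fakeObjects(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("<< /Type /Annot /Subtype /Link /Rect [0 0 ")
              .append(i % 600).append(' ').append(i % 800)
              .append("] /Border [0 0 0] >>\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the number of bytes the input occupies after Deflate compression. */
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        // Drain the compressor in chunks and only count the bytes produced.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = fakeObjects(10_000);
        int packed = deflatedSize(raw);
        System.out.printf("uncompressed: %d bytes, deflated: %d bytes%n",
                raw.length, packed);
    }
}
```

Documents with thousands of table cells tend to contain lots of such small, repetitive objects, which compress very well when bundled together and take up correspondingly more space when written out individually.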

Andreas

Thanks!
-mp.



