On 08.02.23 at 15:59, Matthias Pigulla wrote:
Dear PDFBox users,
I am using PDFBox to place overlays on lots of different input files. In
general, this works very well and reliably – thanks to everyone who worked on
that!
However, there is one class of particularly awful input files, like the one at
https://www.g-ba.de/downloads/40-268-7473/2021-03-18_ASV-RL_Anpassung-Appendizes-an-EBM_TrG.pdf.
That’s a beast of more than 50 MB and 2000+ pages, full of complex tables with
lots of cells.
Are you adding that overlay to every single page of that PDF? What is the
purpose of that overlay? Maybe a rubber stamp is a better approach?
When I try to put a single-page PDF as overlay on it with PDFBox 2.0.27, I have
to start the JVM with e.g. 8 GB of heap memory, and it maxes out a CPU core on
my machine for about 6 minutes. The maximum resident set size as reported by
`time` is in the range of 2.4 GB. The result file is about four times the size
of the input file.
With a snapshot build of 3.0.0, the max RSS seems not to go above 1 GB, but
processing had not finished after 15 minutes, when I aborted it. Regarding 3.0.0,
I have seen the remarks at
https://pdfbox.apache.org/3.0/migration.html#reduced-memory-usage, so I thought
it might be worth a try. Probably the overlay will end up traversing all pages
anyway, so that may not make a big difference.
If you are adding the overlay to all pages, the parser more or less has to dig
through the whole PDF.
My questions are:
- Is there anything I can do to make processing of such files faster or more
efficient?
Maybe a rubber stamp is a better approach than an overlay. With regard to
3.0.0, you might check which kind of input source you are using: have a look at
the different implementations of org.apache.pdfbox.io.RandomAccessRead. The
migration guide might give you some additional hints about the usage of the
input source.
- What may be the reasons for the increase in output file size and can I do
anything about it?
I guess your input files are using compressed object streams. 2.0.x doesn't
support the creation of those streams, so they are decompressed when the
overlay is added and their objects end up uncompressed in the output. 3.0.0
creates such compressed object streams by default, so the result size should be
similar to the input size.
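
To give a feeling for the size effect, here is a minimal sketch. It is not
PDFBox code: it only uses java.util.zip to deflate a run of small, similar
dictionaries, which is roughly what the payload of a compressed object stream
looks like. The class name, the method names and the sample dictionaries are
made up for illustration, and the ratio you get for a real file will differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration, not part of PDFBox.
class ObjectStreamSizeSketch {

    // Concatenates `count` small, similar dictionaries (made-up sample data).
    static byte[] buildObjects(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("<< /Type /Annot /Subtype /Link /Rect [")
              .append(i % 500).append(' ').append(i % 700).append(' ')
              .append(i % 500 + 40).append(' ').append(i % 700 + 12)
              .append("] /Border [0 0 0] >>\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Compresses the bytes with the zlib/deflate algorithm.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); wraps the checked exception to keep callers simple.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = buildObjects(20000);
        byte[] packed = deflate(raw);
        System.out.printf("uncompressed: %d bytes, deflated: %d bytes%n",
                raw.length, packed.length);
    }
}
```

Running main prints both sizes. Writing the objects without compression, as
2.0.x does, corresponds to the larger number.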
Andreas
Thanks!
-mp.