Looks like https://issues.apache.org/jira/browse/PDFBOX-5479
On 13.12.23 at 14:50, Tilman Hausherr wrote:
On 13.12.2023 11:23, Brangs, Erik wrote:
Hi,
we ran into problems when doing text extraction from the PDF
at https://d-nb.info/1312454512/34. We were using PDFBox 3.0.0 to extract the
text and the text extraction used up multiple GB of memory. The problem can be
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room
for improvement in text extraction in PDFBox for this case or is this just a
badly generated PDF?
Yeah, it's a weird PDF: it has different font objects that point to
the same font file (see FontFile2). So the font is opened each time and
all tables are read and stored. And since 3.0 we read many more tables
than in 2.0.
Tilman
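
To illustrate the effect described above, here is a minimal sketch in plain Java. It does not use PDFBox and is not how PDFBox handles fonts internally; FontObject, ParsedFont, the id values and both methods are hypothetical names made up for this example. It only shows why several font objects sharing one font file cost more memory when each one is parsed separately, and how keying parsed fonts by the font file would avoid that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FontCacheSketch {

    /** Hypothetical stand-in for a PDF font object that refers to an embedded font file by id. */
    static final class FontObject {
        final String name;
        final int fontFileId;

        FontObject(String name, int fontFileId) {
            this.name = name;
            this.fontFileId = fontFileId;
        }
    }

    /** Hypothetical stand-in for a parsed font; the array mimics the tables kept in memory. */
    static final class ParsedFont {
        final byte[] tables = new byte[1024];
    }

    /** Parses the font file once per font object and returns how many parsed copies are held. */
    static int parsesWithoutCache(List<FontObject> fonts) {
        List<ParsedFont> held = new ArrayList<>();
        for (FontObject font : fonts) {
            held.add(new ParsedFont());
        }
        return held.size();
    }

    /** Parses each distinct font file once, keyed by its id, and returns how many copies are held. */
    static int parsesWithCache(List<FontObject> fonts) {
        Map<Integer, ParsedFont> cache = new HashMap<>();
        for (FontObject font : fonts) {
            cache.computeIfAbsent(font.fontFileId, id -> new ParsedFont());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        // Three font objects share font file 7, one uses font file 9 (made-up ids).
        List<FontObject> fonts = List.of(
                new FontObject("F1", 7),
                new FontObject("F2", 7),
                new FontObject("F3", 7),
                new FontObject("F4", 9));
        System.out.println("parsed copies without cache: " + parsesWithoutCache(fonts));
        System.out.println("parsed copies with cache:    " + parsesWithCache(fonts));
    }
}
```

With many font objects per page and many tables read per font, the uncached count grows with the number of font objects rather than with the number of distinct font files.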