Looks like https://issues.apache.org/jira/browse/PDFBOX-5479

On 13.12.23 at 14:50, Tilman Hausherr wrote:
On 13.12.2023 11:23, Brangs, Erik wrote:
Hi,

we ran into problems when doing text extraction from the PDF 
at https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the 
text and the text extraction used up multiple GB of memory. The problem can be 
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room 
for improvement in text extraction in PDFBox for this case or is this just a 
badly generated PDF?

Yeah, it's a weird PDF: it has different font objects that point to the same font file (see FontFile2). So the font is opened once per font object, and each time all tables are read and stored. And since 3.0 we read many more tables than in 2.0.
Tilman
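
To illustrate the effect described above, here is a minimal sketch in plain Java. It does not use the PDFBox API: `FontFile`, `FontObject`, `ParsedFont` and `Parser` are hypothetical stand-ins invented for this example, and the "parsing" is only a counter. It assumes that several font objects reference the very same font file stream, and it compares parsing once per font object with parsing once per distinct font file through an identity-keyed cache. Whether and how PDFBox itself does this is covered in the linked issue, not here.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SharedFontFileDemo {

    /** Hypothetical stand-in for an embedded font program (a FontFile2 stream). */
    public static final class FontFile {
        final byte[] data;
        public FontFile(byte[] data) { this.data = data; }
    }

    /** Hypothetical stand-in for a PDF font object that points to a font file. */
    public static final class FontObject {
        final String name;
        final FontFile fontFile;
        public FontObject(String name, FontFile fontFile) {
            this.name = name;
            this.fontFile = fontFile;
        }
    }

    /** Hypothetical stand-in for the tables kept in memory after parsing. */
    public static final class ParsedFont {
        final int tableBytes;
        ParsedFont(int tableBytes) { this.tableBytes = tableBytes; }
    }

    /** Counts how often a font file is actually parsed. */
    public static final class Parser {
        private int parseCount = 0;

        ParsedFont parse(FontFile file) {
            parseCount++;
            return new ParsedFont(file.data.length);
        }

        public int getParseCount() { return parseCount; }
    }

    /** Parses the font file once per font object, even if the file is shared. */
    public static void loadWithoutCache(List<FontObject> fonts, Parser parser) {
        for (FontObject font : fonts) {
            parser.parse(font.fontFile);
        }
    }

    /** Parses each distinct font file once, keyed by object identity. */
    public static void loadWithCache(List<FontObject> fonts, Parser parser) {
        Map<FontFile, ParsedFont> cache = new IdentityHashMap<>();
        for (FontObject font : fonts) {
            cache.computeIfAbsent(font.fontFile, parser::parse);
        }
    }

    public static void main(String[] args) {
        FontFile shared = new FontFile(new byte[1024]);
        List<FontObject> fonts = List.of(
                new FontObject("F1", shared),
                new FontObject("F2", shared),
                new FontObject("F3", shared));

        Parser uncached = new Parser();
        loadWithoutCache(fonts, uncached);
        System.out.println("without cache: " + uncached.getParseCount() + " parse(s)");

        Parser cached = new Parser();
        loadWithCache(fonts, cached);
        System.out.println("with cache: " + cached.getParseCount() + " parse(s)");
    }
}
```

The cache is keyed by identity rather than by content because the situation described is several font objects referencing one and the same stream; comparing font bytes would be a different and more expensive technique.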


