Re: Text extraction from a certain PDF uses up multiple GB of memory

Andreas Lehmkühler Thu, 14 Dec 2023 23:07:17 -0800

Looks like https://issues.apache.org/jira/browse/PDFBOX-5479


On 13.12.23 at 14:50, Tilman Hausherr wrote:

On 13.12.2023 11:23, Brangs, Erik wrote:
Hi,

we ran into problems when doing text extraction from the PDF at
https://d-nb.info/1312454512/34. We were using PDFBox 3.0.0 to extract the
text, and the text extraction used up multiple GB of memory. The problem can be
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room
for improvement in text extraction in PDFBox for this case, or is this just a
badly generated PDF?
Yeah, it's a weird PDF: it has different font objects that point to the same
font file (see FontFile2). So the font is opened each time, and all tables are
read and stored. And since 3.0 we read many more tables than in 2.0.
Tilman
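
To illustrate the effect described above, here is a minimal sketch in plain
Java. It does not use PDFBox and is not PDFBox's actual code; the classes
FontFile, FontObject and ParsedFont are hypothetical stand-ins for the embedded
font program (FontFile2), a PDF font object and the parsed font tables. It only
shows why parsing once per font object multiplies memory when many font objects
share one font file, and how keying parsed data by the font file itself would
avoid that, assuming the parsed data can safely be shared.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SharedFontFileSketch {

    /** Hypothetical stand-in for an embedded font program (the FontFile2 stream). */
    static final class FontFile {
        final byte[] data;
        FontFile(byte[] data) { this.data = data; }
    }

    /** Hypothetical stand-in for a PDF font object that references a font file. */
    static final class FontObject {
        final String name;
        final FontFile fontFile;
        FontObject(String name, FontFile fontFile) {
            this.name = name;
            this.fontFile = fontFile;
        }
    }

    /** Hypothetical stand-in for parsed font tables; holds its own copy of the data. */
    static final class ParsedFont {
        final byte[] tables;
        ParsedFont(FontFile file) { this.tables = file.data.clone(); }
    }

    /** Parses the font file again for every font object: one copy per font object. */
    static List<ParsedFont> parsePerFontObject(List<FontObject> fonts) {
        List<ParsedFont> result = new ArrayList<>();
        for (FontObject font : fonts) {
            result.add(new ParsedFont(font.fontFile));
        }
        return result;
    }

    /** Parses each distinct font file once and shares the result between font objects. */
    static List<ParsedFont> parsePerFontFile(List<FontObject> fonts) {
        // Keyed by object identity: two font objects share an entry only if they
        // reference the very same font file instance.
        Map<FontFile, ParsedFont> cache = new IdentityHashMap<>();
        List<ParsedFont> result = new ArrayList<>();
        for (FontObject font : fonts) {
            result.add(cache.computeIfAbsent(font.fontFile, ParsedFont::new));
        }
        return result;
    }

    /** Counts how many distinct ParsedFont instances are held in the list. */
    static int distinctCopies(List<ParsedFont> parsed) {
        Map<ParsedFont, Boolean> seen = new IdentityHashMap<>();
        for (ParsedFont p : parsed) {
            seen.put(p, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        FontFile shared = new FontFile(new byte[1024]);
        List<FontObject> fonts = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            fonts.add(new FontObject("F" + i, shared));
        }
        System.out.println("copies, per font object: " + distinctCopies(parsePerFontObject(fonts)));
        System.out.println("copies, per font file:   " + distinctCopies(parsePerFontFile(fonts)));
    }
}
```

Whether and how PDFBox itself avoids the repeated parsing is a matter for the
linked JIRA issue; the sketch only models the memory behaviour.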


