Text extraction from a certain PDF uses up multiple GB of memory

Brangs, Erik Wed, 13 Dec 2023 02:26:12 -0800

Hi,

We ran into a problem when extracting text from the PDF at
https://d-nb.info/1312454512/34 . With PDFBox 3.0.0, the text extraction used
up multiple GB of memory. The problem can also be reproduced with PDFBox
4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room for improvement in
PDFBox's text extraction for this case, or is this just a badly generated PDF?


-- 
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
https://www.dnb.de


