Hi,

We ran into problems extracting text from the PDF at
https://d-nb.info/1312454512/34 . With PDFBox 3.0.0, the text extraction
used multiple GB of memory. The problem can also be reproduced with PDFBox
4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room for improvement in
PDFBox's text extraction for this case, or is this just a badly generated
PDF?

-- 
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
https://www.dnb.de


