Hi,

we ran into problems extracting text from the following PDF: https://d-nb.info/1312454512/34

We were using PDFBox 3.0.0, and the text extraction used multiple GB of memory. The problem can also be reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT.

Is there room for improvement in PDFBox's text extraction for this case, or is this just a badly generated PDF?
--
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
https://www.dnb.de