Well, that's my problem. It works with PDFBox 2 for reasonably sized files, but it crashes on the big ones. So, reading the migration guide for PDFBox 3.0, I thought I saw some light at the end of the tunnel, as it says I can create my own reader and stream cache. I see that I can provide my own RandomAccessRead when I call Loader.loadPDF, but the loadPDF overload that takes a StreamCacheCreateFunction does not work as promised, because the StreamCacheCreateFunction is not passed from PDFParser to COSParser in the PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess this is a bug?
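A minimal sketch of one way to keep the heap footprint low while the stream cache question is open: spool the InputStream to a temp file in small chunks and hand the file to PDFBox, instead of wrapping the stream in a RandomAccessReadBuffer. The class name PdfSpooler and the method spoolToTempFile are hypothetical, and the PDFBox call appears only in a comment so that the sketch compiles with the JDK alone. If the temp directory is memory-backed, as in the setup quoted below, this only moves the bytes from the Java heap to that volume.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
class PdfSpooler {

    // Copies the stream to a temp file through a small internal buffer,
    // so the whole PDF is never held on the Java heap at once.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdfbox-spool-", ".pdf");
        try {
            // createTempFile already created the file, hence REPLACE_EXISTING.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a partial file behind on a constrained volume.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real upload stream (assumption for the demo).
        byte[] fake = new byte[1 << 20];
        Path tmp = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println("spooled " + Files.size(tmp) + " bytes to " + tmp);
            // With PDFBox 3.x on the classpath, the file could then be opened
            // with, for example:
            //   try (PDDocument doc = Loader.loadPDF(tmp.toFile())) { ... }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The temp file has to outlive the open document, so the cleanup belongs after the PDDocument is closed.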

On Wed, Jan 31, 2024 at 3:46 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> On 31.01.2024 14:48, Lars Juel Jensen wrote:
> > This creates another problem for me. I am running PDFBox in a kubernetes
> > cluster on premises with limited resources. I can not setup persistent
> > volume claims nor ephemeral volumes, and I can not change how my pods are
> > started. I have limited resources and an emptyDir that is mounted on /tmp
> > where the temporary files go. The emptyDir is mapped to a portion of the
> > kubernetes node's memory, and this memory is shared with many other
> > services. All in all - I need to keep a very low memory and tempFile
> > footprint, hence the InputStream. Using RandomAccessReadBuffer with an
> > InputStream loads the entire PDF into memory, and I can encounter PDF
> > documents that can be over 1GB in size. So loading everything into memory
> > is not an option.
>
> You can try to create your own class extending RandomAccessRead.
>
> If your /tmp is mapped on main memory, then it doesn't make sense to use
> a temp file at all, you're just wasting time.
>
> Btw PDFBox 2 was also loading the whole PDF file into memory (or into a
> scratch file) and had an even bigger footprint because it was also
> parsing the complete PDF. So if your project was working with PDFBox 2
> then it should work with PDFBox 3.
>
> Tilman
>
> > On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> On 31.01.2024 09:50, Lars Juel Jensen wrote:
> >>> In PDFBox2 I could do:
> >>>
> >>> PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())
> >>>
> >>> But there is no equivalent to this in PDFBox3. How do I read a PDF from
> >>> an inputstream?
> >>
> >> Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
> >> IOUtils.createTempFileOnlyStreamCache());