Well, that's my problem. It works with PDFBox 2 for reasonably sized files,
but it crashes on the big ones. Reading the migration guide for PDFBox 3.0,
I thought I saw some light at the end of the tunnel, as it says I can
provide my own reader and stream cache. I see that I can pass my own
RandomAccessRead when I call Loader.loadPDF, but the loadPDF overload that
takes a StreamCacheCreateFunction does not work as promised: the
StreamCacheCreateFunction is not passed from PDFParser to COSParser in the
PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
this is a bug?
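
For the memory concern itself, here is a minimal sketch (plain JDK, not
PDFBox code) of copying an InputStream to a file in fixed-size chunks, so
heap use stays at one buffer regardless of the PDF size. It assumes a
writable directory backed by real disk, which may not hold in the setup
described in the quoted message below. StreamSpooler, spoolToFile and the
64 KiB chunk size are made-up names and values for illustration only. The
resulting file could then be opened with Loader.loadPDF(File) instead of
wrapping the stream in a RandomAccessReadBuffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class StreamSpooler {
    // Assumed chunk size; only this many bytes are held on the heap at once.
    private static final int CHUNK_SIZE = 64 * 1024;

    private StreamSpooler() {
    }

    // Copies the stream into a new temp file inside dir and returns its path.
    // The caller is responsible for deleting the file when done with it.
    static Path spoolToFile(InputStream in, Path dir) throws IOException {
        Path target = Files.createTempFile(dir, "pdf-spool-", ".pdf");
        try (OutputStream out = Files.newOutputStream(target)) {
            byte[] chunk = new byte[CHUNK_SIZE];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(target);
            throw e;
        }
        return target;
    }
}
```

If the directory is itself memory-backed, as with the emptyDir on /tmp
mentioned below, this only moves the bytes from heap to node memory and
saves nothing.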

On Wed, Jan 31, 2024 at 3:46 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

> On 31.01.2024 14:48, Lars Juel Jensen wrote:
> > This creates another problem for me. I am running PDFBox in an
> > on-premises Kubernetes cluster with limited resources. I cannot set up
> > persistent volume claims or ephemeral volumes, and I cannot change how my
> > pods are started. All I have is an emptyDir mounted on /tmp, where the
> > temporary files go. The emptyDir is mapped to a portion of the
> > Kubernetes node's memory, and this memory is shared with many other
> > services. All in all, I need to keep a very low memory and temp-file
> > footprint, hence the InputStream. Using RandomAccessReadBuffer with an
> > InputStream loads the entire PDF into memory, and I can encounter PDF
> > documents that are over 1 GB in size. So loading everything into memory
> > is not an option.
>
> You can try to create your own class extending RandomAccessRead.
>
> If your /tmp is mapped to main memory, then it doesn't make sense to use
> a temp file at all; you're just wasting time.
>
> Btw PDFBox 2 was also loading the whole PDF file into memory (or into a
> scratch file) and had an even bigger footprint because it was also
> parsing the complete PDF. So if your project was working with PDFBox 2
> then it should work with PDFBox 3.
>
> Tilman
>
>
>
> >
> > On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> On 31.01.2024 09:50, Lars Juel Jensen wrote:
> >>> In PDFBox2 I could do:
> >>>
> >>> PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())
> >>>
> >>> But there is no equivalent to this in PDFBox3. How do I read a PDF
> >>> from an InputStream?
> >>>
> >> Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
> >> IOUtils.createTempFileOnlyStreamCache());
> >>
