I don't have very specific advice, but mmap() and programmer control don't
go together. The point of mmap is to defer all the I/O logic to the OS and
trust that it knows better.

If you're calling read_all(), it will do what the name says: read all the
batches. Have you tried looping and getting batches one by one as you
process them?

--
Felipe


On Tue, Jan 21, 2025 at 1:45 PM Sharvil Nanavati <shar...@lmnt.com> wrote:

> I'm loading a large number of large Arrow IPC streams/files from disk with
> mmap. I'd like to demand-load the contents instead of prefetching them – or
> at least have better control over disk IO.
>
> Calling `read_all` on a stream triggers a complete read of the file
> (`MADV_WILLNEED` over the entire byte range of the file) whereas `read_all`
> on a file seems to trigger a complete read through page faults. I'm not
> fully confident in the latter behavior.
>
> Is there a way I can disable prefetching in the stream case or configure
> Arrow to demand-load Tables? I'd like to get a reference to a Table without
> triggering disk reads except for the schema + magic bytes + metadata.
>
> -s
> *Builder @ LMNT*
> Web <https://www.lmnt.com> | LinkedIn
> <https://www.linkedin.com/in/sharvil-nanavati/>
>
>
