In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place.
In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until the memory is accessed. I need random access to the table and my accesses are often sparse relative to the size of the entire table. Obviously there has to be *some* IO to read the schema and offsets, but that's tiny relative to the data itself.

Is there any way to get a Table instance without triggering large data reads of the Arrow file?

-s
*Builder @ LMNT*
Web <https://www.lmnt.com> | LinkedIn <https://www.linkedin.com/in/sharvil-nanavati/>

On Wed, Jan 22, 2025 at 5:56 AM Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> I don't have very specific advice, but mmap() and programmer control don't come together. The point of mmap is deferring all the logic to the OS and trusting that it knows better.
>
> If you're calling read_all(), it will do what the name says: read all the batches. Have you tried looping and getting batches one by one as you process them?
>
> --
> Felipe
>
> On Tue, Jan 21, 2025 at 1:45 PM Sharvil Nanavati <shar...@lmnt.com> wrote:
>
>> I'm loading a large number of large Arrow IPC streams/files from disk with mmap. I'd like to demand-load the contents instead of prefetching them – or at least have better control over disk IO.
>>
>> Calling `read_all` on a stream triggers a complete read of the file (`MADV_WILLNEED` over the entire byte range of the file) whereas `read_all` on a file seems to trigger a complete read through page faults. I'm not fully confident in the latter behavior.
>>
>> Is there a way I can disable prefetching in the stream case or configure Arrow to demand-load Tables? I'd like to get a reference to a Table without triggering disk reads except for the schema + magic bytes + metadata.
>>
>> -s
>> *Builder @ LMNT*
>> Web <https://www.lmnt.com> | LinkedIn <https://www.linkedin.com/in/sharvil-nanavati/>