In a different context, fetching batches one-by-one would be a good way to
control when the disk read takes place.
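
For reference, the batch-by-batch loop over a stream is easy enough. Here's
roughly what I mean (a sketch; `path` and `process` are placeholders for my
actual file path and processing code):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Open the IPC stream over a memory map; only the schema is read up front.
source = pa.memory_map(path, "r")    # `path` is a placeholder
reader = ipc.open_stream(source)

for batch in reader:
    # Each iteration materializes one RecordBatch, so the disk reads
    # happen here, under the loop's control.
    process(batch)                   # `process` is a placeholder
```

That gives fine-grained control when the access pattern is sequential, which
isn't quite my situation.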

In my context, I'm looking for a way to construct a Table that defers
the bulk of the IO until the underlying memory is actually accessed. I
need random access to the table and my accesses are often sparse relative
to the size of the entire table. Obviously there has to be *some* IO to
read the schema and offsets, but that's tiny relative to the data itself.

Is there any way to get a Table instance without triggering large data
reads of the Arrow file?
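
To make the question concrete, here's roughly what I'm experimenting with
against the file (random-access) format. This is a sketch, with `path` and
`some_index` as placeholders:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Opening a file reader over a memory map should only need the footer,
# schema, and batch offsets, not the batch bodies.
source = pa.memory_map(path, "r")        # `path` is a placeholder
reader = ipc.open_file(source)

schema = reader.schema                   # available without bulk IO
n = reader.num_record_batches

# Individual batches can be pulled on demand like this...
batch = reader.get_batch(some_index)     # `some_index` is a placeholder

# ...and this is the step whose IO behavior I'm unsure about: does
# building a Table defer the reads, or touch every page eagerly?
table = pa.Table.from_batches(
    [reader.get_batch(i) for i in range(n)], schema=schema
)
```

Whether that last step genuinely defers the bulk IO (with pages faulting in
only on access) or effectively reads the whole file is exactly the part I
can't pin down.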

-s
*Builder @ LMNT*
Web <https://www.lmnt.com> | LinkedIn
<https://www.linkedin.com/in/sharvil-nanavati/>



On Wed, Jan 22, 2025 at 5:56 AM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> I don't have very specific advice, but mmap() and programmer control don't
> come together. The point of mmap is deferring all the logic to the OS and
> trusting that it knows better.
>
> If you're calling read_all(), it will do what the name says: read all the
> batches. Have you tried looping and getting batches one by one as you
> process them?
>
> --
> Felipe
>
>
> On Tue, Jan 21, 2025 at 1:45 PM Sharvil Nanavati <shar...@lmnt.com> wrote:
>
>> I'm loading a large number of large Arrow IPC streams/files from disk
>> with mmap. I'd like to demand-load the contents instead of prefetching them
>> – or at least have better control over disk IO.
>>
>> Calling `read_all` on a stream triggers a complete read of the file
>> (`MADV_WILLNEED` over the entire byte range of the file) whereas `read_all`
>> on a file seems to trigger a complete read through page faults. I'm not
>> fully confident in the latter behavior.
>>
>> Is there a way I can disable prefetching in the stream case or configure
>> Arrow to demand-load Tables? I'd like to get a reference to a Table without
>> triggering disk reads except for the schema + magic bytes + metadata.
>>
>> -s
>> *Builder @ LMNT*
>> Web <https://www.lmnt.com> | LinkedIn
>> <https://www.linkedin.com/in/sharvil-nanavati/>
>>
>>
