Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> Then you should just use a memory-mapped file.
Unless I'm misunderstanding their original message, I believe they are using a memory-mapped file. I'm not sure if other suggestions helped address the issue, but my understanding was that they were somehow triggering reads against the who

Re: Demand-loading Arrow files

2025-01-28 Thread Weston Pace
I believe the concern is that reading a record batch from a RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the OS before any data is accessed (and regardless of whether or not that data is ever accessed). I'm pretty sure the `RecordBatchStreamReader` uses `MemoryMappedFile
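
For readers who want to see the distinction outside of Arrow, here is a minimal sketch using only Python's standard `mmap` module. It is not Arrow's implementation; the helper name `read_range` and the file contents are made up for illustration. It assumes Python 3.8+ on a platform that exposes `MADV_WILLNEED`, and it skips the hint where that is unavailable. Mapping the file reads nothing; the optional `madvise` call only asks the kernel to start paging the range in; the slice is what actually touches the pages.

```python
import mmap
import os
import tempfile


def read_range(path, position, nbytes, advise=False):
    """Return bytes [position, position + nbytes) of a memory-mapped file.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        if advise and hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
            # madvise wants a page-aligned start, so round down to a page boundary.
            start = position - (position % mmap.PAGESIZE)
            # Hint only: the kernel may begin reading this range ahead of any access.
            mm.madvise(mmap.MADV_WILLNEED, start, position + nbytes - start)
        # Slicing copies the bytes out, which is the first real access to these pages.
        return mm[position:position + nbytes]


# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)  # 16 KiB of repeating bytes
    path = tmp.name

try:
    lazy = read_range(path, 5000, 16)                 # no advice: pages fault in on access
    hinted = read_range(path, 5000, 16, advise=True)  # advice sent before the access
finally:
    os.unlink(path)
```

Both calls return the same bytes. The only difference is whether the kernel was told to prefetch the range before anything touched it, which is the behavior under discussion here.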

Re: Demand-loading Arrow files

2025-01-28 Thread Antoine Pitrou
On Sun, 26 Jan 2025 10:48:48 -0800 Sharvil Nanavati wrote:
> In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until

Re: Demand-loading Arrow files

2025-01-28 Thread Weston Pace
> Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library
The idea (and I believe things worked this way at one point) was that you could memory map a file, read in a bunch of record batches (even an entire table if you wa

Re: Demand-loading Arrow files

2025-01-28 Thread Sharvil Nanavati
Thanks for the discussion, folks. I think the key takeaway for me is that my access pattern / use case isn't directly supported by Arrow today, but there's no technical reason it can't be. Would there be any opposition to me expanding the API surface to support a zero-data-read-by-default implemen

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> ...and that function triggers the MADV_WILLNEED
The code you linked specifies a memory region and the proceeding `nbytes`: ``` RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed( {{memory_map_->data() + position, static_cast<size_t>(nbytes)}})); return memory_map_->Slice(position, nbytes)

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
I see, I was incorrectly conflating the pointer math and when a page fault is actually generated. Thanks for clarifying! Without knowing Sharvil's actual interactions with the Table, I'm still not convinced a table method wouldn't trigger the scan anyways, but I suppose that's more of a pessimi

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> That's exactly what I'm looking for and assumed was the default behavior in an mmap world. I was surprised to find it wasn't, and the assumption of dense sequential access is baked in.
Just to clarify, Weston pointed out where my understanding (as far as mmap and page faults) was incorrect.