Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> That's exactly what I'm looking for and assumed was the default behavior in an mmap world. I was surprised to find it wasn't, and the assumption of dense sequential access is baked in.

Just to clarify, Weston pointed out where my understanding (as far as mmap and page faults) was incorrect.

Re: Demand-loading Arrow files

2025-01-28 Thread Sharvil Nanavati
Thanks for the discussion, folks. I think the key takeaway for me is that my access pattern / use case isn't directly supported by Arrow today, but there's no technical reason it can't be. Would there be any opposition to me expanding the API surface to support a zero-data-read-by-default implemen…

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
I see, I was incorrectly conflating the pointer math with when a page fault is actually generated. Thanks for clarifying! Without knowing Sharvil's actual interactions with the Table, I'm still not convinced a table method wouldn't trigger the scan anyway, but I suppose that's more of a pessimi…
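The distinction being cleared up here can be sketched with nothing but the standard library (no Arrow involved). This is a minimal, hypothetical demo: building a memory map and computing an offset into it is pure pointer math and reads nothing from disk; only dereferencing the mapped bytes generates the page fault that performs the IO.

```python
import mmap
import os
import tempfile

# Hypothetical demo file; any mmap-able file works. Assumes a POSIX system.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"A" * (mmap.PAGESIZE * 4))  # 4 pages of data

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Setting up the mapping and computing an offset is pointer math only:
    # no page has been faulted in yet.
    offset = 2 * mmap.PAGESIZE
    # Dereferencing is what actually generates the page fault and the read:
    byte = mm[offset]
    print(byte)  # 65, i.e. ord("A")
    mm.close()
```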

Re: Demand-loading Arrow files

2025-01-28 Thread Weston Pace
> Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library

The idea (and I believe things worked this way at one point) was that you could memory map a file, read in a bunch of record batches (even an entire table if you wa…

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> ...and that function triggers the MADV_WILLNEED

The code you linked specifies a memory region and the following `nbytes`:

```
RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
    {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
return memory_map_->Slice(position, nbytes);
```

Re: Demand-loading Arrow files

2025-01-28 Thread Weston Pace
I believe the concern is that reading a record batch from a RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the OS before any data is accessed (and regardless of whether or not that data is ever accessed). I'm pretty sure the `RecordBatchStreamReader` uses `MemoryMappedFile`…
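The advise-then-slice pattern being discussed can be sketched with the standard library's `mmap.madvise` (Python 3.8+, where the OS supports `madvise(2)`). This is a hedged illustration, not Arrow code: it advises only the requested page-aligned subrange of the mapping, which is the analogue of Arrow advising `[position, position + nbytes)` before returning the slice, rather than advising the whole file.

```python
import mmap
import os
import tempfile

# Hypothetical file; assumes a POSIX system with madvise support.
path = os.path.join(tempfile.mkdtemp(), "advise.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (mmap.PAGESIZE * 8))  # 8 pages

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Advise only the slice we intend to touch, not the whole mapping.
    # The start offset passed to madvise must be page-aligned.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED, 2 * mmap.PAGESIZE, 3 * mmap.PAGESIZE)
    # The actual access; pages outside the advised range are never read here.
    byte = mm[2 * mmap.PAGESIZE]
    mm.close()
```

Note that the advice is sent before (and independently of) any access, which is exactly the behavior being questioned in the thread.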

Re: Demand-loading Arrow files

2025-01-28 Thread Aldrin
> Then you should just use a memory-mapped file.

Unless I'm misunderstanding their original message, I believe they are using a memory-mapped file. I'm not sure if other suggestions helped address the issue, but my understanding was that they were somehow triggering reads against the who…

Re: Demand-loading Arrow files

2025-01-28 Thread Antoine Pitrou
On Sun, 26 Jan 2025 10:48:48 -0800, Sharvil Nanavati wrote:
> In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until…

Re: Demand-loading Arrow files

2025-01-26 Thread Dewey Dunnington
I am wondering if the validation step (where Arrow checks for corrupted data) is causing the full disk IO here. I am almost positive it's turned on by default (to prevent a crash when consuming untrusted input), but I *think* there is an option to turn it off if you are processing trusted input. I…

Re: Demand-loading Arrow files

2025-01-26 Thread Sharvil Nanavati
In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place. In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until the memory is accessed. I need random access to the table and my acces…

Re: Demand-loading Arrow files

2025-01-22 Thread Felipe Oliveira Carvalho
I don't have very specific advice, but mmap() and programmer control don't go together. The point of mmap is deferring all the logic to the OS and trusting that it knows better. If you're calling read_all(), it will do what the name says: read all the batches. Have you tried looping and getting…