> Then you should just use a memory-mapped file.
Unless I'm misunderstanding their original message, I believe they are using
a memory-mapped file. I'm not sure if other suggestions helped address the
issue, but my understanding was that they were somehow triggering reads against
the who
I believe the concern is that reading a record batch from a
RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the
OS before any data is accessed (and regardless of whether or not that data
is ever accessed).
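
For anyone following along, here is a minimal sketch of what sending that advice looks like at the POSIX level. It is not Arrow's implementation: `advise_will_need` is a hypothetical helper, and the sketch assumes a POSIX system where `madvise` requires a page-aligned start address.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Arrow): hint to the kernel that the range
// [addr, addr + len) will be needed soon. The call only passes advice; this
// function itself never dereferences `addr`.
// Returns 0 on success and -1 on failure (errno is set by madvise).
int advise_will_need(const void* addr, std::size_t len) {
  if (len == 0) return 0;  // nothing to advise
  const auto page = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
  assert(page != 0 && (page & (page - 1)) == 0);  // page size is a power of two
  const auto start = reinterpret_cast<std::uintptr_t>(addr);
  // Widen the range down to the enclosing page boundary.
  const std::uintptr_t aligned = start & ~(page - 1);
  const std::size_t span = static_cast<std::size_t>(start - aligned) + len;
  return madvise(reinterpret_cast<void*>(aligned), span, MADV_WILLNEED);
}
```

Whether the kernel actually starts readahead in response is up to the OS; the advice is a hint, not a synchronous read.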
I'm pretty sure the `RecordBatchStreamReader` uses
`MemoryMappedFile
On Sun, 26 Jan 2025 10:48:48 -0800
Sharvil Nanavati wrote:
> In a different context, fetching batches one-by-one would be a good way to
> control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without
> performing the bulk of the IO operations until
> Sharvil wants random access to only a few RecordBatches via Table
> methods, but I don't think that's possible with the Arrow library
The idea (and I believe things worked this way at one point) was that you
could memory map a file, read in a bunch of record batches (even an entire
table if you wa
Thanks for the discussion, folks. I think the key takeaway for me is that
my access pattern / use case isn't directly supported by Arrow today, but
there's no technical reason it can't be.
Would there be any opposition to me expanding the API surface to support a
zero-data-read-by-default implemen
> ...and that function triggers the MADV_WILLNEED
The code you linked specifies a memory region: a start address and the `nbytes` following it:
```
RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
    {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
return memory_map_->Slice(position, nbytes);
```
I see, I was incorrectly conflating the pointer math and when a page fault is
actually generated. Thanks for clarifying!
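
To make that distinction concrete, here is a rough sketch, assuming a POSIX system. `MappedSlice`, `map_file`, and `slice` are hypothetical names for illustration and are not Arrow classes or functions. Mapping the file and slicing it are only bookkeeping and pointer arithmetic; a page fault can only occur once something dereferences the slice's pointer.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical view type for this sketch; it is not an Arrow class.
struct MappedSlice {
  const unsigned char* data;
  std::size_t size;
};

// Map a whole file read-only. Returns nullptr on failure (or an empty file).
const unsigned char* map_file(const char* path, std::size_t* size_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    std::perror("open");
    return nullptr;
  }
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) {
    close(fd);
    return nullptr;
  }
  const auto size = static_cast<std::size_t>(st.st_size);
  void* base = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (base == MAP_FAILED) {
    std::perror("mmap");
    return nullptr;
  }
  *size_out = size;
  return static_cast<const unsigned char*>(base);
}

// Pure pointer arithmetic: no byte of the mapping is read here.
MappedSlice slice(const unsigned char* base, std::size_t size,
                  std::size_t offset, std::size_t nbytes) {
  assert(offset <= size && nbytes <= size - offset);
  return MappedSlice{base + offset, nbytes};
}
```

With this shape, a caller could build slices for every region of the file up front and still only pay for the pages it later reads, provided nothing else (such as a will-need hint) asks the OS to bring them in earlier.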
Without knowing Sharvil's actual interactions with the Table, I'm still not
convinced a table method wouldn't trigger the scan anyways, but I suppose
that's more of a pessimi
> That's exactly what I'm looking for and assumed was the default behavior in
> an mmap world. I was surprised to find it wasn't, and the assumption of dense
> sequential access is baked in.
Just to clarify, Weston pointed out where my understanding of mmap and
page faults was incorrect.