> That's exactly what I'm looking for and assumed was the default behavior in
> an mmap world. I was surprised to find it wasn't, and the assumption of dense
> sequential access is baked in.
Just to clarify, Weston pointed out where my understanding of mmap and
page faults was incorrect.
Thanks for the discussion, folks. I think the key takeaway for me is that
my access pattern / use case isn't directly supported by Arrow today, but
there's no technical reason it can't be.
Would there be any opposition to me expanding the API surface to support a
zero-data-read-by-default implemen
I see, I was incorrectly conflating the pointer math and when a page fault is
actually generated. Thanks for clarifying!
Without knowing Sharvil's actual interactions with the Table, I'm still not
convinced a table method wouldn't trigger the scan anyways, but I suppose
that's more of a pessimi
> Sharvil wants random access to only a few RecordBatches via Table
> methods, but I don't think that's possible with the Arrow library
The idea (and I believe things worked this way at one point) was that you
could memory map a file, read in a bunch of record batches (even an entire
table if you wa
> ...and that function triggers the MADV_WILLNEED
The code you linked specifies a memory region and the following `nbytes`:
```
RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
    {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
return memory_map_->Slice(position, nbytes);
```
I believe the concern is that reading a record batch from a
RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the
OS before any data is accessed (and regardless of whether or not that data
is ever accessed).
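To make the concern concrete, here is a minimal POSIX-only sketch of the two behaviors being discussed: returning a slice of a mapping with no hint to the OS, versus sending `MADV_WILLNEED` for the range first. This is not Arrow's implementation. `Mapping`, `MapFile`, `Slice`, `Unmap`, and the `advise` flag are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type (not an Arrow class): a read-only file mapping.
struct Mapping {
  const uint8_t* data = nullptr;
  size_t size = 0;
};

// Map the whole file read-only. Returns an empty Mapping on failure.
// mmap() itself reads no file data; pages are loaded on first access.
inline Mapping MapFile(const char* path) {
  Mapping m;
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) return m;
  struct stat st;
  if (::fstat(fd, &st) == 0 && st.st_size > 0) {
    void* p = ::mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ,
                     MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
      m.data = static_cast<const uint8_t*>(p);
      m.size = static_cast<size_t>(st.st_size);
    }
  }
  ::close(fd);  // the mapping remains valid after the descriptor is closed
  return m;
}

// Return a pointer to [position, position + nbytes) inside the mapping, or
// nullptr if the range is out of bounds. Computing the pointer is plain
// address arithmetic and touches no page. With advise == true, the kernel is
// additionally told (MADV_WILLNEED) that the range will be needed soon, so it
// may start reading it in even if the caller never dereferences the pointer.
inline const uint8_t* Slice(const Mapping& m, size_t position, size_t nbytes,
                            bool advise) {
  if (m.data == nullptr || position > m.size || nbytes > m.size - position) {
    return nullptr;
  }
  if (advise && nbytes > 0) {
    const long page_l = ::sysconf(_SC_PAGESIZE);
    assert(page_l > 0);
    const size_t page = static_cast<size_t>(page_l);
    // madvise() needs a page-aligned start address, so round down.
    const size_t start = position - (position % page);
    // The call is only a hint; its result is deliberately ignored here.
    (void)::madvise(const_cast<uint8_t*>(m.data) + start,
                    (position - start) + nbytes, MADV_WILLNEED);
  }
  return m.data + position;
}

// Release the mapping and reset the handle.
inline void Unmap(Mapping& m) {
  if (m.data != nullptr) {
    ::munmap(const_cast<uint8_t*>(m.data), m.size);
  }
  m = Mapping{};
}
```

Both calls return the same address. The only difference is whether the OS is asked to page the range in ahead of time, which is the part a "zero read by default" mode would presumably want to skip.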
I'm pretty sure the `RecordBatchStreamReader` uses
`MemoryMappedFile
> Then you should just use a memory-mapped file.
Unless I'm misunderstanding their original message, I believe they are using
a memory-mapped file. I'm not sure if other suggestions helped address the
issue, but my understanding was that they were somehow triggering reads against
the who
On Sun, 26 Jan 2025 10:48:48 -0800
Sharvil Nanavati wrote:
> In a different context, fetching batches one-by-one would be a good way to
> control when the disk read takes place.
>
> In my context, I'm looking for a way to construct a Table without
> performing the bulk of the IO operations until
I am wondering if the validation step (where Arrow checks for corrupted
data) is causing the full disk IO here. I am almost positive it's turned on
by default (to prevent a crash when consuming untrusted input), but I
*think* there is an option to turn it off if you are processing trusted
input. I
In a different context, fetching batches one-by-one would be a good way to
control when the disk read takes place.
In my context, I'm looking for a way to construct a Table without
performing the bulk of the IO operations until the memory is accessed. I
need random access to the table and my acces
I don't have very specific advice, but mmap() and programmer control don't
come together. The point of mmap is deferring all the logic to the OS and
trusting that it knows better.
If you're calling read_all(), it will do what the name says: read all the
batches. Have you tried looping and getting