> Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library
The idea (and I believe things worked this way at one point) was that you could memory map a file, read in a bunch of record batches (even an entire table if you want), and you would just have a collection of pointers into the memory mapped file without ever actually loading any of the data into memory. Then, when the data is needed (e.g. when a user calls `table.column(0).chunk(0).value(0)`), the pointers would be dereferenced and, through the magic of memory mapping, the data would be loaded on demand. This loading on demand tends to be inefficient and _not_ what most IPC users are looking for (they just want to read an IPC file and expect they will be accessing the entire file), so I understand why the MADV_WILLNEED is there. However, for users that do want this, I'm not sure there is any way to achieve the load-on-demand semantics.

> The code you linked specifies a memory region and the following nbytes:

Yes, the actual implementation does call ReadAt with a memory region and nbytes. These are then used to create a slice into the underlying memory mapped area. If MADV_WILLNEED were _not_ called, this would be a zero-copy / zero-load operation that doesn't actually load anything from disk (it's just doing pointer math).
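For concreteness, here is a minimal sketch of the read path under discussion, assuming Arrow's C++ IPC file reader (used here instead of the stream reader since Sharvil wants random access; both go through ReadAt on a memory-mapped file). The function name, path handling, and error handling are illustrative only:

```
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadOneBatch(const std::string& path) {
  // Opening the memory map loads no data; it only sets up the mapping.
  ARROW_ASSIGN_OR_RAISE(
      auto mmap_file,
      arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));

  // Opening the reader touches only the footer/schema metadata.
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(mmap_file));

  // This goes through MemoryMappedFile::ReadAt, which issues MADV_WILLNEED
  // for the batch's byte range before returning the zero-copy slice, so the
  // OS starts paging the data in even if no value is ever accessed.
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
  return arrow::Status::OK();
}
```

Under the semantics described above, dropping the WILLNEED hint would make ReadRecordBatch pure pointer math, with pages faulted in only on first access.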
On Tue, Jan 28, 2025 at 2:56 PM Aldrin <octalene....@pm.me> wrote:

> > ...and that function triggers the MADV_WILLNEED
>
> The code you linked specifies a memory region and the following nbytes:
>
> ```
> RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
>     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
> return memory_map_->Slice(position, nbytes);
> ```
>
> The original question said "Calling `read_all` on a stream triggers a complete read of the file". So, my impression is that read_all (I'm assuming via python) is either purposely specifying the whole file, or eventually (through multiple calls) specifying the whole file. I am curious how large the file itself is, though I assume it's larger than whatever size nbytes is defaulted to.
>
> But I also can't find which implementation of MemoryMap::Slice [1] is resolved by memory_map_->Slice(position, nbytes), which I don't think is likely to be problematic but I can't totally rule out either.
>
> Either way, if I understand correctly, Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library; the only ways are to manage accesses at the RecordBatch level, or maybe to use the Dataset or Acero APIs. Or am I forgetting something... or maybe I'm misunderstanding why Sharvil wants to specifically construct a Table rather than RecordBatches?
>
> [1]: https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Tuesday, January 28th, 2025 at 13:45, Weston Pace <weston.p...@gmail.com> wrote:
>
> > I believe the concern is that reading a record batch from a RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the OS before any data is accessed (and regardless of whether or not that data is ever accessed).
> >
> > I'm pretty sure the `RecordBatchStreamReader` uses `MemoryMappedFile::ReadAt` and that function triggers the MADV_WILLNEED [1]. This is contrary to the user expectation that only the data actually accessed would be loaded into memory.
> >
> > [1] https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
> >
> > On Tue, Jan 28, 2025 at 7:15 AM Aldrin <octalene....@pm.me> wrote:
> >
> > > > Then you should just use a memory-mapped file.
> > >
> > > Unless I'm misunderstanding their original message, I believe they are using a memory-mapped file. I'm not sure if other suggestions helped address the issue, but my understanding was that they were somehow triggering reads against the whole file anyway.
> > >
> > > I'm not sure why a Table is necessary (presumably some useful method in the API?) if accesses are sparse relative to the entire table; that sounds more aligned with RecordBatch access. I would think that any use of a Table method is going to trigger reads of every batch. I would also think that this scenario has two opportunities to do processing without triggering a scan of the whole file:
> > > 1. when a RecordBatch is read into memory
> > > 2. on the RecordBatches accumulated so far (a Table instance can be constructed from them without copies, I am pretty sure)
> > >
> > > I have little experience with mmap, so I don't have any particular thoughts there. Some extra information about how random access into the table occurs would be helpful, though.
> > >
> > > Sent from Proton Mail <https://proton.me/mail/home> for iOS
> > >
> > > On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > > On Sun, 26 Jan 2025 10:48:48 -0800, Sharvil Nanavati <shar...@lmnt.com> wrote:
> > > > > In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place.
> > > > >
> > > > > In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until the memory is accessed. I need random access to the table and my accesses are often sparse relative to the size of the entire table. Obviously there has to be *some* IO to read the schema and offsets, but that's tiny relative to the data itself.
> > > >
> > > > Then you should just use a memory-mapped file.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
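Regarding Aldrin's note above that a Table can be constructed from accumulated RecordBatches without copies: a minimal sketch of that approach, reading only the batches of interest and wrapping them (the helper name and index-selection logic are hypothetical; all batches must share the same schema):

```
#include <memory>
#include <vector>

#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/table.h>

// Read only the batches of interest and wrap them in a Table.
// Table::FromRecordBatches does not copy buffer data; the Table simply
// references the batches' underlying (possibly memory-mapped) memory.
arrow::Result<std::shared_ptr<arrow::Table>> TableFromSelectedBatches(
    const std::shared_ptr<arrow::ipc::RecordBatchFileReader>& reader,
    const std::vector<int>& batch_indices) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  for (int i : batch_indices) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    batches.push_back(std::move(batch));
  }
  return arrow::Table::FromRecordBatches(batches);
}
```

Since the batches are zero-copy views into the mapped file, the resulting Table is too; whether their pages have already been read in depends on the MADV_WILLNEED behavior discussed above.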