Re: Demand-loading Arrow files

Aldrin Tue, 28 Jan 2025 14:56:34 -0800

> ...and that function triggers the MADV_WILLNEED


The code you linked specifies a memory region: a start address and the `nbytes` that follow it:
```
RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
    {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
return memory_map_->Slice(position, nbytes);
```


The original question said "Calling `read_all` on a stream triggers a complete 
read of the file". So my impression is that `read_all` (via Python, I'm 
assuming) either deliberately specifies the whole file or ends up covering the 
whole file through multiple calls. I am curious how large the file itself is, 
though I assume it's larger than whatever size `nbytes` defaults to.
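
To make the distinction between advising a slice and advising the whole mapping concrete, here is a minimal sketch using only Python's standard `mmap` module (not Arrow). The helper name `advise_willneed` and the scratch file are hypothetical, and the sketch assumes a platform where `mmap.MADV_WILLNEED` is available (it returns `None` otherwise). It only shows which byte range ends up being advised; it does not claim to reproduce what Arrow does internally.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def advise_willneed(mm, position, nbytes):
    """Advise the OS that [position, position + nbytes) will be needed soon.

    Returns the page-aligned (start, length) passed to madvise, or None if
    this platform does not expose MADV_WILLNEED (hypothetical helper).
    """
    if not hasattr(mmap, "MADV_WILLNEED"):
        return None
    # madvise needs a page-aligned start, so round down and widen the length.
    start = (position // PAGE) * PAGE
    length = min(position + nbytes, len(mm)) - start
    mm.madvise(mmap.MADV_WILLNEED, start, length)
    return start, length


# Scratch file of 8 pages standing in for a memory-mapped Arrow file.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * (PAGE * 8))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # A small read: only the pages covering the slice are advised.
        region = advise_willneed(mm, PAGE * 2 + 100, 1000)
        # A read spanning the file: the whole mapping is advised.
        whole = advise_willneed(mm, 0, len(mm))
```

If `read_all` ends up requesting ranges that together cover the file, the effect would resemble the second call even though each individual call only names `nbytes`.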

I also can't find which implementation of `MemoryMap::Slice` [1] 
`memory_map_->Slice(position, nbytes)` resolves to. I don't think it is likely 
to be the problem, but I can't totally rule it out either.

Either way, if I understand correctly, Sharvil wants random access to only a 
few RecordBatches via Table methods, but I don't think that's possible with the 
Arrow library; the only ways are to manage accesses at the RecordBatch level, 
or maybe using the Dataset or Acero APIs. Or am I forgetting something... or 
maybe I'm misunderstanding why Sharvil wants to specifically construct a Table 
rather than RecordBatches?


[1]: https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217





# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Tuesday, January 28th, 2025 at 13:45, Weston Pace <weston.p...@gmail.com> 
wrote:

> I believe the concern is that reading a record batch from a
> RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the
> OS before any data is accessed (and regardless of whether or not that data is
> ever accessed).
>
> I'm pretty sure the `RecordBatchStreamReader` uses `MemoryMappedFile::ReadAt`
> and that function triggers the MADV_WILLNEED[1]. This is contrary to the user
> expectation that only the data actually accessed would be loaded into memory.
>
> [1] https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
>

> On Tue, Jan 28, 2025 at 7:15 AM Aldrin <octalene....@pm.me> wrote:
>
> > > Then you should just use a memory-mapped file.
> >
> > Unless I'm misunderstanding their original message, I believe they are
> > using a memory-mapped file. I'm not sure if other suggestions helped
> > address the issue, but my understanding was that they were somehow
> > triggering reads against the whole file anyways.
> >
> > I'm not sure why a Table is necessary (presumably some useful method in the
> > API?) if accesses are sparse relative to the entire table; that sounds more
> > aligned to RecordBatch access. I would think that any use of a Table method
> > is going to trigger reads to every batch. I would also think that this
> > scenario has 2 opportunities to do processing without triggering a scan of
> > the whole file:
> > 1. when a RecordBatch is read into memory
> > 2. on the RecordBatches accumulated so far (a Table instance can be
> > constructed from them without copies, I am pretty sure)
> >
> > I have little experience with mmap, so I don't have any particular thoughts
> > there. Some extra information about how random access into the table occurs
> > would be helpful, though.
> >

> > On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou <anto...@python.org> wrote:
> >
> > > On Sun, 26 Jan 2025 10:48:48 -0800
> > > Sharvil Nanavati <shar...@lmnt.com> wrote:
> > > > In a different context, fetching batches one-by-one would be a good way to
> > > > control when the disk read takes place.
> > > >
> > > > In my context, I'm looking for a way to construct a Table without
> > > > performing the bulk of the IO operations until the memory is accessed. I
> > > > need random access to the table and my accesses are often sparse relative
> > > > to the size of the entire table. Obviously there has to be *some* IO to
> > > > read the schema and offsets, but that's tiny relative to the data itself.
> > >
> > > Then you should just use a memory-mapped file.
> > >
> > > Regards
> > >
> > > Antoine.