> Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library
The idea (and I believe things worked this way at one point) was that you could memory map a file, read in a bunch of record batches (even an entire table if you want), and you would just have a collection of pointers into the memory mapped file without ever actually loading any of the data into memory. Then, when the data is needed (e.g. when a user calls `table.column(0).chunk(0).value(0)`), the pointers would be dereferenced and, through the magic of memory mapping, the data would be loaded on demand. This loading on demand tends to be inefficient and _not_ what most IPC users are looking for (they just want to read an IPC file and expect they will be accessing the entire file), so I understand why the MADV_WILLNEED is there. However, for users that do want this, I'm not sure there is any way to achieve the load-on-demand semantics.

> The code you linked specifies a memory region and the following nbytes:

Yes, the actual implementation does call ReadAt with a memory region and nbytes. These are then used to create a slice into the underlying memory mapped area. If MADV_WILLNEED were _not_ called, this would be a zero-copy / zero-load operation that doesn't actually load anything from disk (it's just doing pointer math).
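For concreteness, here is a minimal sketch of the read path under discussion, assuming Arrow's C++ IPC file reader (used here instead of the stream reader since Sharvil wants random access; both go through ReadAt on a memory-mapped file). The function name, path handling, and error handling are illustrative only:

```
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status ReadOneBatch(const std::string& path) {
  // Opening the memory map loads no data; it only sets up the mapping.
  ARROW_ASSIGN_OR_RAISE(
      auto mmap_file,
      arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));

  // Opening the reader touches only the footer/schema metadata.
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(mmap_file));

  // This goes through MemoryMappedFile::ReadAt, which issues MADV_WILLNEED
  // for the batch's byte range before returning the zero-copy slice, so the
  // OS starts paging the data in even if no value is ever accessed.
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
  return arrow::Status::OK();
}
```

Under the semantics described above, dropping the WILLNEED hint would make ReadRecordBatch pure pointer math, with pages faulted in only on first access.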
On Tue, Jan 28, 2025 at 2:56 PM Aldrin <octalene....@pm.me> wrote:

> > ...and that function triggers the MADV_WILLNEED
>
> The code you linked specifies a memory region and the following nbytes:
>
> ```
> RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
>     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
> return memory_map_->Slice(position, nbytes);
> ```
>
> The original question said "Calling `read_all` on a stream triggers a complete read of the file". So, my impression is that read_all (I'm assuming via python) is either purposely specifying the whole file, or eventually (through multiple calls) specifying the whole file. I am curious how large the file itself is, though I assume it's larger than whatever size nbytes is defaulted to.
>
> But I also can't find which implementation of MemoryMap::Slice [1] is resolved by memory_map_->Slice(position, nbytes), which I don't think is likely to be problematic but I can't totally rule out either.
>
> Either way, if I understand correctly, Sharvil wants random access to only a few RecordBatches via Table methods, but I don't think that's possible with the Arrow library; the only ways are to manage accesses at the RecordBatch level, or maybe to use the Dataset or Acero APIs. Or am I forgetting something... or maybe I'm misunderstanding why Sharvil wants to specifically construct a Table rather than RecordBatches?
>
> [1]: https://github.com/apache/arrow/blob/apache-arrow-19.0.0/cpp/src/arrow/io/file.h#L216-L217
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Tuesday, January 28th, 2025 at 13:45, Weston Pace <weston.p...@gmail.com> wrote:
>
> > I believe the concern is that reading a record batch from a RecordBatchStreamReader triggers the MADV_WILLNEED advice to be sent to the OS before any data is accessed (and regardless of whether or not that data is ever accessed).
> >
> > I'm pretty sure the `RecordBatchStreamReader` uses `MemoryMappedFile::ReadAt` and that function triggers the MADV_WILLNEED [1]. This is contrary to the user expectation that only the data actually accessed would be loaded into memory.
> >
> > [1] https://github.com/apache/arrow/blob/ca2f4d68e834e600852d5af36dc2190741e33118/cpp/src/arrow/io/file.cc#L677
> >
> > On Tue, Jan 28, 2025 at 7:15 AM Aldrin <octalene....@pm.me> wrote:
> >
> > > > Then you should just use a memory-mapped file.
> > >
> > > Unless I'm misunderstanding their original message, I believe they are using a memory-mapped file. I'm not sure if other suggestions helped address the issue, but my understanding was that they were somehow triggering reads against the whole file anyway.
> > >
> > > I'm not sure why a Table is necessary (presumably some useful method in the API?) if accesses are sparse relative to the entire table; that sounds more aligned with RecordBatch access. I would think that any use of a Table method is going to trigger reads of every batch. I would also think that this scenario has two opportunities to do processing without triggering a scan of the whole file:
> > > 1. when a RecordBatch is read into memory
> > > 2. on the RecordBatches accumulated so far (a Table instance can be constructed from them without copies, I am pretty sure)
> > >
> > > I have little experience with mmap, so I don't have any particular thoughts there. Some extra information about how random access into the table occurs would be helpful, though.
> > >
> > > Sent from Proton Mail <https://proton.me/mail/home> for iOS
> > >
> > > On Tue, Jan 28, 2025 at 01:14, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > > On Sun, 26 Jan 2025 10:48:48 -0800, Sharvil Nanavati <shar...@lmnt.com> wrote:
> > > > > In a different context, fetching batches one-by-one would be a good way to control when the disk read takes place.
> > > > >
> > > > > In my context, I'm looking for a way to construct a Table without performing the bulk of the IO operations until the memory is accessed. I need random access to the table and my accesses are often sparse relative to the size of the entire table. Obviously there has to be *some* IO to read the schema and offsets, but that's tiny relative to the data itself.
> > > >
> > > > Then you should just use a memory-mapped file.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
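Regarding Aldrin's note above that a Table can be constructed from accumulated RecordBatches without copies: a minimal sketch of that approach, reading only the batches of interest and wrapping them (the helper name and index-selection logic are hypothetical; all batches must share the same schema):

```
#include <memory>
#include <vector>

#include <arrow/ipc/reader.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/table.h>

// Read only the batches of interest and wrap them in a Table.
// Table::FromRecordBatches does not copy buffer data; the Table simply
// references the batches' underlying (possibly memory-mapped) memory.
arrow::Result<std::shared_ptr<arrow::Table>> TableFromSelectedBatches(
    const std::shared_ptr<arrow::ipc::RecordBatchFileReader>& reader,
    const std::vector<int>& batch_indices) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  for (int i : batch_indices) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    batches.push_back(std::move(batch));
  }
  return arrow::Table::FromRecordBatches(batches);
}
```

Since the batches are zero-copy views into the mapped file, the resulting Table is too; whether their pages have already been read in depends on the MADV_WILLNEED behavior discussed above.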