Hi Andrew,

If the Arrow files are small, chances are the metadata (which is always read eagerly) is as large on disk as the actual data (which is "only" mmap'ed). Also, mmap'ing works at page granularity (a page is typically 4 kB on x86, sometimes larger on other architectures), and the kernel will typically read ahead a bit, so when the metadata is read, the kernel probably also pulls in some of the data laid out just after it.
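
If you want to put a number on that, here is a minimal sketch (the file path is hypothetical): the IPC file format ends with a 4-byte little-endian footer length followed by the trailing "ARROW1" magic, so the last 10 bytes of a file tell you roughly how much of it is footer metadata (record batches carry additional message headers on top of that):

    footer_vs_file <- function(path) {
      total <- file.size(path)
      con <- file(path, "rb")
      on.exit(close(con))
      # Skip to the 4-byte footer length stored just before the trailing magic
      seek(con, total - 10)
      footer <- readBin(con, "integer", n = 1, size = 4, endian = "little")
      c(file_bytes = total, footer_bytes = footer)
    }

    footer_vs_file("one_small_file.arrow")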

Generally, the Arrow IPC file/stream formats are designed for large data. If you have many very small files, you might want to rethink how you store your data on disk.
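
For example (a sketch only; the directory names and file extension are hypothetical), you could consolidate many tiny Feather files into a dataset with fewer, larger files:

    library(arrow)

    # Gather the small Feather/IPC files and rewrite them as one dataset
    files <- list.files("small_files_dir", pattern = "\\.arrow$", full.names = TRUE)
    ds <- open_dataset(files, format = "feather")
    write_dataset(ds, "combined_dir", format = "feather")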

Regards

Antoine.


On 09/05/2022 at 18:38, Andrew Piskorski wrote:
Hello, I'm using the R arrow package (arrow_7.0.0.tar.gz) in R 4.1.1 on Linux
(Ubuntu 18.04.4 LTS).

In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
with as_data_frame=FALSE on each one.  Compressed with lz4, each file
is quite small, often only 25 kB or so, but I'll often be mmap-ing
many thousands of them.  From the time this takes, I suspect that
Arrow is reading the full contents of each file rather than just
setting up the mmap, but I don't know how to properly check that.
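
(For concreteness, a minimal sketch of the call pattern I mean; paths are hypothetical:)

    library(arrow)

    # Open each small lz4-compressed Feather file as an Arrow Table,
    # without converting to a data.frame
    files <- list.files("many_small_files", pattern = "\\.arrow$", full.names = TRUE)
    tables <- lapply(files, read_feather, as_data_frame = FALSE)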

I would like to make sure that at this stage, I JUST mmap each file,
and defer reading their data until later when I actually need it.  Are
there any settings or arguments I can use to make sure that happens?
Or ways to verify precisely what is happening?
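
(One rough, Linux-only sketch of the kind of check I have in mind, with a hypothetical path: "rchar" in /proc/self/io counts bytes passed to read()-style system calls, and pages faulted in through mmap don't go through read(), so a jump in rchar on the order of the file size would suggest a full read rather than a pure mmap.)

    rchar_bytes <- function() {
      line <- grep("^rchar:", readLines("/proc/self/io"), value = TRUE)
      as.numeric(sub("^rchar:\\s*", "", line))
    }

    before <- rchar_bytes()
    tbl <- arrow::read_feather("one_small_file.arrow", as_data_frame = FALSE)
    after <- rchar_bytes()
    after - before   # approximate bytes that went through read(), not mmap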

I think I found the relevant C++ code in "r/src/io.cpp" and
"cpp/src/arrow/io/file.cc", but I definitely don't understand its
performance implications, nor how to control this sort of thing.

Thanks for your help and advice!
