Hi Andrew,

If the Arrow files are small, chances are the metadata (which is always read eagerly) is as large on disk as the actual data (which is "only" mmap'ed). Also, mmap'ing works at page granularity (a page is typically 4 kB on x86, sometimes larger on other architectures), and the kernel will typically read ahead a bit, so when the metadata is read, the kernel probably also pulls in some of the data laid out just after it.
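
If you want to put a number on that, here is a minimal sketch (the file path is hypothetical): the IPC file format ends with a 4-byte little-endian footer length followed by the trailing "ARROW1" magic, so the last 10 bytes of a file tell you roughly how much of it is footer metadata (record batches carry additional message headers on top of that):

    footer_vs_file <- function(path) {
      total <- file.size(path)
      con <- file(path, "rb")
      on.exit(close(con))
      # Skip to the 4-byte footer length stored just before the trailing magic
      seek(con, total - 10)
      footer <- readBin(con, "integer", n = 1, size = 4, endian = "little")
      c(file_bytes = total, footer_bytes = footer)
    }

    footer_vs_file("one_small_file.arrow")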

Generally, the Arrow IPC file/stream formats are designed for large data. If you have many very small files, you might want to rethink how you store your data on disk.
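
For example (a sketch only; the directory names and file extension are hypothetical), you could consolidate many tiny Feather files into a dataset with fewer, larger files:

    library(arrow)

    # Gather the small Feather/IPC files and rewrite them as one dataset
    files <- list.files("small_files_dir", pattern = "\\.arrow$", full.names = TRUE)
    ds <- open_dataset(files, format = "feather")
    write_dataset(ds, "combined_dir", format = "feather")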

Regards

Antoine.


On 09/05/2022 at 18:38, Andrew Piskorski wrote:
Hello, I'm using the R arrow package (arrow_7.0.0.tar.gz) in R 4.1.1 on Linux
(Ubuntu 18.04.4 LTS).

In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
with as_data_frame=FALSE on each one.  Compressed with lz4, each file
is quite small, often only 25 kB or so, but I'll often be mmap-ing
many thousands of them.  From the time this takes, I suspect that
Arrow is reading the full contents of each file rather than just
setting up the mmap, but I don't know how to properly check that.
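
(For concreteness, a minimal sketch of the call pattern I mean; paths are hypothetical:)

    library(arrow)

    # Open each small lz4-compressed Feather file as an Arrow Table,
    # without converting to a data.frame
    files <- list.files("many_small_files", pattern = "\\.arrow$", full.names = TRUE)
    tables <- lapply(files, read_feather, as_data_frame = FALSE)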

I would like to make sure that at this stage, I JUST mmap each file,
and defer reading their data until later when I actually need it.  Are
there any settings or arguments I can use to make sure that happens?
Or ways to verify precisely what is happening?
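
(One rough, Linux-only sketch of the kind of check I have in mind, with a hypothetical path: "rchar" in /proc/self/io counts bytes passed to read()-style system calls, and pages faulted in through mmap don't go through read(), so a jump in rchar on the order of the file size would suggest a full read rather than a pure mmap.)

    rchar_bytes <- function() {
      line <- grep("^rchar:", readLines("/proc/self/io"), value = TRUE)
      as.numeric(sub("^rchar:\\s*", "", line))
    }

    before <- rchar_bytes()
    tbl <- arrow::read_feather("one_small_file.arrow", as_data_frame = FALSE)
    after <- rchar_bytes()
    after - before   # approximate bytes that went through read(), not mmap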

I think I found the relevant C++ code in "r/src/io.cpp" and
"cpp/src/arrow/io/file.cc", but I definitely don't understand its
performance implications, nor how to control this sort of thing.

Thanks for your help and advice!
