> Or ways to verify precisely what is happening?

Regrettably, mmap is quite difficult to monitor. With strace you can
verify that the mapping is being set up:

    strace -y R --no-save < /tmp/script.R 2>&1 | grep -i foo.arrow
    ...
    mmap(NULL, 490, PROT_READ, MAP_PRIVATE, 3</tmp/foo.arrow>...

Once the mapping is set up, future reads are going to look a lot like
page faults. Perhaps the easiest thing to do is:

1. Ensure the file(s) are completely evicted from the OS's kernel cache
2. Run your test(s)
3. Use a tool like pcstat[1] to determine which parts of your file are
   now in the kernel cache

As Antoine said, you may need to account for a certain amount of
OS-level readahead.

[1] https://github.com/tobert/pcstat

On Mon, May 9, 2022 at 7:19 AM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>
> Hi Andrew,
> Unfortunately mmap is made to implement “transparent paging”, meaning that
> the OS takes control of when to read pages of the file to and from disk. This
> means that Arrow has no way of controlling when the file is actually
> read, and it’s possible that the OS is prefetching the whole file given files
> that small. That said, I’ve seen before that just the act of doing thousands
> of mmaps can be a significant overhead, as mmap is a fairly expensive system
> call.
>
> As for solutions, is there some reason you need mmap? Could you perhaps open
> an InputStream (equivalent to opening each file) for each file and then call
> read_feather later when you actually need it?
>
> Sasha Krassovsky
>
> > On May 9, 2022, at 09:38, Andrew Piskorski <a...@piskorski.com> wrote:
> >
> > Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux
> > (Ubuntu 18.04.4 LTS).
> >
> > In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
> > with as_data_frame=FALSE on each one. Compressed with lz4, each file
> > is quite small, often only 25 kB or so, but I'll often be mmap-ing
> > many thousands of them. From the time this takes, I suspect that
> > Arrow is reading the full contents of each file rather than just
> > setting up the mmap, but I don't know how to properly check that.
> >
> > I would like to make sure that at this stage, I JUST mmap each file,
> > and defer reading their data until later when I actually need it. Are
> > there any settings or arguments I can use to make sure that happens?
> > Or ways to verify precisely what is happening?
> >
> > I think I found the relevant C++ code in "r/src/io.cpp" and
> > "cpp/src/arrow/io/file.cc", but I definitely don't understand its
> > performance implications, nor how to control this sort of thing.
> >
> > Thanks for your help and advice!
> >
> > --
> > Andrew Piskorski <a...@piskorski.com>
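
A minimal sketch of the deferred-read idea suggested in the thread above
(keep only the path up front, and call the reader when the data is first
needed). It is written in Python purely for illustration; LazyFile and its
loader argument are hypothetical names and not part of Arrow. In R, the
loader would presumably be a call to arrow::read_feather() made at the
point of use.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyFile(Generic[T]):
    """Remember a path now; run the (hypothetical) loader on first access."""

    def __init__(self, path: str, loader: Callable[[str], T]) -> None:
        self.path = path
        self._loader = loader          # assumed: any function taking a path
        self._value: Optional[T] = None
        self._loaded = False

    @property
    def loaded(self) -> bool:
        # True only after get() has invoked the loader once.
        return self._loaded

    def get(self) -> T:
        # Nothing is opened, mapped, or read until this method is called;
        # the result is cached so the loader runs at most once.
        if not self._loaded:
            self._value = self._loader(self.path)
            self._loaded = True
        return self._value
```

Creating thousands of LazyFile objects costs no system calls at all, so
the open/mmap overhead is only paid for the files that are actually
touched later. Whether that helps depends on how many of the files end up
being read.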