> Or ways to verify precisely what is happening?

Unfortunately, mmap is quite difficult to monitor.  With strace you can
at least verify that the mapping is being set up:

    strace -y R --no-save < /tmp/script.R 2>&1 | grep -i foo.arrow
    ...
    mmap(NULL, 490, PROT_READ, MAP_PRIVATE, 3</tmp/foo.arrow>...
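
As a complement to strace, Linux also lists the current mappings of a
live process in /proc/<pid>/maps.  Below is a minimal Python sketch of
reading that file.  The helper name mapped_paths is made up for
illustration, and for an R session you would pass the PID that
Sys.getpid() reports instead of "self":

```python
import mmap
import os


def mapped_paths(pid="self"):
    """Return the set of file paths currently mapped by a process (Linux only)."""
    paths = set()
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            # Columns: address perms offset dev inode [pathname]
            fields = line.rstrip("\n").split(None, 5)
            if len(fields) == 6 and fields[5].startswith("/"):
                paths.add(fields[5])
    return paths


if __name__ == "__main__":
    import tempfile

    # Map a scratch file and check that the mapping is listed.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 4096)
        tmp.flush()
        path = os.path.realpath(tmp.name)
        with mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ):
            print(path in mapped_paths())
```

This only shows that a mapping exists.  It says nothing about which
pages have actually been read from disk.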

Once the mapping is set up, later reads show up as page faults rather
than as read system calls, so strace will not see them.  Perhaps the
easiest thing to do is:

 1. Ensure the file(s) are completely evicted from the OS page cache
 2. Run your test(s)
 3. Use a tool like pcstat[1] to determine which parts of your file are
    now in the page cache

As Antoine said, you may need to account for a certain amount of
OS-level readahead.
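
If pcstat is not at hand, steps 1 and 3 can be roughly approximated
from Python on 64-bit Linux.  This is only a sketch under assumptions:
the helper names evict and resident_pages are hypothetical,
POSIX_FADV_DONTNEED is advisory so eviction is not guaranteed, and
recent kernels may report every page as resident for files you neither
own nor could write to.

```python
import ctypes
import mmap
import os

# Symbols from the C library already loaded into this process.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_int64]
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]


def evict(path):
    """Step 1: ask the kernel to drop this file's cached pages (advisory)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def resident_pages(path):
    """Step 3: one bool per page, True if that page is in the page cache."""
    size = os.path.getsize(path)
    if size == 0:
        return []  # an empty file cannot be mapped
    fd = os.open(path, os.O_RDONLY)
    try:
        addr = _libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == ctypes.c_void_p(-1).value:  # MAP_FAILED
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            n_pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
            vec = (ctypes.c_ubyte * n_pages)()
            if _libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            # The low bit of each byte says whether the page is resident.
            return [bool(b & 1) for b in vec]
        finally:
            _libc.munmap(addr, size)
    finally:
        os.close(fd)
```

Call evict(path) on every file, run the R script, then look at
sum(resident_pages(path)) per file.  A count of zero suggests the file
was only mapped, while a full count may simply reflect readahead on
files this small.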

[1] https://github.com/tobert/pcstat

On Mon, May 9, 2022 at 7:19 AM Sasha Krassovsky
<krassovskysa...@gmail.com> wrote:
>
> Hi Andrew,
> Unfortunately mmap is designed to implement “transparent paging”, meaning that
> the OS decides when pages of the file are read from and written to disk. This
> means that Arrow has no way of controlling when the file is actually
> read, and it’s possible that the OS is prefetching the whole file, given how
> small the files are. That said, I’ve seen before that just the act of doing
> thousands of mmaps can carry significant overhead, as mmap is a fairly
> expensive system call.
>
> As for solutions, is there some reason you need mmap? Could you perhaps open 
> an InputStream (equivalent to opening each file) for each file and then call 
> read_feather later when you actually need it?
>
> Sasha Krassovsky
>
> > On May 9, 2022, at 09:38, Andrew Piskorski <a...@piskorski.com> wrote:
> >
> > Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux
> > (Ubuntu 18.04.4 LTS).
> >
> > In R, I am mmap-ing many small Arrow files by calling arrow::read_feather()
> > with as_data_frame=FALSE on each one.  Compressed with lz4, each file
> > is quite small, often only 25 kB or so, but I'll often be mmap-ing
> > many thousands of them.  From the time this takes, I suspect that
> > Arrow is reading the full contents of each file rather than just
> > setting up the mmap, but I don't know how to properly check that.
> >
> > I would like to make sure that at this stage, I JUST mmap each file,
> > and defer reading their data until later when I actually need it.  Are
> > there any settings or arguments I can use to make sure that happens?
> > Or ways to verify precisely what is happening?
> >
> > I think I found the relevant C++ code in "r/src/io.cpp" and
> > "cpp/src/arrow/io/file.cc", but I definitely don't understand its
> > performance implications, nor how to control this sort of thing.
> >
> > Thanks for your help and advice!
> >
> > --
> > Andrew Piskorski <a...@piskorski.com>
