On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:

> Generally, the Arrow IPC file/stream formats are designed for large 
> data. If you have many very small files you might try to rethink how you 
> store your data on disk.

Ah.  Is this because of the overhead of mmap itself, or the metadata
that must be read separately for each file (or both)?

Would creating my files with write_dataset() instead of write_feather()
help?  That is, with write_dataset() and open_dataset() I'd make fewer
calls to each, yet the partitioning of my dataset would still give me
an on-disk file layout similar to what I have now with individual
Arrow/Feather files.

Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
that's what mmap is for after all.  What I DON'T want is for Arrow to
WAIT for that data to actually be fetched.  Or at least I want it to
wait as little as possible, as presumably it must read some metadata.
Are there ways I should minimize the amount of (possibly redundant)
metadata Arrow needs to read?
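
To make the metadata question concrete, here is a rough sketch in plain
Python (stdlib mmap, not Arrow itself).  The Arrow IPC file format ends
with a footer, then an int32 footer size, then the "ARROW1" magic, so a
reader can in principle locate the footer by touching only the tail of
the mapping.  The file written below is SYNTHETIC: it copies only that
framing, and the body and footer bytes are placeholders.  The helper
names are hypothetical, and whether Arrow's own readers touch only
these pages is an assumption I have not verified.

```python
import mmap
import os
import struct
import tempfile

MAGIC = b"ARROW1"


def write_fake_ipc_file(path, body, footer):
    """Write a SYNTHETIC file that mimics Arrow IPC file framing only."""
    with open(path, "wb") as f:
        f.write(MAGIC + b"\x00\x00")             # magic, padded to 8 bytes
        f.write(body)                            # placeholder for record batches
        f.write(footer)                          # placeholder for the real footer
        f.write(struct.pack("<i", len(footer)))  # footer size, little-endian int32
        f.write(MAGIC)                           # trailing magic


def read_footer(path):
    """Map the file and slice out the footer without reading the body."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            n = len(mm)
            # Only the first and last few bytes of the mapping are touched.
            if mm[0:6] != MAGIC or mm[n - 6:n] != MAGIC:
                raise ValueError("missing ARROW1 magic")
            (footer_len,) = struct.unpack("<i", mm[n - 10:n - 6])
            footer_end = n - 10
            return bytes(mm[footer_end - footer_len:footer_end])
        finally:
            mm.close()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "fake.arrow")
    write_fake_ipc_file(path, body=b"\x00" * 1_000_000,
                        footer=b"schema+block-offsets")
    file_size = os.path.getsize(path)
    footer = read_footer(path)

print(len(footer), "footer bytes read from a", file_size, "byte file")
```

With many small files, this open + mmap + footer lookup is repeated
once per file, which is the fixed per-file cost I am asking about.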

-- 
Andrew Piskorski <a...@piskorski.com>
