On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote:

> Generally, the Arrow IPC file/stream formats are designed for large
> data. If you have many very small files you might try to rethink how you
> store your data on disk.

Ah. Is this because of the overhead of mmap itself, the metadata that
must be read separately for each file, or both?

Would creating my files with write_dataset() instead of write_feather()
help? That is, with write_dataset() and open_dataset() I'd make fewer
calls to each, but the partitioning of my dataset would give me a layout
of files on disk similar to what I have now with individual
Arrow/Feather files.

Btw, I have no problem if Linux decides to pre-fetch my mmap-ed data;
that's what mmap is for, after all. What I DON'T want is for Arrow to
WAIT for that data to actually be fetched. Or at least I want it to wait
as little as possible, as presumably it must read some metadata.

Are there ways I can minimize the amount of (possibly redundant)
metadata Arrow needs to read?

-- 
Andrew Piskorski <a...@piskorski.com>
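
To make the "map now, wait only for the metadata" idea concrete, here is a minimal sketch using only Python's standard library, not Arrow. It assumes a toy file layout (payload, then a footer, then a 4-byte footer length), loosely analogous to the Arrow IPC file format keeping its footer at the end of the file. The layout and the names write_toy_file, read_footer, and demo are hypothetical, chosen for illustration only. Mapping the file does not by itself read the payload; slicing the footer touches only the trailing bytes, although how much the kernel reads ahead is up to the OS.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the Arrow format):
#   [payload][footer][4-byte little-endian footer length]
FOOTER_LEN_BYTES = 4


def write_toy_file(path, payload, footer):
    """Write payload, then footer, then the footer's length."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))


def read_footer(path):
    """Map the whole file, but copy out only the trailing footer bytes."""
    with open(path, "rb") as f:
        # Creating the mapping does not read the payload from disk.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            # Only the pages holding these slices need to be resident.
            (footer_len,) = struct.unpack(
                "<I", mm[size - FOOTER_LEN_BYTES:size]
            )
            start = size - FOOTER_LEN_BYTES - footer_len
            return bytes(mm[start:start + footer_len])


def demo():
    """Write a 1 MiB payload plus a small footer, then read the footer back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_toy_file(path, b"\x00" * (1 << 20), b'{"columns": ["a", "b"]}')
        return read_footer(path)
```

With many small files, this per-file open, map, and footer read is repeated once per file, which is one plausible source of the overhead asked about above.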