Hello Yue,

Feather V2 is just a facade for the Arrow IPC file format; you can find the implementation here [1]. I will try to answer your questions with inline comments.

At a high level, the file format writes a schema and then multiple "chunks" called RecordBatches. Your lowest level of granularity for fetching data is a RecordBatch [2]. Thus, a Table is divided into multiple RecordBatches at write time, and the file stores a series of said batches. When you read a file, you can either read the whole table or do point queries on individual batches, e.g. `RecordBatchFileReader::ReadRecordBatch(int i)`. If you use the convenience API for reading the table in a single shot, e.g. `feather::Reader::Read`, it will decompress all buffers and materialize everything in memory.
If you use compression, reading means copying and decompressing the data. In other words, you'll have an RSS of the mmap size + the decompressed size. If you don't use compression, the buffers will be zero-copy slices of the mmap-ed memory and *could* be lazily loaded until the pointers are dereferenced. But this assumes that the reader code doesn't dereference them, which might not always hold; e.g. sometimes we call `{Array,RecordBatch,Table}::Validate` to ensure well-formed arrays. This method can read the buffers for some types to validate that no segfault will happen at runtime.

IMHO, mmap and compression for the IPC file format are mutually exclusive. If you use compression, you lose all the benefits of mmap and you might as well disable mmap. If you want lazy loading and late memory materialization (from disk), turn off compression.

> 1) If a feather file contains multiple columns, are they compressed
> separately? I assume each column is compressed separately, and instead of
> decompressing the entire feather file, only the accessed column will be
> decompressed, is it correct?

They are compressed separately [3]. The Reader will decompress all columns of the requested batch. You can pass an option to limit the number of columns [4] of interest.

> 2) If a particular column value is randomly accessed via the column array's
> index using mmap, will the entire column data be decompressed? I assume
> only a portion of the column will be decompressed, is this correct?

The entire column of the RecordBatch will be decompressed (and stored in memory). If your table has a single RecordBatch, then yes, the whole column will be decompressed.

> 3) If only part of the column is decompressed, what is the mechanism for
> caching the decompressed data? For example, if we access 10
> contiguous array values, do we need to decompress the column (or part of
> the column) multiple times? What kind of access pattern could be not
> friendly to this cache mechanism?
> 4) If there is an internal caching mechanism, is there any way
> users/developers could tune the cache for different use scenarios, for
> example, some fields may store large text data which may need bigger cache.

There is no caching: the RecordBatchReader yields a fully materialized batch, and it is up to the caller to decide how to handle the lifetime of that batch.

Long story short:

- It seems that you want lazy materialization via mmap to control the active memory usage. This is not going to work with compression.
- If you use the ReadTable interface of the reader (instead of a stream reader), you get a _fully_ materialized table, i.e. each RecordBatch is decompressed.

The Feather public API loads the whole table; you will need to work with the IPC interface if you want to do stream reading.

François

[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/ipc
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L65-L90
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L113-L255
[4] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L85