Hi there,

I am evaluating Feather v2 as an on-disk format for data that needs random access. I ran some experiments to measure performance, but there are too many scenarios to verify each one, so I am looking for details on how it works internally to understand whether it meets my requirements, in particular regarding random access and compression/decompression. I have not been able to find any documentation describing this. Could someone shed some light on it?
So far I have read some Arrow source code and PRs such as the two below, but I still do not understand how it works internally (most likely because I am not familiar with Flatbuffers):

* ARROW-300: [Format] Proposal for "trivial" IPC body buffer compression using either LZ4 or ZSTD codecs, https://github.com/apache/arrow/pull/6707
* ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format, https://github.com/apache/arrow/pull/6694

I would like to understand how Feather v2 performs decompression in general when a file is randomly accessed via mmap, and I have some specific questions:

1) If a Feather file contains multiple columns, are they compressed separately? I assume each column is compressed separately, so that only the accessed column is decompressed rather than the entire file. Is that correct?

2) If a particular column value is randomly accessed by its array index using mmap, is the entire column decompressed? I assume only a portion of the column is decompressed. Is that correct?

3) If only part of the column is decompressed, what is the mechanism for caching the decompressed data? For example, if we access 10 contiguous array values, is the column (or part of it) decompressed multiple times? What kinds of access patterns are unfriendly to this cache mechanism?

4) If there is an internal caching mechanism, is there any way for users/developers to tune the cache for different use cases? For example, some fields may store large text data and may need a bigger cache.

Besides the questions above, I would like to learn more about the details, so it would be great if someone could point me to any documentation or to the part of the source code that I should check out.

Any help is appreciated. Thanks.

Regards,
Yue