Hi there,

I am evaluating Feather v2 as an on-disk format for data that needs random
access. I ran some experiments to measure performance, but there are too
many scenarios to verify each one. I am therefore looking for details about
how it works internally, in particular its random access and
compression/decompression, to understand whether it satisfies my
requirements. So far I have not been able to find any documentation
describing this. Could someone shed some light on it?

So far I have read some Arrow source code and PRs such as the two below,
but I still have no idea how it works internally (likely because I am not
familiar with FlatBuffers):
* ARROW-300: [Format] Proposal for "trivial" IPC body buffer compression
using either LZ4 or ZSTD codecs, https://github.com/apache/arrow/pull/6707
* ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC
file format, https://github.com/apache/arrow/pull/6694

I would like to understand how Feather v2 performs decompression in general
when a file is randomly accessed via mmap, and I have some specific
questions:
1) If a Feather file contains multiple columns, are they compressed
separately? I assume each column is compressed separately, so that only the
accessed column is decompressed rather than the entire Feather file. Is
that correct?
2) If a particular column value is randomly accessed via the column array's
index using mmap, will the entire column be decompressed? I assume only a
portion of the column will be decompressed. Is that correct?
3) If only part of the column is decompressed, what is the mechanism for
caching the decompressed data? For example, if we access 10 contiguous
array values, does the column (or part of the column) need to be
decompressed multiple times? What kinds of access patterns would be
unfriendly to this caching mechanism?
4) If there is an internal caching mechanism, is there any way for
users/developers to tune the cache for different usage scenarios? For
example, some fields may store large text data and may need a bigger cache.
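
To make questions 2) and 3) concrete, here is a toy model of the behaviour
I am assuming. It is only a sketch of my assumption, not Arrow's actual
implementation: zlib stands in for LZ4/ZSTD, and the names compress_column,
ChunkedColumn and CHUNK_ROWS are hypothetical. Each chunk of a column is
compressed separately, and a read decompresses only the chunk that contains
the requested index, keeping the last decompressed chunk in memory.

```python
import struct
import zlib

CHUNK_ROWS = 4  # hypothetical number of rows per compressed chunk


def compress_column(values, chunk_rows=CHUNK_ROWS):
    """Split a column of int64 values into chunks; compress each separately."""
    chunks = []
    for start in range(0, len(values), chunk_rows):
        part = values[start:start + chunk_rows]
        raw = struct.pack(f"<{len(part)}q", *part)
        chunks.append(zlib.compress(raw))
    return chunks


class ChunkedColumn:
    """Toy reader with a one-chunk cache (an assumption, not Arrow's design)."""

    def __init__(self, chunks, chunk_rows=CHUNK_ROWS):
        self.chunks = chunks
        self.chunk_rows = chunk_rows
        self.decompress_calls = 0   # counts how often a chunk is decompressed
        self._cached_id = None
        self._cached_values = None

    def __getitem__(self, index):
        # Locate the chunk holding this row and the row's offset inside it.
        chunk_id, offset = divmod(index, self.chunk_rows)
        if chunk_id != self._cached_id:
            # Cache miss: decompress the whole chunk, then keep it around.
            raw = zlib.decompress(self.chunks[chunk_id])
            self.decompress_calls += 1
            self._cached_values = struct.unpack(f"<{len(raw) // 8}q", raw)
            self._cached_id = chunk_id
        return self._cached_values[offset]


column = ChunkedColumn(compress_column(list(range(100, 120))))

# Contiguous reads that fall inside one chunk.
contiguous = [column[i] for i in range(4, 8)]
calls_after_contiguous = column.decompress_calls

# Alternating reads between two distant chunks.
alternating = [column[0], column[19], column[0]]
calls_after_alternating = column.decompress_calls
```

In this model the contiguous reads share a single decompression, while
alternating between two distant indices triggers a decompression on every
read. That is the kind of unfriendly pattern I am asking about in 3).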

Beyond the questions above, I would like to learn more about how this
works, and it would be great if someone could point me to any documentation
or to the relevant parts of the source code. Any help is appreciated.
Thanks.

Regards,
Yue
