Hello Yue,

FeatherV2 is just a facade for the Arrow IPC file format. You can find
the implementation here [1]. I will try to answer your question with
inline comments. On a high level, the file format writes a schema and
then multiple "chunks" called RecordBatch. Your lowest level of
granularity for fetching data is a RecordBatch [2]. Thus, a Table is
divided into multiple RecordBatches at write time, and the file stores
a series of said batches. When you read a file, you can either read
the whole table or do a point query on a single RecordBatch, e.g.
`RecordBatchFileReader::ReadRecordBatch(int i)`. If you use the
convenience API for reading the table in a single shot, e.g.
`feather::Reader::Read`, it will decompress all buffers and
materialize everything in memory.
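
To make the layout concrete, here is a minimal toy sketch (NOT the
real Arrow IPC format, just the same idea): a schema header, a series
of serialized batches, and a footer of byte offsets so batch i can be
fetched without touching the others. All names and the on-disk layout
here are made up for illustration.

```python
import io
import json
import struct

def write_file(schema, batches):
    # Layout: [schema header][batch 0][batch 1]...[offsets][trailer]
    buf = io.BytesIO()
    header = json.dumps(schema).encode()
    buf.write(struct.pack("<I", len(header)))
    buf.write(header)
    offsets = []
    for batch in batches:
        payload = json.dumps(batch).encode()
        offsets.append(buf.tell())          # remember where batch starts
        buf.write(struct.pack("<I", len(payload)))
        buf.write(payload)
    footer_pos = buf.tell()
    buf.write(struct.pack("<" + "Q" * len(offsets), *offsets))
    buf.write(struct.pack("<IQ", len(offsets), footer_pos))
    return buf.getvalue()

def read_batch(data, i):
    # Point query: read the footer, then jump straight to batch i.
    n, footer_pos = struct.unpack_from("<IQ", data, len(data) - 12)
    offsets = struct.unpack_from("<" + "Q" * n, data, footer_pos)
    length, = struct.unpack_from("<I", data, offsets[i])
    start = offsets[i] + 4
    return json.loads(data[start:start + length])

data = write_file({"cols": ["a"]}, [{"a": [1, 2]}, {"a": [3, 4]}])
assert read_batch(data, 1) == {"a": [3, 4]}
```

The footer-of-offsets trick is what makes
`RecordBatchFileReader::ReadRecordBatch(int i)`-style random access
possible without scanning the preceding batches.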

If you use compression, reading means copying and decompressing the
data. In other words, you'll have an RSS of the mmap size +
decompressed size. If you don't use compression, the buffers will be
zero-copy slices of the mmap-ed memory and *could* be lazily loaded
until pointers are dereferenced. But this assumes that the reader code
doesn't dereference, which might not always hold, e.g. sometimes we
call `{Array,RecordBatch,Table}::Validate` to ensure well-formed
arrays. For some types, this method reads buffer contents to verify
that no segfault will happen at runtime.
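
The zero-copy-until-dereferenced behavior can be sketched with the
Python stdlib: slicing mmap-ed memory through a memoryview copies
nothing, and physical pages are only faulted in when the bytes are
actually read. This mirrors an uncompressed Arrow buffer that is a
zero-copy slice of the mapped file (the temp-file setup is just
scaffolding for the demo).

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"x" * (1 << 20))  # 1 MiB file

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)[4096:8192]  # zero-copy slice, no I/O yet
    first = view[0]                   # dereference: page is now resident
    view.release()
    mm.close()
os.remove(path)
assert first == ord("x")
```

With compression in the picture, the equivalent of `view[0]` forces
the whole buffer to be decompressed into fresh memory first, which is
why the RSS grows to mmap size + decompressed size.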

IMHO, mmap and compression for the IPC file format are mutually
exclusive. If you use compression, you lose all the benefits of mmap
and you might as well disable mmap. If you want lazy loading and late
memory materialization (from disk), turn off compression.

> 1) If a feather file contains multiple columns, are they compressed
> separately? I assume each column is compressed separately, and instead of
> decompressing the entire feather file, only the accessed column will be
> decompressed, is it correct?

They are compressed separately [3]. The Reader will decompress all
columns of the requested batch. You can pass an option to limit the
number of columns [4] of interest.
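
A rough stdlib sketch of "each column is its own compressed buffer"
(zlib standing in for the real codec, column names invented):

```python
import zlib

columns = {
    "id":   b"\x00\x01\x02\x03" * 1000,
    "text": b"some long repeated text " * 1000,
}
# Each column is compressed independently into its own buffer.
compressed = {name: zlib.compress(data) for name, data in columns.items()}

def read_column(name):
    # Only the requested column's buffer is decompressed.
    return zlib.decompress(compressed[name])

assert read_column("id") == columns["id"]
```

This is why limiting the set of columns of interest also limits the
decompression work per batch.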

> 2) If a particular column value is randomly accessed via the column array's
> index using mmap, will the entire column data be decompressed? I assume
> only a portion of the column will be decompressed, is this correct?

The entire column of the RecordBatch will be decompressed (and stored
in memory). If your table has a single RecordBatch, then yes the whole
column will be decompressed.
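
This is also why splitting a table into several RecordBatches at
write time bounds the cost of a point lookup: you decompress one
batch's worth of the column, not the whole column. A stdlib sketch of
that idea (batch size and data are made up):

```python
import zlib

BATCH = 1024
values = bytes(range(256)) * 64          # 16 KiB "column"
# The column is chunked into batches, each compressed independently.
batches = [zlib.compress(values[i:i + BATCH])
           for i in range(0, len(values), BATCH)]

def get(index):
    # Decompress only the batch containing the requested element.
    chunk = zlib.decompress(batches[index // BATCH])
    return chunk[index % BATCH]

assert get(5000) == values[5000]
```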

> 3) If only part of the column is decompressed, what is the mechanism for
> caching the decompressed data? For example, if we access 10
> contiguous array values, do we need to decompress the column (or part of
> the column) multiple times? What kind of access pattern could be not
> friendly to this cache mechanism?
> 4) If there is an internal caching mechanism, is there any way
> users/developers could tune the cache for different use scenarios, for
> example, some fields may store large text data which may need bigger cache.

There is no caching: the RecordBatchReader yields a fully materialized
batch, and it is up to the caller to decide how to handle the lifetime
of such a batch.
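
If you do want caching, it has to live on your side. One possible
shape (purely illustrative, with zlib standing in for batch
decompression) is an LRU keyed by batch index:

```python
import zlib
from functools import lru_cache

# Pretend these are the compressed batches on disk.
raw_batches = [zlib.compress(bytes([i]) * 4096) for i in range(8)]

@lru_cache(maxsize=4)          # tune per workload / memory budget
def load_batch(i):
    return zlib.decompress(raw_batches[i])

load_batch(0); load_batch(0); load_batch(1)
info = load_batch.cache_info()
assert info.hits == 1 and info.misses == 2
```

The cache size is exactly the kind of knob you asked about in (4):
since the library does not cache, the tuning is entirely in the
caller's hands.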

Long story short:
- it seems that you want lazy materialization via mmap to control the
active memory usage. This is not going to work with compression.
- if you use the ReadTable interface of the reader (instead of a
stream reader), you get a _fully_ materialized table, i.e. each
RecordBatch is decompressed.

The feather public API loads the whole table; you will need to work
with the IPC interface if you want to do stream reading.

François

[1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/ipc
[2] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L65-L90
[3] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L113-L255
[4] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L85
