I forgot to mention that you can see how this is glued together in
`feather::Reader::Read` [1]. This makes it obvious that nothing is
cached and everything is loaded into memory.
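
To make the mmap versus compression point from my earlier message
concrete, here is a rough sketch using only the Python standard
library. It is not Arrow code: `zlib` stands in for the IPC buffer
compression, a flat byte file stands in for one column buffer, and the
function names are made up for illustration.

```python
import mmap
import os
import tempfile
import zlib


def read_range_uncompressed(path, start, stop):
    """Read bytes [start, stop) of an uncompressed file through mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map copies out only the requested range, so the
            # OS only has to page in the part of the file backing it.
            return mm[start:stop]


def read_range_compressed(path, start, stop):
    """Read bytes [start, stop) of the decompressed content of a zlib file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The decompressor has to walk the whole compressed buffer and
            # allocate a new buffer for the full decompressed content.
            decompressed = zlib.decompress(mm)
    return decompressed[start:stop], len(decompressed)


def demo():
    # 1 MiB of data standing in for one column buffer (hypothetical size).
    payload = bytes(range(256)) * 4096
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "raw.bin")
        zlib_path = os.path.join(tmp, "compressed.bin")
        with open(raw_path, "wb") as f:
            f.write(payload)
        with open(zlib_path, "wb") as f:
            f.write(zlib.compress(payload))

        raw_slice = read_range_uncompressed(raw_path, 1000, 1010)
        zlib_slice, materialized = read_range_compressed(zlib_path, 1000, 1010)
    return payload, raw_slice, zlib_slice, materialized


if __name__ == "__main__":
    payload, raw_slice, zlib_slice, materialized = demo()
    print(len(raw_slice), "bytes copied for the uncompressed read")
    print(materialized, "bytes materialized for the compressed read")
```

Both reads return the same 10 bytes, but the compressed path has to
materialize the whole decompressed buffer first, which is the same
trade-off as with a compressed IPC file.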

François

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/feather.cc#L715-L723

On Wed, Jun 24, 2020 at 10:53 AM Francois Saint-Jacques
<fsaintjacq...@gmail.com> wrote:
>
> Hello Yue,
>
> FeatherV2 is just a facade for the Arrow IPC file format. You can find
> the implementation here [1]. I will try to answer your question with
> inline comments. At a high level, the file format writes a schema and
> then multiple "chunks" called RecordBatches. Your lowest level of
> granularity for fetching data is a RecordBatch [2]. Thus, a Table is
> divided into multiple RecordBatches at write-time and the file stores
> a series of said batches. When you read a file, you can either read
> the whole table, or do a point query on a single RecordBatch, e.g.
> `RecordBatchFileReader::ReadRecordBatch(int i)`. If you use the
> convenience API for reading the table in a single shot, e.g.
> `feather::Reader::Read`, it will decompress all buffers and
> materialize everything in memory.
>
> If you use compression, it means copying and decompressing the
> data. In other words, you'll have an RSS of the mmap size +
> decompressed size. If you don't use compression, the buffers will be
> zero-copy slices of the mmap-ed memory and *could* be lazily loaded,
> i.e. not read from disk until pointers are dereferenced. But this
> assumes that the reader code doesn't dereference, which might not
> always hold, e.g. sometimes we call
> `{Array,RecordBatch,Table}::Validate` to ensure well-formed arrays.
> This method can read the buffers of some types to validate that no
> segfault will happen at runtime.
>
> IMHO, mmap and compression for the IPC file format are mutually
> exclusive. If you use compression, you lose all the benefits of mmap
> and you might as well disable mmap. If you want lazy loading and late
> memory materialization (from disk), turn off compression.
>
> > 1) If a feather file contains multiple columns, are they compressed
> > separately? I assume each column is compressed separately, and instead of
> > decompressing the entire feather file, only the accessed column will be
> > decompressed, is it correct?
>
> They are compressed separately [3]. The Reader will decompress all
> columns of the requested batch. You can pass an option to restrict
> the read to the columns of interest [4].
>
> > 2) If a particular column value is randomly accessed via the column array's
> > index using mmap, will the entire column data be decompressed? I assume
> > only a portion of the column will be decompressed, is this correct?
>
> The entire column of the RecordBatch will be decompressed (and stored
> in memory). If your table has a single RecordBatch, then yes the whole
> column will be decompressed.
>
> > 3) If only part of the column is decompressed, what is the mechanism for
> > caching the decompressed data? For example, if we access 10
> > contiguous array values, do we need to decompress the column (or part of
> > the column) multiple times? What kind of access pattern could be not
> > friendly to this cache mechanism?
> > 4) If there is an internal caching mechanism, is there any way
> > users/developers could tune the cache for different use scenarios, for
> > example, some fields may store large text data which may need bigger cache.
>
> There is no caching. The RecordBatchReader yields a fully
> materialized batch, and it is up to the caller to decide how to
> handle the lifetime of that batch.
>
> Long story short:
> - it seems that you want lazy materialization via mmap to control the
> active memory usage. This is not going to work with compression.
> - if you use the ReadTable interface (instead of a stream reader) of
> the reader, you get a _fully_ materialized table, i.e. each
> RecordBatch is decompressed.
>
> The feather public API loads the whole table; you will need to work
> with the IPC interface if you want to do stream reading.
>
> François
>
> [1] https://github.com/apache/arrow/tree/master/cpp/src/arrow/ipc
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.h#L65-L90
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L113-L255
> [4] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L85
