If I'm understanding the below correctly, it seems that the file format supports finding an arbitrary compressed buffer without decompressing anything else. Correct?
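
Concretely, once a Buffer entry (see the struct below) gives the offset and length of one buffer within the memory-mapped message body, decoding just that buffer would look roughly like the following. This is only a sketch of the BodyCompressionMethod.BUFFER framing quoted further down the thread: it assumes LZ4 frame compression and the third-party lz4 package, and decode_buffer is a hypothetical helper, not a pyarrow API.

import struct

import lz4.frame  # assumption: LZ4_FRAME codec; zstandard would be analogous


def decode_buffer(body: memoryview, offset: int, length: int) -> bytes:
    """Decode one buffer from the body of a compressed RecordBatch,
    given its Buffer{offset, length} entry from the batch metadata."""
    buf = body[offset:offset + length]
    # First 8 bytes: uncompressed length as a 64-bit little-endian signed
    # integer; -1 means the bytes that follow are stored uncompressed.
    (uncompressed_len,) = struct.unpack_from("<q", buf, 0)
    payload = bytes(buf[8:])
    if uncompressed_len == -1:
        return payload
    out = lz4.frame.decompress(payload)
    assert len(out) == uncompressed_len
    return out

Nothing else in the file would need to be decompressed for that step; the open question is how cheaply the (offset, length) pairs themselves can be reached.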

-John

/// ----------------------------------------------------------------------
/// A Buffer represents a single contiguous memory segment
struct Buffer {
  /// The relative offset into the shared memory page where the bytes for this
  /// buffer starts
  offset: long;

  /// The absolute length (in bytes) of the memory buffer. The memory is found
  /// from offset (inclusive) to offset + length (non-inclusive). When building
  /// messages using the encapsulated IPC message, padding bytes may be written
  /// after a buffer, but such padding bytes do not need to be accounted for in
  /// the size here.
  length: long;
}

On Thu, Sep 22, 2022 at 9:10 AM John Muehlhausen <j...@jgm.org> wrote:

> Regarding tab=feather.read_table(fname, memory_map=True)
>
> Uncompressed: low-cost setup and len(tab); data is read when sections of
> the map are "paged in" by the OS.
> Compressed (desired):
> * low-cost setup
> * read the length of the "table" without decompressing anything (len(tab))
> * low-cost slicing
> * data is decompressed only on dereference, such that there is a "batch
> column cache" corresponding to the lower-level page cache concept; batch
> columns are only decompressed when they are accessed. Repeated access does
> not re-decompress (subject to LRU column cache capacity).
>
> The above design would allow drop-in replacement of compressed files for
> uncompressed files for workloads that depend on the characteristics of
> uncompressed mmap, it seems to me. The only difference would be two levels
> of "paging in" (page cache -> a new column cache on the heap) instead of
> just the OS page cache level.
>
> Is there anything about the file format that would make this impossible?
> *For example, do we know where each compressed buffer starts without
> needing to decompress buffers around it?*
>
> Thanks,
> John
>
> On Wed, Sep 21, 2022 at 11:53 PM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
>
>> Hi,
>>
>> AFAIK compressed IPC arrow files do not support random access (like the
>> uncompressed counterparts) - you need to decompress the whole batch (or
>> at least the columns you need). A "RecordBatch" is the compression unit
>> of the file. Think of it like a parquet file whose every row group has a
>> single data page.
>>
>> With that said, the file contains the number of rows that each record
>> batch has, and thus you can compute which record batch you can start
>> from. This is available in the RecordBatch message "length" [1].
>> Unfortunately you do need to read the record batch message to know it,
>> and these messages are located in the middle of the file:
>>
>> [header][record_batch1_message][record_batch1_data][record_batch2_message][record_batch2_data]...[footer]
>>
>> The positions of the messages are declared in the file footer's
>> "record_batches" field.
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
>>
>> Best,
>> Jorge
>>
>> On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen <j...@jgm.org> wrote:
>>
>> > Why aren't all the compressed batches the chunk size I specified in
>> > write_feather (7000000)? How can I know which batch my slice resides
>> > in if this is not a constant? Using pyarrow 9.0.0.
>> >
>> > This file contains 1.5 billion rows. I need a way to know where to
>> > look for, say, [780567127, 922022522).
>> >
>> > 0.7492516040802002 done 0 len 7000000
>> > 1.7520167827606201 done 1 len 7000000
>> > 3.302407741546631 done 2 len 4995912
>> > 5.16457986831665 done 3 len 7000000
>> > 6.0424370765686035 done 4 len 4706276
>> > 7.58642315864563 done 5 len 7000000
>> > 7.719322681427002 done 6 len 289636
>> > 8.705692291259766 done 7 len 5698775
>> >
>> > On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > > The following seems like good news... like I should be able to
>> > > decompress just one column of a RecordBatch in the middle of a
>> > > compressed feather v2 file. Is there a Python API for this kind of
>> > > access? C++?
>> > >
>> > > /// Provided for forward compatibility in case we need to support
>> > > /// different strategies for compressing the IPC message body (like
>> > > /// whole-body compression rather than buffer-level) in the future
>> > > enum BodyCompressionMethod:byte {
>> > >   /// Each constituent buffer is first compressed with the indicated
>> > >   /// compressor, and then written with the uncompressed length in
>> > >   /// the first 8 bytes as a 64-bit little-endian signed integer
>> > >   /// followed by the compressed buffer bytes (and then padding as
>> > >   /// required by the protocol). The uncompressed length may be set
>> > >   /// to -1 to indicate that the data that follows is not
>> > >   /// compressed, which can be useful for cases where compression
>> > >   /// does not yield appreciable savings.
>> > >   BUFFER
>> > > }
>> > >
>> > > On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen <j...@jgm.org> wrote:
>> > >
>> > >> ``Internal structure supports random access and slicing from the
>> > >> middle. This also means that you can read a large file chunk by
>> > >> chunk without having to pull the whole thing into memory.''
>> > >> https://ursalabs.org/blog/2020-feather-v2/
>> > >>
>> > >> For a compressed v2 file, can I decompress just one column of a
>> > >> batch in the middle, or is the entire batch with all of its columns
>> > >> compressed as a unit?
>> > >>
>> > >> Unfortunately reader.get_batch(i) seems like it is doing a lot of
>> > >> work. Like maybe decompressing all the columns?
>> > >>
>> > >> Thanks,
>> > >> John
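
To make the row-range question above concrete, a lookup for [780567127, 922022522) could be approximated today with public pyarrow calls along these lines. This is a sketch only: read_row_range is an illustrative helper, and, as discussed in the thread, get_batch() appears to decompress every column of each batch it touches, so batches before the requested range are still paid for just to learn their row counts.

import pyarrow as pa
import pyarrow.ipc as ipc


def read_row_range(path: str, start: int, stop: int) -> pa.Table:
    """Return rows [start, stop) of a Feather v2 / Arrow IPC file (sketch)."""
    with pa.memory_map(path, "r") as source:
        reader = ipc.open_file(source)
        pieces = []
        first_row = 0  # global row index of the current batch's first row
        for i in range(reader.num_record_batches):
            batch = reader.get_batch(i)  # decompresses this whole batch
            last_row = first_row + batch.num_rows
            if last_row > start and first_row < stop:
                lo = max(start, first_row) - first_row
                hi = min(stop, last_row) - first_row
                pieces.append(batch.slice(lo, hi - lo))
            if last_row >= stop:
                break
            first_row = last_row
        return pa.Table.from_batches(pieces, schema=reader.schema)


# e.g. tab = read_row_range("data.feather", 780567127, 922022522)

A cumulative row-count index could be cached after the first pass, but ideally the per-batch lengths would come from the RecordBatch message metadata alone, without decompressing any buffers - which is exactly the gap discussed above.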