Re: Compression?

Wes McKinney Wed, 16 Sep 2020 07:40:45 -0700

On Tue, Sep 15, 2020 at 7:46 PM Jacob Quinn <quinn.jac...@gmail.com> wrote:
>
> Ah, that's where it was.
>
> Ok, so if I understand correctly, individual buffers are compressed, and in
> the Buffer struct, the buffer length is the _compressed_ length? And when
> written, the _uncompressed_ length is first written in 8 bytes, then the
> compressed buffer?


The buffer length in the metadata is the length of the full payload as
written, that is, the 8-byte length prefix plus the compressed bytes.
You are correct on the second point.
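
To make the layout concrete, here is a minimal sketch in Python. It
assumes the 8-byte prefix is a little-endian signed 64-bit integer (my
reading of the comment in Message.fbs), and it uses zlib from the
standard library purely as a stand-in for LZ4/ZSTD. The names
encode_buffer and decode_buffer are hypothetical, not Arrow APIs.

```python
import struct
import zlib

# Assumption: the prefix is an 8-byte little-endian signed integer.
PREFIX = struct.Struct("<q")


def encode_buffer(raw: bytes, compress=zlib.compress) -> bytes:
    """Return the on-the-wire payload: uncompressed length, then compressed bytes.

    zlib is only a stand-in codec here; the format itself uses LZ4 or ZSTD.
    """
    return PREFIX.pack(len(raw)) + compress(raw)


def decode_buffer(payload: bytes, decompress=zlib.decompress) -> bytes:
    """Read the length prefix, decompress the rest, and sanity-check the size."""
    (uncompressed_len,) = PREFIX.unpack_from(payload, 0)
    raw = decompress(payload[PREFIX.size:])
    if len(raw) != uncompressed_len:
        raise ValueError("decompressed size does not match the length prefix")
    return raw


raw = b"arrow" * 1000
payload = encode_buffer(raw)

# The length recorded in the Buffer metadata would be len(payload),
# i.e. the 8-byte prefix plus the compressed bytes.
metadata_length = len(payload)
roundtrip = decode_buffer(payload)
```

A reader therefore learns the size of the output buffer from the prefix
before decompressing, and the metadata length tells it how many bytes to
read from the message body.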

> What's the general strategy for dealing with compressed buffers? Uncompress
> the whole thing when deserializing a compressed buffer? Or is decompressing
> delayed until individual elements are accessed? I'm guessing the former
> since it doesn't seem like you'd be able to do random-access into a
> compressed buffer?

It depends on the implementation of course, but in the C++ library we
decompress everything at IPC reconstruction time. One could develop a
RecordBatch-compatible interface to decompress lazily if you wanted.
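
As a rough illustration of the lazy option (not how the C++ library
works), a buffer wrapper could hold on to the compressed payload and
decompress only on first access. Everything below is hypothetical: the
LazyBuffer class is invented for this sketch, zlib stands in for
LZ4/ZSTD, and the 8-byte prefix is assumed to be a little-endian signed
64-bit integer.

```python
import struct
import zlib


class LazyBuffer:
    """Hypothetical wrapper that decompresses its payload on first access."""

    def __init__(self, payload: bytes, decompress=zlib.decompress):
        self._payload = payload        # 8-byte length prefix + compressed bytes
        self._decompress = decompress  # stand-in codec, not LZ4/ZSTD
        self._raw = None               # filled in on first access

    @property
    def is_decompressed(self) -> bool:
        return self._raw is not None

    def data(self) -> bytes:
        if self._raw is None:
            (expected_len,) = struct.unpack_from("<q", self._payload, 0)
            raw = self._decompress(self._payload[8:])
            if len(raw) != expected_len:
                raise ValueError("decompressed size does not match the length prefix")
            self._raw = raw
            self._payload = None       # drop the compressed copy
        return self._raw

    def __getitem__(self, index):
        # Element access still pays for decompressing the whole buffer once;
        # there is no random access into the compressed bytes.
        return self.data()[index]


values = bytes(range(256)) * 4
lazy = LazyBuffer(struct.pack("<q", len(values)) + zlib.compress(values))
untouched = lazy.is_decompressed   # nothing has been decompressed yet
first = lazy[0]                    # triggers the one-time decompression
```

This only defers the cost: the first access to any element still
decompresses the entire buffer, which matches your guess about random
access.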

> -Jacob
>
> On Tue, Sep 15, 2020 at 6:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > We have protocol-level compression for message body buffers [1][2]
> > with LZ4 or ZSTD
> >
> > In-memory compression and encoding other than dictionary encoding
> > (like RLE) has been discussed multiple times and remains on the
> > roadmap for the project.
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L45
> >
> > On Tue, Sep 15, 2020 at 7:18 PM Jacob Quinn <quinn.jac...@gmail.com>
> > wrote:
> > >
> > > Am I correct in understanding there's nothing in the arrow ipc/file
> > > format spec about compression? I thought I had seen something at one
> > > point, but looking over the spec website, I don't see anything.
> > >
> > > -Jacob
> >
