On Tue, Sep 15, 2020 at 7:46 PM Jacob Quinn <quinn.jac...@gmail.com> wrote:
>
> Ah, that's where it was.
>
> Ok, so if I understand correctly, individual buffers are compressed, and in
> the Buffer struct, the buffer length is the _compressed_ length? And when
> written, the _uncompressed_ length is first written in 8 bytes, then the
> compressed buffer?

The buffer length in the metadata is the length of the full payload,
including the 8-byte length prefix. You are correct on the second point.

> What's the general strategy for dealing with compressed buffers? Uncompress
> the whole thing when deserializing a compressed buffer? Or is decompressing
> delayed until individual elements are accessed? I'm guessing the former
> since it doesn't seem like you'd be able to do random-access into a
> compressed buffer?

It depends on the implementation, of course, but in the C++ library we
decompress everything at IPC reconstruction time. You could develop a
RecordBatch-compatible interface that decompresses lazily if you wanted.

> -Jacob
>
> On Tue, Sep 15, 2020 at 6:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > We have protocol-level compression for message body buffers [1][2]
> > with LZ4 or ZSTD
> >
> > In-memory compression and encoding other than dictionary encoding
> > (like RLE) has been discussed multiple times and remains on the
> > roadmap for the project.
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Message.fbs#L45
> >
> > On Tue, Sep 15, 2020 at 7:18 PM Jacob Quinn <quinn.jac...@gmail.com>
> > wrote:
> > >
> > > Am I correct in understanding there's nothing in the arrow ipc/file format
> > > spec about compression? I thought I had seen something at one point, but
> > > looking over the spec website, I don't see anything.
> > >
> > > -Jacob
> >
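
For anyone following along, here is a rough Python sketch of the buffer layout described in the reply above: the uncompressed length in the first 8 bytes, then the compressed bytes, with the metadata length covering both. It is only an illustration under a few assumptions. zlib stands in for LZ4/ZSTD so the example runs with the standard library alone, the helper names are made up, and the prefix is assumed to be a little-endian signed 64-bit integer (the thread itself only says "8 bytes"). It also decompresses eagerly, as the C++ library is said to do.

```python
import struct
import zlib

# Assumption: the 8-byte prefix is a little-endian signed 64-bit integer.
LENGTH_PREFIX = struct.Struct("<q")


def write_compressed_buffer(raw: bytes, compress=zlib.compress) -> bytes:
    """Hypothetical helper: uncompressed length prefix, then compressed bytes."""
    return LENGTH_PREFIX.pack(len(raw)) + compress(raw)


def read_compressed_buffer(payload: bytes, decompress=zlib.decompress) -> bytes:
    """Hypothetical helper: eagerly decompress one whole buffer payload."""
    (uncompressed_len,) = LENGTH_PREFIX.unpack_from(payload, 0)
    raw = decompress(payload[LENGTH_PREFIX.size:])
    if len(raw) != uncompressed_len:
        raise ValueError("decompressed size does not match the length prefix")
    return raw


raw = bytes(range(256)) * 16          # 4096 bytes of sample data
payload = write_compressed_buffer(raw)

# The length recorded in the metadata would be the full payload length,
# i.e. the 8-byte prefix plus the compressed bytes.
metadata_length = len(payload)
restored = read_compressed_buffer(payload)
```

A lazy reader could keep `payload` around and call `read_compressed_buffer` only on first access to a column, at the cost of decompressing the whole buffer at that point, since random access into the compressed bytes is not possible.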