@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme.
For some encoding schemes, it is not a good idea to use them for in-memory
data compression. For others, it is beneficial to operate directly on the
compressed data. For example, it is beneficial to work directly on RLE data,
with better locality and fewer cache misses (a short sketch follows below the
quoted message).

Best,
Liya Fan

On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou <anto...@python.org> wrote:
>
> On 12/07/2019 at 10:08, Micah Kornfield wrote:
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings. As a reminder, the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2. Compression at the buffer level
> > 3. Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
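To illustrate the RLE point, here is a minimal sketch in plain Java (not the
Arrow API; the class and variable names are hypothetical) of summing an
RLE-encoded integer column directly on the compressed representation:

    // Minimal sketch: sum an RLE-encoded int column without decompressing it.
    public class RleSumExample {
        public static void main(String[] args) {
            // Runs encoding the logical array [7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 3, 3]
            int[] runValues = {7, 0, 3};   // value of each run
            int[] runLengths = {4, 6, 2};  // length of each run

            long sum = 0;
            // One multiply-add per run instead of one add per logical element:
            // 3 entries touched instead of 12, which is where the locality
            // win and the reduced cache misses come from.
            for (int i = 0; i < runValues.length; i++) {
                sum += (long) runValues[i] * runLengths[i];
            }
            System.out.println("sum = " + sum);  // prints "sum = 34"
        }
    }

A kernel like this reads one entry per run rather than one per logical
element, so the benefit grows with the average run length.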