@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme.
For some encoding schemes, it is not a good idea to use them for in-memory
data compression. For others, it is beneficial to operate directly on the
compressed data. For example, it is beneficial to work directly on RLE data,
with better locality and fewer cache misses (a short sketch follows below the
quoted message).

Best,
Liya Fan

On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou <anto...@python.org> wrote:
>
> On 12/07/2019 at 10:08, Micah Kornfield wrote:
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings. As a reminder, the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1. Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2. Compression at the buffer level
> > 3. Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
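To illustrate the RLE point, here is a minimal sketch in plain Java (not the
Arrow API; the class and variable names are hypothetical) of summing an
RLE-encoded integer column directly on the compressed representation:

    // Minimal sketch: sum an RLE-encoded int column without decompressing it.
    public class RleSumExample {
        public static void main(String[] args) {
            // Runs encoding the logical array [7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 3, 3]
            int[] runValues = {7, 0, 3};   // value of each run
            int[] runLengths = {4, 6, 2};  // length of each run

            long sum = 0;
            // One multiply-add per run instead of one add per logical element:
            // 3 entries touched instead of 12, which is where the locality
            // win and the reduced cache misses come from.
            for (int i = 0; i < runValues.length; i++) {
                sum += (long) runValues[i] * runLengths[i];
            }
            System.out.println("sum = " + sum);  // prints "sum = 34"
        }
    }

A kernel like this reads one entry per run rather than one per logical
element, so the benefit grows with the average run length.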