Hi Ippokratis,
Thank you for the feedback; I have some questions based on the links you
provided.


> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf


This paper seems to suggest more complex encodings than I was imagining
for the first implementation.  Specifically, I proposed using only codes
that are 2^N bits wide (8, 16, 32, and 64).  Do you think it is critical
to have the dense bit-packing in an initial version?
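
To make that concrete, here is a rough sketch in plain C++ of what I mean
by frame-of-reference restricted to power-of-two widths, shown for the
8-bit case (the For8/EncodeFor8 names are made up for illustration and
are not from the proposal):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Frame-of-reference where deltas are stored in the smallest
// power-of-two width (8/16/32/64 bits) rather than densely bit-packed.
struct For8 {
  int64_t reference;            // minimum value of the batch
  std::vector<uint8_t> deltas;  // value[i] = reference + deltas[i]
};

// Succeeds only if every delta fits in 8 bits; a real implementation
// would fall back to 16-, 32-, or 64-bit deltas.
bool EncodeFor8(const std::vector<int64_t>& values, For8* out) {
  if (values.empty()) return false;
  out->reference = *std::min_element(values.begin(), values.end());
  out->deltas.clear();
  for (int64_t v : values) {
    uint64_t delta = static_cast<uint64_t>(v - out->reference);
    if (delta > UINT8_MAX) return false;  // needs a wider code
    out->deltas.push_back(static_cast<uint8_t>(delta));
  }
  return true;
}

// Element access stays O(1) and is a single add; dense bit-packing
// would instead need shift/mask logic across byte boundaries.
int64_t For8Get(const For8& enc, std::size_t i) {
  return enc.reference + enc.deltas[i];
}

My thinking is that keeping widths at byte boundaries preserves simple
random access while still capturing most of the space savings.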

>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .

The IPC spec [1] already has some metadata about the ordering of
dictionaries, but this might not be sufficient.  The paper linked here
seems to recommend two things (a small sketch follows the list):
1.  Treating dictionaries as explicit mappings between value and integer
code.  Today the mapping is implicit because the dictionaries are lists
indexed by code.  It seems like, for forward compatibility, we should add
a type enum to the DictionaryEncoding metadata.
2.  Adding indexes to the dictionaries.  For this, did you imagine the
indexes would be transferred, or built up on receiving batches?
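
For context on why order preservation pays off, here is a small
self-contained C++ sketch (the names are hypothetical, not from the spec
or the paper) of evaluating a string range predicate directly on integer
codes, which works whenever codes are assigned in sorted order of the
dictionary values:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Order-preserving dictionary: code i maps to sorted_values[i], so
// comparing codes is equivalent to comparing the underlying strings.
struct OrderPreservingDict {
  std::vector<std::string> sorted_values;

  // Smallest code whose value is >= s.
  int32_t LowerBoundCode(const std::string& s) const {
    auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(), s);
    return static_cast<int32_t>(it - sorted_values.begin());
  }
};

// Evaluate `lo <= value < hi` on the codes alone, never touching the
// string data -- the kind of vectorizable predicate the paper describes.
std::vector<bool> RangePredicate(const OrderPreservingDict& dict,
                                 const std::vector<int32_t>& codes,
                                 const std::string& lo,
                                 const std::string& hi) {
  int32_t lo_code = dict.LowerBoundCode(lo);
  int32_t hi_code = dict.LowerBoundCode(hi);
  std::vector<bool> out(codes.size());
  for (std::size_t i = 0; i < codes.size(); ++i) {
    out[i] = codes[i] >= lo_code && codes[i] < hi_code;
  }
  return out;
}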

> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

The main take-away I got after skimming this paper, as it relates to
encodings, is that encodings (including dictionary) should be dynamic per
batch.  The other interesting question it raises with respect to Arrow is
that one of the techniques used is delta encoding.  I believe delta
encoding requires linear-time access, whereas the dense representations in
Arrow were designed to have constant-time access to elements.  One open
question is how far we want to relax this requirement for encoded columns.
My proposal uses a form of RLE that provides O(log N) access.
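
Concretely, the variant I have in mind stores cumulative run-end offsets,
so random access is a binary search over the runs.  A minimal C++ sketch
(hypothetical names, not the proposal's exact layout):

#include <algorithm>
#include <cstdint>
#include <vector>

// RLE with cumulative end offsets: the logical array {5, 5, 5, 7, 9, 9}
// becomes run_ends = {3, 4, 6}, values = {5, 7, 9}.
struct RleArray {
  std::vector<int64_t> run_ends;  // exclusive end offset of each run
  std::vector<int64_t> values;    // value carried by each run
};

// Element i lives in the first run whose end offset is > i, found by
// binary search -- O(log R) for R runs.  Assumes 0 <= i < logical length.
int64_t RleGet(const RleArray& arr, int64_t i) {
  auto it = std::upper_bound(arr.run_ends.begin(), arr.run_ends.end(), i);
  return arr.values[it - arr.run_ends.begin()];
}

So sequential scans stay cheap, and point lookups degrade only
logarithmically rather than linearly as with delta encoding.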

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285

On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis <ippokra...@gmail.com>
wrote:

> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf
>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .
>
> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> Thanks,
> -Ippokratis.
>
>
> On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.
>>
>>
>> This is another legitimate concern about complexity.
>>
>> To try to limit complexity, I simplified the proposal PR [1] to have
>> only one buffer encoding scheme (FrameOfReferenceIntEncoding) and one
>> array encoding scheme (RLE), the two I think will have the most benefit
>> if exploited properly.  Compression is removed.
>>
>> I'd like to get closure on the proposal one way or another.  I think the
>> question to be answered now is whether we are willing to introduce the
>> additional complexity for the performance improvements it can yield.  Is
>> there more data people would like to see that would influence their
>> decision?
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/4815
>>
>> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <solip...@pitrou.net>
>> wrote:
>>
>> > On Mon, 22 Jul 2019 08:40:08 -0700
>> > Brian Hulette <hulet...@gmail.com> wrote:
>> > > To me, the most important aspect of this proposal is the addition of
>> > sparse
>> > > encodings, and I'm curious if there are any more objections to that
>> > > specifically. So far I believe the only one is that it will make
>> > > computation libraries more complicated. This is absolutely true, but I
>> > > think it's worth that cost.
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> >
>>
>
>
> --
> -Ippokratis.
>
>
