Perhaps related to this thread, are there any current or proposed tools to
transform columns of fixed-length data types according to a "shuffle"?
For precedent, see the implementation of the shuffle filter in HDF5:
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf

For example, a column of length 3 would store the bytes 00 00 00 00 00 00
00 00 00 01 02 03 to represent the three 32-bit integers 00 00 00 01,
00 00 00 02, and 00 00 00 03 (written big-endian here even if that is not
actually the case).

Value(1) would return 00 00 00 02 by consulting a metadata flag saying that
the column is shuffled and stitching the bytes back together at call time.
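
A rough Python sketch of both directions (helper names hypothetical, not
an existing API):

    import struct

    def shuffle(values, width=4):
        # Pack values big-endian, then regroup byte 0 of every value,
        # byte 1 of every value, etc. (the hdf5 shuffle transform).
        raw = b"".join(struct.pack(">I", v) for v in values)
        return b"".join(raw[i::width] for i in range(width))

    def value_at(shuffled, index, width=4):
        # Stitch one value back together at call time: byte plane i
        # starts at offset i * n, so a value is one byte from each plane.
        n = len(shuffled) // width
        return bytes(shuffled[i * n + index] for i in range(width))

    buf = shuffle([1, 2, 3])
    assert buf == bytes.fromhex("000000000000000000010203")
    assert value_at(buf, 1) == bytes.fromhex("00000002")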

Thus, if the column pages were backed by a memory map to something like
zfs/gzip-9 (my actual use case), one would expect approximately 30% savings
in underlying disk usage due to better run lengths.
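
One can sanity-check the savings claim by comparing zlib output on plain
vs. shuffled buffers (gzip-9 standing in for the zfs setting; the exact
ratio will of course depend on the data):

    import struct, zlib

    # Small increasing 32-bit values: the high bytes are mostly zeros, so
    # shuffling groups them into long runs that deflate compresses well.
    values = range(100000)
    plain = b"".join(struct.pack(">I", v) for v in values)
    shuffled = b"".join(plain[i::4] for i in range(4))
    print(len(zlib.compress(plain, 9)), len(zlib.compress(shuffled, 9)))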

This would enable a space/time tradeoff that could be useful.  The
filesystem itself cannot easily perform this particular compression
transform, since the transform benefits from knowing the shape of the data.

-John

On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:
>
> Hi Ippokratis,
> Thank you for the feedback, I have some questions based on the links you
> provided.
>
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
> > implementations of those in commercial systems. One related paper in the
> > literature is http://www.cs.columbia.edu/~orestis/damon15.pdf
>
>
> This paper seems to suggest more complex encodings than I was imagining
> for the first implementation.  Specifically, I proposed using only codes
> that are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have
> the dense bit-packing in an initial version?
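>
> For concreteness, a rough sketch of the byte-aligned frame-of-reference
> encoding I had in mind (illustrative only, not the PR's exact layout):
>
>     import numpy as np
>
>     def frame_of_reference(arr):
>         # Subtract the minimum (the "frame of reference"), then round
>         # the code width up to the next power-of-two byte size instead
>         # of bit-packing to arbitrary widths.
>         ref = int(arr.min())
>         deltas = arr - ref
>         for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
>             if int(deltas.max()) <= np.iinfo(dtype).max:
>                 return ref, deltas.astype(dtype)
>
>     ref, codes = frame_of_reference(np.array([1000, 1003, 1001]))
>     # ref == 1000, codes are uint8 [0, 3, 1]: one byte per value.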
>
> >
> > I would actually also look into some order-preserving dictionary
> > encodings for strings that also allow vectorized processing (predicates,
> > joins, ...) on encoded data, e.g. see
> > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>
> The IPC spec [1] already has some metadata about the ordering of
> dictionaries, but this might not be sufficient.  The paper linked here
> seems to recommend two things:
> 1.  Treating dictionaries as explicit mappings between value and integer
> code; today the mapping is implicit because the dictionaries are lists
> indexed by code.  It seems like, for forward compatibility, we should add
> a type enum to the Dictionary Encoding metadata.
> 2.  Adding indexes to the dictionaries.  For this, did you imagine the
> indexes would be transferred or something built up on receiving batches?
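>
> Roughly, the distinction I mean in point 1 (illustrative only):
>
>     # Today (implicit): the code is just the position in the list.
>     dictionary = ["apple", "banana", "cherry"]  # code i -> dictionary[i]
>
>     # Explicit mapping: codes can be assigned to preserve sort order, so
>     # comparisons (and range predicates) work directly on integer codes.
>     encoding = {"apple": 10, "banana": 20, "cherry": 30}
>     assert encoding["apple"] < encoding["cherry"]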
>
> > Arrow can be used during shuffles for distributed joins/aggs, and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> The main take-away I got after skimming this paper, as it relates to
> encodings, is that encodings (including dictionary) should be dynamic per
> batch.  The other interesting point it raises with respect to Arrow is
> that one of the techniques used is delta encoding.  I believe delta
> encoding requires linear-time access.  The dense representations in Arrow
> were designed to have constant-time access to elements.  One open question
> is how far we want to relax this requirement for encoded columns.  My
> proposal uses a form of RLE that provides O(log(N)) access.
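>
> For concreteness, the kind of lookup I mean (a sketch, not the PR's
> exact layout):
>
>     from bisect import bisect_right
>
>     # RLE storing cumulative run ends instead of run lengths:
>     # values   ["a", "b", "c"]
>     # run_ends [3, 5, 9]   ->  a a a b b c c c c
>     def rle_value(values, run_ends, i):
>         # Binary search over the run ends gives O(log N) random access,
>         # versus O(N) scanning for (value, length) pairs.
>         return values[bisect_right(run_ends, i)]
>
>     assert rle_value(["a", "b", "c"], [3, 5, 9], 4) == "b"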
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285
>
> On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis <ippokra...@gmail.com>
> wrote:
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
> > implementations of those in commercial systems. One related paper in the
> > literature is http://www.cs.columbia.edu/~orestis/damon15.pdf
> >
> > I would actually also look into some order-preserving dictionary
> > encodings for strings that also allow vectorized processing (predicates,
> > joins, ...) on encoded data, e.g. see
> > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >
> > Arrow can be used during shuffles for distributed joins/aggs, and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
> >
> > Thanks,
> > -Ippokratis.
> >
> >
> > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes
> >> > it easy and non-intimidating to build data processing utilities
> >> > around them.  If we start adding sophisticated encodings, we also
> >> > raise the cost of supporting Arrow for third-party libraries.
> >>
> >>
> >> This is another legitimate concern about complexity.
> >>
> >> To try to limit complexity, I simplified the proposal PR [1] to only
> >> one buffer encoding scheme (FrameOfReferenceIntEncoding) and one array
> >> encoding scheme (RLE), the two I think will have the most benefit if
> >> exploited properly.  Compression is removed.
> >>
> >> I'd like to get closure on the proposal one way or another.  I think
> >> the question to be answered now is whether we are willing to introduce
> >> the additional complexity for the performance improvements these
> >> encodings can yield.  Is there more data people would like to see that
> >> would influence their decision?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/4815
> >>
> >> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <solip...@pitrou.net>
> >> wrote:
> >>
> >> > On Mon, 22 Jul 2019 08:40:08 -0700
> >> > Brian Hulette <hulet...@gmail.com> wrote:
> >> > > To me, the most important aspect of this proposal is the addition
> >> > > of sparse encodings, and I'm curious if there are any more
> >> > > objections to that specifically. So far I believe the only one is
> >> > > that it will make computation libraries more complicated. This is
> >> > > absolutely true, but I think it's worth that cost.
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes
> >> > it easy and non-intimidating to build data processing utilities
> >> > around them.  If we start adding sophisticated encodings, we also
> >> > raise the cost of supporting Arrow for third-party libraries.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> >
> >>
> >
> >
> > --
> > -Ippokratis.
> >
> >
