Wes, what do you think about Arrow supporting a new suite of fixed-length
data types that unshuffle on column->Value(i) calls?  This would allow
memory/swap compressors and memory maps backed by compressing
filesystems (ZFS) or block devices (VDO) to operate more efficiently.

By doing it with new data types, there would be no separate flag to check?
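
A rough sketch of what I have in mind (purely illustrative -- this is not
an existing Arrow type, and the class and member names are made up),
showing a shuffled fixed-width column reassembling a value at Value(i)
call time:

    // Assumes the values buffer holds byte-shuffled (byte-transposed)
    // data: all 0th bytes of every value, then all 1st bytes, and so on.
    #include <cstdint>
    #include <cstring>

    class ShuffledInt32Array {
     public:
      ShuffledInt32Array(const uint8_t* shuffled_data, int64_t length)
          : data_(shuffled_data), length_(length) {}

      // Reassemble value i by gathering one byte from each byte plane.
      int32_t Value(int64_t i) const {
        uint8_t bytes[4];
        for (int b = 0; b < 4; ++b) {
          bytes[b] = data_[b * length_ + i];  // plane b, element i
        }
        int32_t out;
        std::memcpy(&out, bytes, sizeof(out));  // native byte order assumed
        return out;
      }

     private:
      const uint8_t* data_;
      int64_t length_;
    };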

On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > Again, I know very little about Parquet, so your patience is appreciated.
> >
> > At the moment I can Arrow/mmap a file without having anywhere near as
> > much available memory as the file size.  I can visit random places in
> > the file (such as a binary search if it is ordered) and only the
> > locations visited by column->Value(i) are paged in.  Paging them out
> > happens without my awareness, if necessary.
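
(For reference, a minimal sketch of the mmap-and-random-access pattern
described above, using the Arrow C++ IPC APIs; the file name and column
type here are assumptions:)

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>
    #include <iostream>
    #include <memory>

    arrow::Status RandomAccess() {
      // Map the IPC file into the address space (no data copied).
      ARROW_ASSIGN_OR_RAISE(
          auto mmap, arrow::io::MemoryMappedFile::Open(
                         "data.arrow", arrow::io::FileMode::READ));
      // Reads the footer/metadata; batch buffers are zero-copy slices
      // of the mapping.
      ARROW_ASSIGN_OR_RAISE(
          auto reader, arrow::ipc::RecordBatchFileReader::Open(mmap));
      ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
      // Only pages actually touched by Value() are faulted in by the OS.
      auto col = std::static_pointer_cast<arrow::Int64Array>(batch->column(0));
      std::cout << col->Value(col->length() / 2) << std::endl;
      return arrow::Status::OK();
    }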
> >
> > Does Parquet cover this use-case with the same elegance and at least
> > equal efficiency, or are there more copies/conversions?  Perhaps it
> > requires the entire file to be transformed into Arrow memory at the
> > beginning?  Or on a batch/block basis?  Or, to get this, do I need to
> > use a non-Arrow API for data element access?  Etc.
>
> Data has to be materialized / deserialized from the Parquet file on a
> batch-wise, per-column basis. The APIs we provide allow batches of
> values to be read for a given subset of columns.
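
(Roughly, with the parquet::arrow C++ API; the file name and the row
group / column indices below are placeholder assumptions:)

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    void ReadSomeColumns() {
      std::shared_ptr<arrow::io::ReadableFile> infile;
      PARQUET_ASSIGN_OR_THROW(infile,
                              arrow::io::ReadableFile::Open("data.parquet"));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));
      // Decode row group 0, columns 0 and 2 only; the result is newly
      // allocated Arrow memory (materialized), not a view of the file.
      std::shared_ptr<arrow::Table> table;
      PARQUET_THROW_NOT_OK(reader->ReadRowGroup(0, {0, 2}, &table));
    }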
>
> >
> > IFF it covers the above use-case, which does not mention compression
> > or encoding, then I could consider whether it is interesting on those
> > points.
>
> My point really has to do with Parquet's design, which is about
> reducing file size. In the following blog post
>
> https://ursalabs.org/blog/2019-10-columnar-perf/
>
> I examined a dataset which is about 4 GB as a raw Arrow stream/file but
> only 114 MB as a Parquet file. A 30x+ compression ratio is a huge deal
> if you are working with filesystems that yield < 500 MB/s (which
> includes pretty much all cloud filesystems AFAIK). In clickstream
> analytics this kind of compression ratio is not unusual.
>
> >
> > -John
> >
> > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > What's the point of having zero copy if the OS is doing the
> > > decompression in the kernel (which negates the zero-copy argument)?
> > > You might as well just use Parquet without filesystem compression. I
> > > would prefer a compression algorithm that the columnar engine can
> > > benefit from [1] over marginally improving a filesystem/OS-specific
> > > feature.
> > >
> > > François
> > >
> > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > >
> > >
> > >
> > >
> > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <j...@jgm.org> wrote:
> > > >
> > > > This could also have utility in memory via things like zram/zswap,
> > > > right?  Mac also has a memory compressor?
> > > >
> > > > I don't think Parquet is an option for me unless the integration
> > > > with Arrow is tighter than I imagine (i.e. zero-copy).  That said,
> > > > I confess I know next to nothing about Parquet.
> > > >
> > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <anto...@python.org>
> > > > wrote:
> > > > >
> > > > >
> > > > > On 23/01/2020 at 18:16, John Muehlhausen wrote:
> > > > > > Perhaps related to this thread, are there any current or
> > > > > > proposed tools to transform columns for fixed-length data
> > > > > > types according to a "shuffle"?  For precedent see the
> > > > > > implementation of the shuffle filter in HDF5:
> > > > > >
> > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > >
> > > > > > For example, the column (length 3) would store bytes
> > > > > > 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three
> > > > > > 32-bit numbers 00 00 00 01, 00 00 00 02, 00 00 00 03 (I'm
> > > > > > writing big-endian even if that is not actually the case).
> > > > > >
> > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> > > > > > flag indicating that the column is shuffled, stitching the bytes
> > > > > > back together at call time.
> > > > > >
> > > > > > Thus if the column pages were backed by a memory map to
> > > > > > something like ZFS/gzip-9 (my actual use-case), one would expect
> > > > > > approx. 30% savings in underlying disk usage due to better run
> > > > > > lengths.
> > > > > >
> > > > > > It would enable a space/time tradeoff that could be useful?  The
> > > > > > filesystem itself cannot easily do this particular compression
> > > > > > transform since it benefits from knowing the shape of the data.
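
(To make the transform concrete, a sketch of the shuffle itself -- the
same byte transposition the HDF5 shuffle filter performs; this helper is
illustrative only, not an Arrow or Parquet API:)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Transpose bytes so that byte b of every value is stored
    // contiguously, which tends to create long runs for a downstream
    // compressor (ZFS/gzip, zram, etc.) to exploit.
    std::vector<uint8_t> Shuffle(const uint8_t* src, size_t num_values,
                                 size_t value_size) {
      std::vector<uint8_t> dst(num_values * value_size);
      for (size_t i = 0; i < num_values; ++i) {
        for (size_t b = 0; b < value_size; ++b) {
          dst[b * num_values + i] = src[i * value_size + b];
        }
      }
      return dst;
    }

    // Unshuffle is the inverse transposition (swap the two index
    // expressions above).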
> > > > >
> > > > > For the record, there's a pull request adding this encoding to the
> > > > > Parquet C++ specification.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > >
>
