On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <j...@jgm.org> wrote:
>
> Again, I know very little about Parquet, so your patience is appreciated.
>
> At the moment I can Arrow/mmap a file without having anywhere near as
> much available memory as the file size. I can visit a random place in
> the file (such as in a binary search, if it is ordered) and only the
> locations visited by column->Value(i) are paged in. Paging them out
> happens without my awareness, if necessary.
>
> Does Parquet cover this use case with the same elegance and at least
> equal efficiency, or are there more copies/conversions? Perhaps it
> requires the entire file to be transformed into Arrow memory at the
> beginning? Or on a batch/block basis? Or, to get this, do I need to use
> a non-Arrow API for data element access? Etc.
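A minimal sketch of the access pattern John describes, using the Arrow C++
IPC file APIs (the file name, batch index, and column type are illustrative
assumptions; error handling is elided):

  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <arrow/ipc/reader.h>

  arrow::Status ReadOneValue() {
    // Map the file; nothing is read until pages are actually touched.
    ARROW_ASSIGN_OR_RAISE(
        auto mmap, arrow::io::MemoryMappedFile::Open(
                       "data.arrow", arrow::io::FileMode::READ));
    ARROW_ASSIGN_OR_RAISE(
        auto reader, arrow::ipc::RecordBatchFileReader::Open(mmap));
    // Loads only the metadata needed to locate the batch's buffers;
    // the buffers themselves are zero-copy slices of the mapped region.
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
    auto col =
        std::static_pointer_cast<arrow::Int64Array>(batch->column(0));
    // Only the page(s) holding element 42 are faulted in here.
    int64_t value = col->Value(42);
    (void)value;
    return arrow::Status::OK();
  }

Because the IPC file format stores buffers uncompressed in their in-memory
layout, Value(i) reads straight from the mapped region and the OS pages
data in and out on demand.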
Data has to be materialized / deserialized from the Parquet file on a
batch-wise, per-column basis. The APIs we provide allow batches of values
to be read for a given subset of columns.

> IFF it covers the above use case, which does not mention compression or
> encoding, then I could consider whether it is interesting on those
> points.

My point really has to do with Parquet's design, which is about reducing
file size. In the following blog post

https://ursalabs.org/blog/2019-10-columnar-perf/

I examined a dataset which is about 4 GB as a raw Arrow stream/file but
only 114 MB as a Parquet file. A 30x+ compression ratio is a huge deal if
you are working with filesystems that yield < 500 MB/s (which includes
pretty much all cloud filesystems, AFAIK). In clickstream analytics this
kind of compression ratio is not unusual.

> -John
>
> On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > What's the point of having zero copy if the OS is doing the
> > decompression in the kernel (which negates the zero-copy argument)?
> > You might as well just use Parquet without filesystem compression. I
> > would rather have a compression algorithm the columnar engine can
> > benefit from [1] than marginally improve a filesystem/OS-specific
> > feature.
> >
> > François
> >
> > [1] Section 4.3, http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> >
> > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > This could also have utility in memory via things like zram/zswap,
> > > right? macOS also has a memory compressor?
> > >
> > > I don't think Parquet is an option for me unless the integration
> > > with Arrow is tighter than I imagine (i.e. zero-copy). That said, I
> > > confess I know next to nothing about Parquet.
> > >
> > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > > On 23/01/2020 at 18:16, John Muehlhausen wrote:
> > > > > Perhaps related to this thread, are there any current or
> > > > > proposed tools to transform columns of fixed-length data types
> > > > > according to a "shuffle"? For precedent, see the implementation
> > > > > of the shuffle filter in HDF5:
> > > > >
> > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > >
> > > > > For example, a column of length 3 would store the bytes
> > > > > 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three
> > > > > 32-bit numbers 00 00 00 01, 00 00 00 02, 00 00 00 03 (I'm
> > > > > writing big-endian even if that is not actually the case).
> > > > >
> > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> > > > > flag indicating that the column is shuffled, stitching the bytes
> > > > > back together at call time.
> > > > >
> > > > > Thus, if the column pages were backed by a memory map to
> > > > > something like zfs/gzip-9 (my actual use case), one would expect
> > > > > approx. 30% savings in underlying disk usage due to better run
> > > > > lengths.
> > > > >
> > > > > It would enable a space/time tradeoff that could be useful. The
> > > > > filesystem itself cannot easily do this particular compression
> > > > > transform, since it benefits from knowing the shape of the data.
> > > >
> > > > For the record, there's a pull request adding this encoding to the
> > > > Parquet C++ implementation.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
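To make the batch-wise, per-column materialization concrete, a sketch
using the parquet::arrow reader (the file name and column indices are
assumptions; error handling is elided):

  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <parquet/arrow/reader.h>

  arrow::Status ReadColumnSubset() {
    ARROW_ASSIGN_OR_RAISE(
        auto infile, arrow::io::ReadableFile::Open("data.parquet"));
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
        infile, arrow::default_memory_pool(), &reader));
    // Read only columns 0 and 3. Unlike the mmap case above, the
    // selected columns are decompressed and decoded batch by batch
    // into freshly allocated Arrow memory.
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadTable({0, 3}, &table));
    return arrow::Status::OK();
  }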
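And a sketch of the byte shuffle John describes (a hand-rolled
illustration, not code from HDF5 or from the pull request Antoine
mentions; the names are made up):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Shuffle: byte b of value i moves to position b * n_values + i, so
  // all the (often constant) high-order bytes form long runs.
  std::vector<uint8_t> ByteShuffle(const uint8_t* src, size_t n_values,
                                   size_t width) {
    std::vector<uint8_t> out(n_values * width);
    for (size_t i = 0; i < n_values; ++i)
      for (size_t b = 0; b < width; ++b)
        out[b * n_values + i] = src[i * width + b];
    return out;
  }

  // Value(i) for a shuffled uint32 column: gather the value's bytes
  // from the four byte planes (assumes little-endian storage).
  uint32_t ShuffledValue(const uint8_t* shuffled, size_t n_values,
                         size_t i) {
    uint32_t v = 0;
    for (size_t b = 0; b < 4; ++b)
      v |= static_cast<uint32_t>(shuffled[b * n_values + i]) << (8 * b);
    return v;
  }

Grouping the high-order bytes, which are often zero or slowly varying,
produces the longer runs that a filesystem compressor like zfs/gzip-9 can
exploit; the cost is the per-element gather in ShuffledValue, which is the
space/time tradeoff mentioned above.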