Re: [DISCUSS] Format additions for encoding/compression

Antoine Pitrou Thu, 23 Jan 2020 09:24:20 -0800


Forgot to give the URL:
https://github.com/apache/arrow/pull/6005


Regards

Antoine.


On 23/01/2020 at 18:23, Antoine Pitrou wrote:
> 
> On 23/01/2020 at 18:16, John Muehlhausen wrote:
>> Perhaps related to this thread: are there any current or proposed tools to
>> transform columns of fixed-length data types according to a "shuffle"?
>> For precedent, see the implementation of the shuffle filter in HDF5:
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>>
>> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
>> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
>> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
>> case).
>>
>> Value(1) would return 00 00 00 02 by referring to some metadata flag that
>> the column is shuffled, stitching the bytes back together at call time.
>>
>> Thus if the column pages were backed by a memory map to something like
>> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
>> underlying disk usage due to better run lengths.
>>
>> This would enable a space/time tradeoff that could be useful. The filesystem
>> itself cannot easily apply this particular transform, since the transform
>> depends on knowing the shape of the data (the width of each value).
> 
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
> 
> Regards
> 
> Antoine.
>
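
To make the byte layout described in the quoted message concrete, here is a
minimal Python sketch of such a shuffle. It is an illustration only, not the
code from the pull request and not HDF5's implementation; the names shuffle,
unshuffle, value_at and compressed_sizes are hypothetical. It assumes
fixed-width values stored back to back, and uses zlib at level 9 merely as a
stand-in for the zfs/gzip-9 setup mentioned above.

```python
import zlib


def shuffle(data: bytes, width: int) -> bytes:
    """Group byte 0 of every value, then byte 1 of every value, and so on."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    return b"".join(data[k::width] for k in range(width))


def unshuffle(data: bytes, width: int) -> bytes:
    """Inverse of shuffle(): interleave the byte streams back into values."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    n = len(data) // width  # number of values in the column
    out = bytearray(len(data))
    for k in range(width):
        out[k::width] = data[k * n:(k + 1) * n]
    return bytes(out)


def value_at(shuffled: bytes, width: int, i: int) -> bytes:
    """Stitch value i back together without unshuffling the whole column."""
    n = len(shuffled) // width
    return bytes(shuffled[k * n + i] for k in range(width))


def compressed_sizes(data: bytes, width: int) -> tuple[int, int]:
    """Return (plain, shuffled) zlib level-9 sizes; the ratio is data-dependent."""
    return (len(zlib.compress(data, 9)),
            len(zlib.compress(shuffle(data, width), 9)))


# The three big-endian 32-bit numbers 1, 2, 3 from the example above.
column = b"".join(v.to_bytes(4, "big") for v in (1, 2, 3))
stored = shuffle(column, 4)
```

With this input, stored holds nine zero bytes followed by 01 02 03, and
value_at(stored, 4, 1) reassembles 00 00 00 02, matching the Value(1) example.
Any space saving depends on the data, so compressed_sizes() only reports the
two sizes and makes no claim about the ratio.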
