Forgot to give the URL: https://github.com/apache/arrow/pull/6005

Regards

Antoine.


On 23/01/2020 at 18:23, Antoine Pitrou wrote:
>
> On 23/01/2020 at 18:16, John Muehlhausen wrote:
>> Perhaps related to this thread, are there any current or proposed tools to
>> transform columns for fixed-length data types according to a "shuffle?"
>> For precedent see the implementation of the shuffle filter in hdf5.
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>>
>> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
>> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
>> 02 00 00 00 03 (I'm writing big-endian even if that is not actually the
>> case).
>>
>> Value(1) would return 00 00 00 02 by referring to some metadata flag that
>> the column is shuffled, stitching the bytes back together at call time.
>>
>> Thus if the column pages were backed by a memory map to something like
>> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
>> underlying disk usage due to better run lengths.
>>
>> It would enable a space/time tradeoff that could be useful?  The filesystem
>> itself cannot easily do this particular compression transform since it
>> benefits from knowing the shape of the data.
>
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
>
> Regards
>
> Antoine.
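
For illustration, the byte shuffle described in the quoted message can be sketched in a few lines of Python. This is only a minimal sketch of the idea, not the code in the pull request and not the HDF5 filter; the names shuffle, unshuffle and value are hypothetical and are not part of any Arrow or Parquet API.

```python
def shuffle(data: bytes, width: int) -> bytes:
    """Group byte 0 of every value, then byte 1 of every value, and so on."""
    if width <= 0 or len(data) % width:
        raise ValueError("data length must be a positive multiple of width")
    # data[i::width] picks byte i of each fixed-width value.
    return b"".join(data[i::width] for i in range(width))


def unshuffle(shuffled: bytes, width: int) -> bytes:
    """Inverse of shuffle(): stitch every value back together."""
    if width <= 0 or len(shuffled) % width:
        raise ValueError("data length must be a positive multiple of width")
    count = len(shuffled) // width  # number of values in the column
    return bytes(
        shuffled[b * count + v] for v in range(count) for b in range(width)
    )


def value(shuffled: bytes, width: int, index: int) -> bytes:
    """Read one value from a shuffled column without unshuffling all of it."""
    count = len(shuffled) // width
    if not 0 <= index < count:
        raise IndexError("value index out of range")
    # Byte b of value `index` lives at offset b * count + index.
    return shuffled[index::count]


# The three 32-bit numbers from the quoted example, written big-endian.
column = bytes.fromhex("00000001" "00000002" "00000003")
shuffled = shuffle(column, 4)
```

With the example column from the message, shuffled holds nine 00 bytes followed by 01 02 03, and value(shuffled, 4, 1) stitches 00 00 00 02 back together. The long runs of identical bytes are what a general-purpose compressor such as gzip can then exploit.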