Again, I know very little about Parquet, so your patience is appreciated. At the moment I can mmap an Arrow file without having anywhere near as much available memory as the file is large. I can visit random places in the file (e.g. a binary search if it is ordered), and only the locations touched by column->Value(i) are paged in. Paging them back out happens without my involvement, if memory pressure requires it.
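In rough terms, the pattern I have in mind looks something like the sketch below (Arrow C++ IPC file format; the int64 column type, the path handling, and the minimal error handling are purely illustrative):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

#include <iostream>
#include <memory>
#include <string>

arrow::Status ReadOneValue(const std::string& path, int batch_index, int64_t row) {
  // Map the file into the address space; nothing is actually read yet.
  ARROW_ASSIGN_OR_RAISE(auto mmap,
      arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));

  // Opening the IPC file reader only touches the footer/metadata.
  ARROW_ASSIGN_OR_RAISE(auto reader,
      arrow::ipc::RecordBatchFileReader::Open(mmap));

  // For an uncompressed IPC file the batch's buffers alias the mapping
  // (zero-copy); no bulk read happens here.
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(batch_index));

  // Only the pages backing this slot of column 0 get faulted in.
  auto col = std::static_pointer_cast<arrow::Int64Array>(batch->column(0));
  std::cout << col->Value(row) << std::endl;
  return arrow::Status::OK();
}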
Does Parquet cover this use case with the same elegance and at least equal
efficiency, or are there more copies/conversions? Perhaps it requires the
entire file to be transformed into Arrow memory at the beginning? Or on a
batch/block basis? Or, to get this, do I need to use a non-Arrow API for
data element access? Etc. Only if it covers the above use case, which says
nothing about compression or encoding, would I then consider whether it is
interesting on those points.

-John

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> What's the point of having zero copy if the OS is doing the
> decompression in the kernel (which trumps the zero-copy argument)? You
> might as well just use Parquet without filesystem compression. I would
> rather have a compression algorithm the columnar engine can benefit
> from [1] than marginally improve a filesystem/OS-specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > This could also have utility in memory via things like zram/zswap,
> > right? Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with
> > Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess
> > I know next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed
> > > > tools to transform columns for fixed-length data types according
> > > > to a "shuffle"? For precedent, see the implementation of the
> > > > shuffle filter in HDF5:
> > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (of length 3) would store the bytes
> > > > 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three 32-bit
> > > > numbers 00 00 00 01, 00 00 00 02, 00 00 00 03 (I'm writing
> > > > big-endian even if that is not actually the case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata
> > > > flag indicating that the column is shuffled, stitching the bytes
> > > > back together at call time.
> > > >
> > > > Thus, if the column pages were backed by a memory map onto
> > > > something like zfs/gzip-9 (my actual use case), one would expect
> > > > roughly 30% savings in underlying disk usage due to better run
> > > > lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful. The
> > > > filesystem itself cannot easily do this particular compression
> > > > transform, since the transform benefits from knowing the shape of
> > > > the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.
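P.S. To make the shuffle transform from the quoted message concrete, here is a
rough, library-agnostic sketch (the function names are mine, not any existing
API). The forward pass byte-transposes fixed-width values; an accessor can
stitch a single value back together at call time:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte-transpose ("shuffle") a buffer of fixed-width values, as in the HDF5
// shuffle filter: all first bytes, then all second bytes, and so on.
// width is the element size in bytes (e.g. 4 for a 32-bit integer).
std::vector<uint8_t> Shuffle(const uint8_t* src, size_t n_values, size_t width) {
  std::vector<uint8_t> dst(n_values * width);
  for (size_t i = 0; i < n_values; ++i) {
    for (size_t b = 0; b < width; ++b) {
      dst[b * n_values + i] = src[i * width + b];
    }
  }
  return dst;
}

// Reassemble value i from a shuffled buffer of 32-bit values, which is what a
// Value(i)-style accessor would do if a metadata flag marks the column as
// shuffled.
uint32_t UnshuffleOne32(const uint8_t* shuffled, size_t n_values, size_t i) {
  uint8_t bytes[4];
  for (size_t b = 0; b < 4; ++b) {
    bytes[b] = shuffled[b * n_values + i];
  }
  uint32_t out;
  std::memcpy(&out, bytes, sizeof(out));
  return out;
}

With the example above, shuffling the three big-endian 32-bit values
00 00 00 01, 00 00 00 02, 00 00 00 03 yields
00 00 00 00 00 00 00 00 00 01 02 03, and UnshuffleOne32(shuffled, 3, 1)
rebuilds 00 00 00 02.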