Don't mind me, I'm confused about default settings in different versions.

The ALP stuff in Parquet is interesting, but it adds yet another compression
option, which increases complexity.

Everyone has presumably read "An Empirical Evaluation of Columnar Storage
Formats (Extended Version)"; if not, now is the time:
https://arxiv.org/pdf/2304.05028

Key paragraph on page 9:

 Parquet has faster decoding than ORC
for integer and string columns. As explained in Section 5.2, there
are two main reasons behind this: (1) Parquet relies more on the
fast Bitpacking and applies RLE less aggressively than ORC, and
(2) Parquet has a simpler integer encoding scheme that involves
fewer algorithm options. As shown in Table 6, switching between
the four integer encoding algorithms in ORC generates 3× more
branch mispredictions than Parquet during the decoding process
(done on a similar physical machine to collect the performance
counters). According to the breakdown in Table 7, ORC has 4×
more subsequences to decode than Parquet, and the encoding algorithm
distribution among the subsequences is unfriendly to branch
prediction. Parquet’s decoding-speed advantage over ORC shrinks
for integers compared to strings, indicating a (slight) decoding
overhead due to its additional dictionary layer for integer columns.
Parquet also optimizes the bit-unpacking procedure using SIMD
instructions and code generation to avoid unnecessary branches.


This highlights that *unless you can use CPU vector opcodes, adding more
options can hurt branch prediction and so make overall performance worse*.

It's a good argument for simplicity in compression and encoding choices.
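
To make the dispatch cost concrete, here is a rough sketch of a decoder that
picks an algorithm per subsequence. The encoding tags, payload layouts and
helper names are all hypothetical, not the real Parquet or ORC formats, and
the "switch count" is only a crude stand-in for how unpredictable the
dispatch branch is. Python itself won't show a misprediction penalty; the
point is just the shape of the problem.

```python
# Hypothetical encoding tags; these are NOT the real Parquet or ORC identifiers.
RLE, BITPACK, DELTA, DIRECT = "rle", "bitpack", "delta", "direct"


def decode_run(kind, payload):
    """Decode one subsequence; each branch stands in for a separate decoder."""
    if kind == RLE:
        value, count = payload
        return [value] * count
    if kind == BITPACK:
        return list(payload)  # pretend the bits were already unpacked
    if kind == DELTA:
        base, deltas = payload
        out = [base]
        for d in deltas:
            out.append(out[-1] + d)
        return out
    if kind == DIRECT:
        return list(payload)
    raise ValueError(f"unknown encoding: {kind}")


def decode_column(runs):
    """Decode every subsequence in order and concatenate the values."""
    values = []
    for kind, payload in runs:
        values.extend(decode_run(kind, payload))
    return values


def dispatch_switches(runs):
    """Count how often consecutive subsequences use a different encoding."""
    kinds = [kind for kind, _ in runs]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a != b)


# The same eleven values, encoded with two options and with four options.
two_option = [
    (BITPACK, [1, 2, 3]),
    (BITPACK, [4, 5, 6]),
    (RLE, (7, 3)),
    (BITPACK, [8, 9]),
]
four_option = [
    (DELTA, (1, [1, 1])),
    (DIRECT, [4]),
    (RLE, (5, 1)),
    (DIRECT, [6]),
    (RLE, (7, 3)),
    (DELTA, (8, [1])),
]

print(dispatch_switches(two_option), dispatch_switches(four_option))
```

Both inputs decode to the same column, but the four-option layout has more
subsequences and changes decoder more often between them, which is the
pattern the paper describes as unfriendly to branch prediction.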

