Don't mind me, I'm confused about default settings in different versions.
The ALP stuff in Parquet is interesting, but it adds yet another compression option, which increases complexity. Presumably everyone has read "An Empirical Evaluation of Columnar Storage Formats (Extended Version)"; if not, now is the time: https://arxiv.org/pdf/2304.05028

Key paragraph on page 9:

"Parquet has faster decoding than ORC for integer and string columns. As explained in Section 5.2, there are two main reasons behind this: (1) Parquet relies more on the fast Bitpacking and applies RLE less aggressively than ORC, and (2) Parquet has a simpler integer encoding scheme that involves fewer algorithm options. As shown in Table 6, switching between the four integer encoding algorithms in ORC generates 3× more branch mispredictions than Parquet during the decoding process (done on a similar physical machine to collect the performance counters). According to the breakdown in Table 7, ORC has 4× more subsequences to decode than Parquet, and the encoding algorithm distribution among the subsequences is unfriendly to branch prediction. Parquet’s decoding-speed advantage over ORC shrinks for integers compared to strings, indicating a (slight) decoding overhead due to its additional dictionary layer for integer columns. Parquet also optimizes the bit-unpacking procedure using SIMD instructions and code generation to avoid unnecessary branches."

This highlights that *unless you can use CPU vector opcodes, adding more options can hurt branch prediction and so make overall performance worse*. It's a good argument for simplicity in compression and encoding choices.
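To make the branch-prediction point concrete, here is a rough sketch of a decoder that dispatches on a per-subsequence encoding tag. Everything in it is an assumption for illustration: the run layout, the `decode` function, and the three encoding kinds are hypothetical and are not the actual Parquet or ORC formats. Counting how often the encoding kind changes between consecutive subsequences is only a crude proxy for how predictable the dispatch branch is; real misprediction rates need hardware counters, as in the paper.

```python
from typing import List, Tuple

# Hypothetical toy format (NOT Parquet or ORC): a column is a list of
# (kind, payload) subsequences.
#   "rle":     payload = (value, count)
#   "literal": payload = list of values stored as-is
#   "delta":   payload = (base, list of deltas from the previous value)
Run = Tuple[str, object]


def decode(runs: List[Run]) -> Tuple[List[int], int]:
    """Decode all runs and count how often the encoding kind switches.

    The if/elif chain is the data-dependent dispatch; each switch of kind
    between consecutive runs is a point where that dispatch goes a
    different way than it did last time.
    """
    out: List[int] = []
    switches = 0
    prev_kind = None
    for kind, payload in runs:
        if prev_kind is not None and kind != prev_kind:
            switches += 1
        prev_kind = kind

        if kind == "rle":
            value, count = payload
            out.extend([value] * count)
        elif kind == "literal":
            out.extend(payload)
        elif kind == "delta":
            base, deltas = payload
            current = base
            out.append(current)
            for d in deltas:
                current += d
                out.append(current)
        else:
            raise ValueError(f"unknown encoding kind: {kind}")
    return out, switches


# The same eleven values encoded two ways.
few_kinds: List[Run] = [
    ("rle", (7, 4)),
    ("literal", [1, 2, 3, 4, 10, 12, 14]),
]
many_kinds: List[Run] = [
    ("rle", (7, 4)),
    ("delta", (1, [1, 1, 1])),
    ("literal", [10]),
    ("delta", (12, [2])),
]

values_few, switches_few = decode(few_kinds)
values_many, switches_many = decode(many_kinds)
print(len(few_kinds), "runs,", switches_few, "kind switches")
print(len(many_kinds), "runs,", switches_many, "kind switches")
```

Both encodings decode to the same values, but the second has more, shorter subsequences and more kinds to choose between, which is the shape the paper describes for ORC. Python will not show the CPU-level cost; the sketch only shows where the extra data-dependent branches come from.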
