Hello,

Parquet is in the process of adopting new encodings [1] (currently at the POC stage), specifically ALP [2] and FSST [3]. One of the discussion items is late materialization: keeping data in its encoded form beyond the filter stage (for example, in DataFusion).

There are several advantages to this:

- FSST can be summarized as a variation of dictionary encoding applied to substrings of the values. Some operations can therefore be evaluated directly on the encoded values without decoding them, saving memory and CPU.
- Similarly, simplifying for brevity, ALP converts floating-point values to small integers that are then bit-packed.

The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]. A third-party blog post makes a similar case [5].

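To make the ALP point concrete, here is a minimal sketch of evaluating a filter on encoded values. It is a toy model, not the ALP implementation from [2]: the class `AlpLikeVector` and the functions `encode`, `decode`, and `filter_ge_encoded` are hypothetical names, the decimal exponent is passed in by hand rather than chosen by sampling, values that do not round-trip are rejected rather than stored as exceptions, and the offsets are kept in a Python list rather than actually bit-packed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlpLikeVector:
    """Toy stand-in for an ALP-style encoded vector (hypothetical, not an Arrow type)."""
    exponent: int       # values are stored as round(v * 10**exponent)
    reference: int      # frame of reference: the smallest scaled integer
    offsets: List[int]  # non-negative integers, small enough to bit-pack
    bit_width: int      # bits each offset would need if it were bit-packed


def encode(values: List[float], exponent: int) -> AlpLikeVector:
    """Scale floats to integers, then subtract the minimum (frame of reference)."""
    if not values:
        raise ValueError("cannot encode an empty vector")
    scale = 10 ** exponent
    scaled = [round(v * scale) for v in values]
    # Assumption of this sketch: every value must round-trip exactly.
    for v, s in zip(values, scaled):
        if s / scale != v:
            raise ValueError(f"{v!r} does not round-trip with exponent {exponent}")
    reference = min(scaled)
    offsets = [s - reference for s in scaled]
    bit_width = max(offsets).bit_length()
    return AlpLikeVector(exponent, reference, offsets, bit_width)


def decode(vec: AlpLikeVector) -> List[float]:
    """Materialize the original floats."""
    scale = 10 ** vec.exponent
    return [(o + vec.reference) / scale for o in vec.offsets]


def filter_ge_encoded(vec: AlpLikeVector, scaled_threshold: int) -> List[bool]:
    """Evaluate `value >= scaled_threshold / 10**exponent` on the offsets only.

    The threshold is shifted into the encoded domain once, so the
    comparison is integer-only and no float is materialized.
    """
    shifted = scaled_threshold - vec.reference
    return [o >= shifted for o in vec.offsets]
```

For example, `encode([1.25, 3.5, 0.75, 2.0], 2)` stores the offsets `[50, 275, 0, 125]` against a reference of `75`, and `filter_ge_encoded(vec, 150)` answers `value >= 1.50` with integer comparisons only. A selection mask produced this way could be applied before decoding, so only the surviving rows would need to be materialized.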
So I wanted to start a discussion and suggest that we consider adding some additional vectors to support such encoded values: for example, an FSSTStringVector, which would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme not too different from what we use for nullability. We could also experiment with opaque vectors.

For reference, similarly motivated improvements have been made in the past [6].

Thoughts?

See:
[1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
[2] https://github.com/apache/arrow/pull/48345
[3] https://github.com/apache/arrow/pull/48232
[4] https://docs.vortex.dev/#in-memory
[5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
[6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
