I forgot to mention that these encodings have the notable property of allowing random access: any single value can be read without decoding the values that precede it.
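To make the random access point concrete, here is a minimal sketch in Python. It is not the actual ALP algorithm or any Arrow/Parquet API: it assumes a simplified ALP-like scheme (scale decimals to integers, subtract the minimum, bit-pack at a fixed width), and the names pack, get, encode and decode_at are hypothetical helpers made up for this illustration. The point is only that a fixed bit width lets you compute where value i lives and read it directly.

```python
def pack(values, width):
    """Pack non-negative integers into bytes, `width` bits each (little-endian)."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> width:
            raise ValueError("value does not fit in the given bit width")
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def get(packed, width, i):
    """Read the i-th value by touching only the bytes that contain it."""
    bit = i * width
    first, last = bit // 8, (bit + width + 7) // 8
    chunk = int.from_bytes(packed[first:last], "little")
    return (chunk >> (bit % 8)) & ((1 << width) - 1)


def encode(floats, exponent):
    """Simplified ALP-like step (assumption): scale by 10**exponent, subtract the minimum."""
    ints = [round(f * 10 ** exponent) for f in floats]
    base = min(ints)
    deltas = [n - base for n in ints]
    width = max(1, max(deltas).bit_length())
    return pack(deltas, width), width, base


def decode_at(packed, width, base, exponent, i):
    """Decode a single value without looking at any value before it."""
    return (get(packed, width, i) + base) / 10 ** exponent


packed, width, base = encode([1.25, 3.5, 2.75, 10.0], exponent=2)
print(decode_at(packed, width, base, 2, 2))
```

A run-length or delta encoding would not allow this, since reading value i there requires walking the values before it. That is why a vector holding this kind of encoded data could still support the usual indexed access.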

On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:

> Hello,
> Parquet is in the process of adopting new encodings [1] (Currently in POC
> stage), specifically ALP [2] and FSST [3].
> One of the discussion items is to allow late materialization: to allow
> keeping data in encoded format beyond the filter stage (for example in
> Datafusion).
> There are several advantages to this:
> - For example, if I summarize FSST as a variation of dictionary encoding
>   on substrings in the values, one can evaluate some operations on encoded
>   values without decoding them, saving memory and CPU.
> - Similarly, simplifying for brevity, ALP converts floating point values
>   to small integers that are then bitpacked.
> The Vortex project argues that keeping encoded values in in-memory vectors
> opens up opportunities for performance improvements. [4] a third party blog
> argues it's a problem as well [5]
>
> So I wanted to start a discussion to suggest, we might consider adding
> some additional vectors to support such encoded Values like an
> FSSTStringVector for example. This would not be too different from the
> dictionary encoding, or an ALPFloatingPointVector with a bit packed scheme
> not too different from what we use for nullability.
> We could also experiment with Opaque vectors.
>
> For reference, similarly motivated improvements have been done in the past
> [6]
>
> Thoughts?
>
> See:
> [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> [2] https://github.com/apache/arrow/pull/48345
> [3] https://github.com/apache/arrow/pull/48232
> [4] https://docs.vortex.dev/#in-memory
> [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
