Hello,
Parquet is in the process of adopting new encodings [1] (currently at the
POC stage), specifically ALP [2] and FSST [3].
One of the discussion items is late materialization: keeping data in its
encoded form beyond the filter stage (for example, in DataFusion).
There are several advantages to this:
- FSST can be summarized as a variation of dictionary encoding applied to
substrings of the values, so some operations can be evaluated on the encoded
values without decoding them, saving memory and CPU.
- Similarly, simplifying for brevity, ALP converts floating-point values to
small integers that are then bit-packed.
The Vortex project argues that keeping encoded values in in-memory vectors
opens up opportunities for performance improvements [4]. A third-party blog
post argues that the lack of such support is a problem as well [5].
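
To make the first point concrete, here is a minimal sketch of evaluating an
equality filter directly on encoded values. It uses plain whole-value
dictionary encoding as a simplified stand-in, not FSST itself, and the class
and function names (DictEncodedStrings, encode, filter_equals) are
hypothetical, not existing Arrow or Parquet APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DictEncodedStrings:
    """Hypothetical encoded string vector (not a real Arrow type)."""
    dictionary: List[str]  # distinct values, in first-seen order
    codes: List[int]       # one index into `dictionary` per row


def encode(values: List[str]) -> DictEncodedStrings:
    """Dictionary-encode a list of strings."""
    dictionary: List[str] = []
    index = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return DictEncodedStrings(dictionary, codes)


def filter_equals(vec: DictEncodedStrings, literal: str) -> List[int]:
    """Return matching row numbers without decoding any row.

    The literal is looked up in the dictionary once; the per-row work is
    then an integer comparison on the codes.
    """
    try:
        target = vec.dictionary.index(literal)
    except ValueError:
        return []  # literal absent from the dictionary: no row can match
    return [row for row, code in enumerate(vec.codes) if code == target]


vec = encode(["foo", "bar", "foo", "baz"])
matches = filter_equals(vec, "foo")
```

The rows that pass the filter can stay encoded and be materialized only if a
later operator actually needs the string bytes.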

So I wanted to start a discussion: we might consider adding vectors that
support such encoded values. Examples would be an FSSTStringVector, which
would not be too different from dictionary encoding, or an
ALPFloatingPointVector with a bit-packed scheme not too different from what
we use for nullability.
We could also experiment with opaque vectors.
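
As a rough illustration of what an ALP-style vector would hold, the sketch
below scales floats by a power of ten, keeps the resulting integers relative
to their minimum, and reports the bit width needed to pack them. This is a
heavy simplification under my own assumptions: real ALP also handles values
that do not round-trip (as exceptions) and chooses its parameters
differently. The function names are hypothetical.

```python
def alp_like_encode(values, max_exponent=10):
    """Find a decimal exponent e such that every value round-trips exactly
    through round(v * 10**e), then store the integers relative to their
    minimum. Assumes `values` is non-empty."""
    for e in range(max_exponent + 1):
        scale = 10 ** e
        ints = [round(v * scale) for v in values]
        if all(i / scale == v for i, v in zip(ints, values)):
            base = min(ints)
            deltas = [i - base for i in ints]
            # Bits needed per value if the deltas were bit-packed.
            bit_width = max(deltas).bit_length()
            return e, base, deltas, bit_width
    # Simplification: real ALP would store such values as exceptions.
    raise ValueError("no exact decimal exponent found")


def alp_like_decode(e, base, deltas):
    """Invert alp_like_encode."""
    scale = 10 ** e
    return [(d + base) / scale for d in deltas]


values = [1.25, 3.5, 2.75, 10.0]
e, base, deltas, bit_width = alp_like_encode(values)
decoded = alp_like_decode(e, base, deltas)
```

In this example each value fits in 10 bits instead of 64, and a vector type
that kept (e, base, packed deltas) in memory could defer the conversion back
to doubles until it is needed.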

For reference, similarly motivated improvements have been made in the past
[6].

Thoughts?

See:
[1]
https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
[2] https://github.com/apache/arrow/pull/48345
[3] https://github.com/apache/arrow/pull/48232
[4] https://docs.vortex.dev/#in-memory
[5]
https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
[6]
https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
