I forgot to mention that those encodings have the property of allowing
random access without decoding the preceding values.
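
As a rough illustration of what random access means here, below is a
simplified Python sketch. It is not the actual ALP format or any Arrow API:
it assumes every value round-trips through a single decimal exponent (real
ALP handles exceptions separately), and the names PackedDecimals, encode and
get are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PackedDecimals:
    """Hypothetical container for fixed-width bit-packed integers."""
    exponent: int  # values were multiplied by 10**exponent before rounding
    base: int      # frame of reference: the smallest encoded integer
    width: int     # number of bits used per value
    count: int     # number of values
    bits: int      # packed payload; value i lives at bit offset i * width


def encode(values, exponent):
    """Turn floats into small integers, then bit-pack them at a fixed width."""
    if not values:
        raise ValueError("need at least one value")
    scale = 10 ** exponent
    ints = [round(v * scale) for v in values]
    # Simplification: reject values that do not round-trip instead of
    # storing them as exceptions.
    for v, n in zip(values, ints):
        if n / scale != v:
            raise ValueError(f"{v!r} does not round-trip with exponent {exponent}")
    base = min(ints)
    width = max(1, (max(ints) - base).bit_length())
    bits = 0
    for i, n in enumerate(ints):
        bits |= (n - base) << (i * width)
    return PackedDecimals(exponent, base, width, len(ints), bits)


def get(packed, i):
    """Decode only the value at index i; no other value is touched."""
    if not 0 <= i < packed.count:
        raise IndexError(i)
    mask = (1 << packed.width) - 1
    delta = (packed.bits >> (i * packed.width)) & mask
    return (packed.base + delta) / 10 ** packed.exponent


values = [1.25, 3.5, 2.75, 10.0]
packed = encode(values, exponent=2)
third = get(packed, 2)  # reads one fixed-width slot at bit offset 2 * width
```

Because every slot has the same width, the position of value i is computed
directly from i, which is what makes skipping the earlier values possible.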

On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:

> Hello,
> Parquet is in the process of adopting new encodings [1] (currently in the
> POC stage), specifically ALP [2] and FSST [3].
> One of the discussion items is late materialization: keeping data in its
> encoded format beyond the filter stage (for example in DataFusion).
> There are several advantages to this:
> - For example, if I summarize FSST as a variation of dictionary encoding
> on substrings in the values, one can evaluate some operations on encoded
> values without decoding them, saving memory and CPU.
> - Similarly, simplifying for brevity, ALP converts floating-point values
> to small integers that are then bit-packed.
> The Vortex project argues that keeping encoded values in in-memory vectors
> opens up opportunities for performance improvements [4]. A third-party blog
> post argues that this is a problem as well [5].
>
> So I wanted to start a discussion: we might consider adding some
> additional vectors to support such encoded values. Examples would be an
> FSSTStringVector, which would not be too different from dictionary
> encoding, or an ALPFloatingPointVector with a bit-packed scheme not too
> different from what we use for nullability.
> We could also experiment with Opaque vectors.
>
> For reference, similarly motivated improvements have been made in the past
> [6].
>
> Thoughts?
>
> See:
> [1]
> https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> [2] https://github.com/apache/arrow/pull/48345
> [3] https://github.com/apache/arrow/pull/48232
> [4] https://docs.vortex.dev/#in-memory
> [5]
> https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> [6]
> https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
>
