Re: [DISCUSS] Support for late materialization in Parquet -> Arrow

Micah Kornfield Wed, 10 Dec 2025 22:49:51 -0800

I think this is an interesting idea.  Julien, do you have a proposal for
scope?  Is the intent to be 1:1 with any new encoding that is added to
Parquet?  For instance would the intent be to also put cascading encodings
in Arrow?

> We could also experiment with Opaque vectors.

Did you mean this as a new type? I think this would be necessary for ALP.

It seems FSSTStringVector/Array could potentially be modelled as an
extension type (dictionary stored as part of the type metadata?) on top of
a byte array. This would however require a fixed dictionary, so might not
be desirable.
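
To make the fixed-dictionary constraint concrete, here is a minimal sketch in plain Python. It is not FSST itself and not an Arrow API: `FsstType`, `decode` and the tiny symbol table are hypothetical names made up for illustration, and the format is simplified to one-byte codes plus an escape byte. The point is only that once the symbol table is part of the type, two batches with different tables have different types.

```python
from dataclasses import dataclass
from typing import Tuple

ESCAPE = 255  # simplified convention: the byte after ESCAPE is a literal


@dataclass(frozen=True)
class FsstType:
    """Hypothetical stand-in for an extension type over a byte array.

    The symbol table is carried as type metadata, so it is fixed for
    every array of this type.
    """
    symbols: Tuple[bytes, ...]  # index in the tuple == one-byte code


def decode(ftype: FsstType, encoded: bytes) -> bytes:
    """Expand one encoded value using the type-level symbol table."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == ESCAPE:
            out.append(encoded[i + 1])  # literal byte, not in the table
            i += 2
        else:
            out += ftype.symbols[code]
            i += 1
    return bytes(out)


url_type = FsstType((b"http://", b"www.", b".com"))
value = decode(url_type, bytes([0, 1, ESCAPE, ord("a"), 2]))

# A batch compressed with a different table is a different type, which is
# the schema-level consequence of putting the table in type metadata.
other_type = FsstType((b"https://", b"www.", b".org"))
same_schema = url_type == other_type
```

With a per-batch symbol table, every batch would carry its own type, which is why the table would have to be fixed for this approach to work.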

ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
to represent as extension types.

1.  There is no natural alignment with any of the existing types (and the
bit-packing width can effectively vary by batch).
2.  Each batch of values has a different metadata parameter set.
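
A rough sketch of the two points above, in plain Python. This is a heavily simplified stand-in for ALP (a single decimal exponent, frame-of-reference, then bit-packing; no exceptions handling), and `AlpBatch`, `encode_batch` and `get` are hypothetical names, not anything from Arrow or Parquet. It shows that the exponent, base and bit width come out different for each batch, and that a value can still be read at an arbitrary index without decoding the ones before it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlpBatch:
    exponent: int   # stored int = round(value * 10**exponent)
    base: int       # frame-of-reference minimum subtracted before packing
    bit_width: int  # bits per packed value, chosen per batch
    count: int
    packed: int     # bit-packed payload, kept as one big int for simplicity


def encode_batch(values: List[float], max_exponent: int = 10) -> AlpBatch:
    # Pick the smallest exponent for which every value round-trips exactly.
    for e in range(max_exponent + 1):
        ints = [round(v * 10**e) for v in values]
        if all(i / 10**e == v for i, v in zip(ints, values)):
            break
    else:
        raise ValueError("not exactly representable in this simplified sketch")
    base = min(ints)
    deltas = [i - base for i in ints]
    bit_width = max(max(deltas).bit_length(), 1)
    packed = 0
    for pos, d in enumerate(deltas):
        packed |= d << (pos * bit_width)
    return AlpBatch(e, base, bit_width, len(values), packed)


def get(batch: AlpBatch, i: int) -> float:
    # Random access: value i lives at bit offset i * bit_width.
    mask = (1 << batch.bit_width) - 1
    delta = (batch.packed >> (i * batch.bit_width)) & mask
    return (delta + batch.base) / 10**batch.exponent


batch_a = encode_batch([1.5, 2.25, 3.0])
batch_b = encode_batch([0.1, 0.2, 0.3])
params_a = (batch_a.exponent, batch_a.base, batch_a.bit_width)
params_b = (batch_b.exponent, batch_b.base, batch_b.bit_width)
```

Here `params_a` and `params_b` differ even though both batches would belong to the same logical float column, so neither the bit width nor the parameters can live in a type-level description.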

So it seems there is no easy way out for the ALP encoding. We either need
to pay the cost of adding a new type (which is not necessarily trivial), or
we would have to do some work to make a new opaque "Custom" type whose
buffer is only interpretable based on its extension type.  An easy way of
shoe-horning this in would be to add a ParquetScalar extension type, which
simply contains the decompressed but still encoded Parquet page with
repetition and definition levels stripped out.  The latter also has its
obvious downsides.
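
For illustration only, the opaque option could look roughly like the following plain-Python sketch. `OpaqueArray`, `DECODERS`, `materialize` and the extension name `example.parquet_plain_int32` are all hypothetical; nothing here is an existing Arrow or Parquet API. The payload in the example is the body of a Parquet PLAIN-encoded INT32 page (little-endian 4-byte values), standing in for a page with the levels already stripped.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OpaqueArray:
    extension_name: str  # the only thing that says how to read the payload
    length: int          # number of logical values
    payload: bytes       # encoded bytes, meaningless without a decoder


# Hypothetical registry: consumers that know an extension register a decoder.
DECODERS: Dict[str, Callable[[OpaqueArray], List[int]]] = {}


def materialize(arr: OpaqueArray) -> List[int]:
    """Decode lazily; a consumer without a decoder cannot use the data."""
    try:
        decoder = DECODERS[arr.extension_name]
    except KeyError:
        raise TypeError(f"no decoder registered for {arr.extension_name!r}")
    return decoder(arr)


def decode_plain_int32(arr: OpaqueArray) -> List[int]:
    # Parquet PLAIN encoding for INT32: little-endian 4-byte integers.
    return list(struct.unpack(f"<{arr.length}i", arr.payload))


DECODERS["example.parquet_plain_int32"] = decode_plain_int32

page = OpaqueArray("example.parquet_plain_int32", 3, struct.pack("<3i", 1, -2, 3))
values = materialize(page)
```

The `TypeError` branch is the downside in miniature: any consumer that does not recognize the extension name has a buffer it cannot interpret at all.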

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
[2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf

On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:

> I forgot to mention that those encodings have the particularity of allowing
> random access without decoding previous values.
>
> On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
>
> > Hello,
> > Parquet is in the process of adopting new encodings [1] (currently in POC
> > stage), specifically ALP [2] and FSST [3].
> > One of the discussion items is to allow late materialization: to allow
> > keeping data in encoded format beyond the filter stage (for example in
> > DataFusion).
> > There are several advantages to this:
> > - For example, if I summarize FSST as a variation of dictionary encoding
> > on substrings in the values, one can evaluate some operations on encoded
> > values without decoding them, saving memory and CPU.
> > - Similarly, simplifying for brevity, ALP converts floating point values
> > to small integers that are then bitpacked.
> > The Vortex project argues that keeping encoded values in in-memory vectors
> > opens up opportunities for performance improvements. [4] A third-party blog
> > argues it's a problem as well. [5]
> >
> > So I wanted to start a discussion to suggest that we might consider adding
> > some additional vectors to support such encoded values, like an
> > FSSTStringVector for example. This would not be too different from the
> > dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme
> > not too different from what we use for nullability.
> > We could also experiment with Opaque vectors.
> >
> > For reference, similarly motivated improvements have been done in the
> > past. [6]
> >
> > Thoughts?
> >
> > See:
> > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > [2] https://github.com/apache/arrow/pull/48345
> > [3] https://github.com/apache/arrow/pull/48232
> > [4] https://docs.vortex.dev/#in-memory
> > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
> >
>
