Hi all, I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations based on what I see happening with projects like Vortex.
Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, the data stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.

As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even when the engine could have operated on the encoded data. Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee that the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.

This would keep the safety of the interoperability guarantee (Arrow making sure new encodings have a canonical representation) while leaving the onus of implementing the efficient flow on the consumer, decoupling efficiency from interoperability.
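To make this concrete, below is a rough sketch in Python of what such a negotiation could look like. Everything in it (EncodedBatch, negotiate, the toy run-length payload) is made up for illustration; nothing like this exists in Arrow today.

    from dataclasses import dataclass
    from typing import Callable, Set

    import pyarrow as pa

    @dataclass
    class EncodedBatch:
        # Hypothetical: a column whose values are still in a
        # non-canonical encoding, plus a way to canonicalize it.
        encoding: str                         # e.g. "fsst" or "alp"
        payload: object                       # opaque encoded buffers
        canonicalize: Callable[[], pa.Array]  # decode to a plain array

    def negotiate(batch: EncodedBatch, supported: Set[str]):
        # If the consumer understands the encoding, let the data flow
        # through untouched; otherwise pay the decode cost once, here.
        return batch if batch.encoding in supported else batch.canonicalize()

    # Toy run-length payload standing in for a real encoding.
    runs = [("a", 3), ("b", 2)]
    batch = EncodedBatch(
        encoding="rle",
        payload=runs,
        canonicalize=lambda: pa.array([v for v, n in runs for _ in range(n)]),
    )

    print(negotiate(batch, supported={"rle"}))  # stays encoded
    print(negotiate(batch, supported=set()))    # canonical Arrow array

The point is simply that the decode cost is paid exactly once, at the boundary, and only by consumers that do not understand the encoding.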
Thanks!
Pierre

On 2025/12/11 06:49:30 Micah Kornfield wrote:
> I think this is an interesting idea. Julien, do you have a proposal for
> scope? Is the intent to be 1:1 with any new encoding that is added to
> Parquet? For instance, would the intent be to also put cascading encodings
> in Arrow?
>
> > We could also experiment with Opaque vectors.
>
> Did you mean this as a new type? I think this would be necessary for ALP.
>
> It seems FSSTStringVector/Array could potentially be modelled as an
> extension type (dictionary stored as part of the type metadata?) on top of
> a byte array. This would however require a fixed dictionary, so might not
> be desirable.
>
> ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> to represent as extension types:
>
> 1. There is no natural alignment with any of the existing types (and the
> bit-packing width can effectively vary by batch).
> 2. Each batch of values has a different metadata parameter set.
>
> So it seems there is no easy way out for the ALP encoding, and we either
> need to pay the cost of adding a new type (which is not necessarily
> trivial) or we would have to do some work to literally make a new opaque
> "Custom" type, which would have a buffer that is only interpretable based
> on its extension type. An easy way of shoe-horning this in would be to add
> a ParquetScalar extension type, which simply contains the decompressed but
> encoded Parquet page with repetition and definition levels stripped out.
> The latter also has its obvious downsides.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
>
> On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
>
> > I forgot to mention that those encodings have the particularity of
> > allowing random access without decoding previous values.
> >
> > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> >
> > > Hello,
> > > Parquet is in the process of adopting new encodings [1] (currently in
> > > POC stage), specifically ALP [2] and FSST [3].
> > > One of the discussion items is to allow late materialization: to allow
> > > keeping data in encoded format beyond the filter stage (for example in
> > > DataFusion).
> > > There are several advantages to this:
> > > - For example, if I summarize FSST as a variation of dictionary encoding
> > > on substrings in the values, one can evaluate some operations on encoded
> > > values without decoding them, saving memory and CPU.
> > > - Similarly, simplifying for brevity, ALP converts floating-point values
> > > to small integers that are then bit-packed.
> > > The Vortex project argues that keeping encoded values in in-memory
> > > vectors opens up opportunities for performance improvements [4]; a
> > > third-party blog argues it's a problem as well [5].
> > >
> > > So I wanted to start a discussion to suggest we might consider adding
> > > some additional vectors to support such encoded values: an
> > > FSSTStringVector, for example (not too different from dictionary
> > > encoding), or an ALPFloatingPointVector with a bit-packing scheme not
> > > too different from what we use for nullability.
> > > We could also experiment with Opaque vectors.
> > >
> > > For reference, similarly motivated improvements have been done in the
> > > past [6].
> > >
> > > Thoughts?
> > >
> > > See:
> > > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > [2] https://github.com/apache/arrow/pull/48345
> > > [3] https://github.com/apache/arrow/pull/48232
> > > [4] https://docs.vortex.dev/#in-memory
> > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
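P.S. To make the extension-type idea Micah raises above concrete, here is a minimal pyarrow sketch of an FSST-style string type, assuming the symbol table can be serialized into the extension type metadata. The name "example.fsst-string" is made up and the payload bytes are fake; the fixed symbol table per field is exactly the limitation Micah points out.

    import pyarrow as pa

    class FsstStringType(pa.ExtensionType):
        # Hypothetical: FSST-compressed strings stored as binary, with
        # the (fixed) symbol table carried in the type metadata.
        def __init__(self, symbol_table: bytes):
            self.symbol_table = symbol_table
            super().__init__(pa.binary(), "example.fsst-string")

        def __arrow_ext_serialize__(self) -> bytes:
            # The symbol table travels with the schema, not per batch.
            return self.symbol_table

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls(serialized)

    # Pretend these bytes came out of an FSST compressor.
    fsst = FsstStringType(symbol_table=b"fake-symbol-table")
    storage = pa.array([b"\x01\x02", b"\x03"], type=pa.binary())
    arr = pa.ExtensionArray.from_storage(fsst, storage)
    print(arr.type, len(arr))

One caveat: a consumer that does not recognize the extension type falls back to the raw binary storage, which is still FSST-encoded, so this alone does not provide the "safe canonical read" I described above.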
