> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts because there have been a lot of core assumptions baked into libraries about the structure of Arrow array.
I think these are impossible-to-solve problems, and they have the very real implication that Arrow becomes less effective as an interoperability format. This whole discussion assumes Arrow is a bottleneck to Parquet evolution, but there is no reason to constrain Parquet libraries to *always and only produce Arrow arrays*. Parquet libraries should provide lower-level APIs that produce some *Parquet-specific form of in-memory arrays decoupled from the Arrow spec.* The main API, the one that produces Arrow arrays, turns these into Arrow arrays more eagerly. Users of the lower-level API can deal directly with the Parquet-specific in-memory arrays.

--
Felipe

On Thu, Dec 18, 2025 at 10:38 PM Micah Kornfield <[email protected]> wrote:

> Hi Pierre,
>
> I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.
>
> There are really two problems being discussed.
>
> 1. Arrow's representation of the data (either in memory or on disk).
>
> I think the main complication brought up here is that Arrow's data stream model couples physical and logical types, making it hard to switch to different encodings between different batches of data. I think there are two main parts to resolving this:
> a. Suggest a modelling in the flatbuffer model (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) to disentangle the two of them.
> b. Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.
>
> 2. How implementations carry this through and don't end up breaking existing functionality around Arrow structures.
>
> a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings. Rethinking the architecture here might be required.
>
> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts because there have been a lot of core assumptions baked into libraries about the structure of Arrow arrays. If we can convince ourselves that there is a tractable path forward for the hard parts, I would guess it would go a long way to convincing skeptics. The easiest way of proving or disproving something like this would be trying to prototype something in at least one of the libraries to see what the complexity looks like (e.g. in arrow-rs or arrow-cpp, which have sizable compute implementations).
>
> In short, the spec changes themselves are probably not super large, but the downstream implications of them are. As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.
>
> Cheers,
> Micah
>
> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:
>
> > I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, where I think this really matters in our case.
> >
> > It seems that as Parquet adopts more sophisticated, non-data-dependent encodings like ALP and FSST, there might be a risk of Arrow becoming a bottleneck if the format requires immediate "densification" into standard arrays.
> >
> > I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself is essentially a pass-through or a reorganization and might not strictly require that expansion.
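To make the suggestion above concrete, here is a minimal sketch of what a lower-level, Parquet-specific reader API could look like for the compaction case Pierre describes. The names (EncodedPage, LowLevelColumnReader, LowLevelColumnWriter) are hypothetical and are not part of the parquet or arrow-rs crates.

    // Hypothetical sketch: none of these types exist in the `parquet` or `arrow`
    // crates. They only illustrate the shape of a lower-level API that hands out
    // decompressed-but-still-encoded pages instead of Arrow arrays.

    /// A decompressed but still-encoded Parquet page: an encoding tag plus the
    /// raw buffers and the per-page parameters needed to decode them later.
    pub struct EncodedPage {
        pub encoding: String,       // e.g. "PLAIN", "ALP", "FSST"
        pub metadata: Vec<u8>,      // per-page parameters (e.g. ALP exponent/factor)
        pub buffers: Vec<Vec<u8>>,  // encoded values, offsets, symbol tables, ...
        pub num_values: usize,
    }

    pub trait LowLevelColumnReader {
        /// Lower-level API: callers that understand the encodings can consume
        /// pages without densifying them into Arrow arrays.
        fn next_page(&mut self) -> Option<EncodedPage>;
    }

    pub trait LowLevelColumnWriter {
        fn write_page(&mut self, page: EncodedPage);
    }

    /// Compaction as a pass-through: pages flow from readers to the writer
    /// without being expanded into a flat Arrow representation.
    pub fn compact(
        inputs: Vec<Box<dyn LowLevelColumnReader>>,
        output: &mut dyn LowLevelColumnWriter,
    ) -> usize {
        let mut rows = 0;
        for mut reader in inputs {
            while let Some(page) = reader.next_page() {
                rows += page.num_values;
                output.write_page(page);
            }
        }
        rows
    }

The Arrow-producing API would sit on top of this and simply decode every page eagerly, so consumers that only speak Arrow never see EncodedPage.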
> > I (think) I understand the concerns with adding more types and late materialisation concepts.
> >
> > However I'm curious if there is a path for Arrow to evolve that would allow for this kind of efficiency while avoiding said complexity, perhaps by:
> > * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding - the onus of compute left to the user.
> > * Considering if "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.
> >
> > If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
> >
> > I'd be very interested to hear if this is a direction the community feels is viable or if these optimizations are viewed as being better handled strictly within the compute engines themselves.
> >
> > I am (still) quite new to Arrow and still catching up on the nuances of the specification, so please excuse any naive assumptions in my reasoning.
> >
> > Best!
> >
> > On 2025/12/18 17:55:58 Weston Pace wrote:
> > > > the real challenge is having compute kernels that are complete
> > > > so they can be flexible enough to handle all the physical forms
> > > > of a logical type
> > >
> > > Agreed. My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDb) the implementations are still rather limited. Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help. Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.
> > >
> > > > The definition of any logical type can only be coupled to a compute system.
> > >
> > > I guess I disagree here. We did this once already in Substrait. I think there is value in an interoperable agreement of what constitutes a logical type. Maybe it can live in Substrait then since there is also a growing collection there of compute function definitions and interpretations.
> > >
> > > > Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.
> > >
> > > Why? My semi-formal internal definition has been "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that given every function F and every value x we have F(x) = M(F(M(x)))". Under this definition dictionary and REE are just encodings while uint8 and uint16 are different types (since they have different results for functions like 100+200).
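A toy, self-contained illustration of the equivalence rule Weston sketches above, with dictionary encoding playing the role of the mapping M. None of this is Arrow or Substrait API; it only shows why dictionary encoding passes the rule while uint8/uint16 do not.

    // Dictionary encoding as M, dict_decode as its inverse.
    fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<usize>) {
        let mut dict: Vec<String> = Vec::new();
        let keys: Vec<usize> = values
            .iter()
            .map(|v| match dict.iter().position(|d| d.as_str() == *v) {
                Some(i) => i,
                None => {
                    dict.push(v.to_string());
                    dict.len() - 1
                }
            })
            .collect();
        (dict, keys)
    }

    fn dict_decode(dict: &[String], keys: &[usize]) -> Vec<String> {
        keys.iter().map(|&k| dict[k].clone()).collect()
    }

    fn main() {
        let x = ["arrow", "parquet", "arrow"];

        // F = uppercase. Applying F to the encoded form (the dictionary alone)
        // and mapping back gives the same answer as applying F directly, so
        // dictionary encoding is "just an encoding" of the string type.
        let direct: Vec<String> = x.iter().map(|s| s.to_uppercase()).collect();
        let (dict, keys) = dict_encode(&x);
        let upper_dict: Vec<String> = dict.iter().map(|s| s.to_uppercase()).collect();
        let via_encoding = dict_decode(&upper_dict, &keys);
        assert_eq!(direct, via_encoding);

        // uint8 vs uint16: the "same" function gives different answers
        // (100 + 200 wraps for u8), so they are different logical types.
        assert_eq!(100u8.wrapping_add(200), 44);
        assert_eq!(100u16 + 200, 300);
    }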
> > > > Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.
> > >
> > > This is true regardless
> > >
> > > > it will be unrealistic to expect more than one implementation of such a system.
> > >
> > > Agreed, implementing all the compute kernels is not an Arrow problem. Though maybe those waters have been muddied in the past since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.
> > >
> > > > A semi-opaque format could be designed while allowing slicing
> > >
> > > Ah, by leaving the buffers untouched but setting the offset? That makes sense.
> > >
> > > > Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data
> > >
> > > I think we can also add operations like filter and take to that list.
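A rough sketch of the semi-opaque-but-sliceable idea discussed above; SemiOpaqueArray is a made-up type, not anything in arrow-rs or the Arrow spec.

    // Hypothetical "semi-opaque but sliceable" array: the buffers are treated
    // as a black box, but offset and length are first-class, so slicing works
    // much like it does for String/BinaryView (the buffers keep riding along).
    use std::sync::Arc;

    pub struct SemiOpaqueArray {
        pub encoding: &'static str,     // e.g. "FSST" or "ALP"
        pub buffers: Vec<Arc<Vec<u8>>>, // opaque, shared, never copied on slice
        pub offset: usize,              // logical start within the encoded data
        pub len: usize,                 // logical number of values
    }

    impl SemiOpaqueArray {
        /// Zero-copy slice: only offset/len change; the encoded buffers are
        /// untouched, so no knowledge of the encoding is required.
        pub fn slice(&self, offset: usize, len: usize) -> Self {
            assert!(offset + len <= self.len);
            Self {
                encoding: self.encoding,
                buffers: self.buffers.clone(), // Arc bump, no data copy
                offset: self.offset + offset,
                len,
            }
        }

        // Concatenation and C Data Interface export are where this breaks down:
        // both are expected to drop data outside the slice, which does require
        // understanding the encoding instead of treating the buffers as opaque.
    }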
> > > On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:
> > >
> > > > Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.
> > > >
> > > > > I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.
> > > >
> > > > This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type like "string" or have a smart enough function dispatcher that performs casts on-the-fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).
> > > >
> > > > That is what makes introducing these ideas to Arrow so tricky. Arrow being the data format that shines in interoperability can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format, interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.
> > > >
> > > > > Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.
> > > >
> > > > Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels). It's already challenging enough to get people to stop converting columnar data to row-based arrays.
> > > >
> > > > My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.
> > > >
> > > > A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.
> > > >
> > > > > Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> > > >
> > > > The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.
> > > >
> > > > --
> > > > Felipe
> > > >
> > > > On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]> wrote:
> > > >
> > > > > I think this is a very interesting idea. This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion. Though it isn't without some challenges.
> > > > >
> > > > > > It seems FSSTStringVector/Array could potentially be modelled
> > > > > > as an extension type
> > > > > > ...
> > > > > > This would however require a fixed dictionary, so might not
> > > > > > be desirable.
> > > > > > ...
> > > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> > > > > > to represent as extension types.
> > > > > > ...
> > > > > > Each batch of values has a different metadata parameter set.
> > > > >
> > > > > I think these are basically the same problem. From what I've seen in implementations a format will typically introduce some kind of small batch concept (I think every 1024 values in the Fast Lanes paper IIRC). So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow. If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns and I don't know if that's always true.
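For what a "batched array" with per-small-batch parameters might look like, here is a hypothetical sketch (the types and the simplified ALP-style reconstruction are made up); it mainly shows why per-batch metadata does not fit a schema that fixes one physical layout for the whole stream.

    // Hypothetical "batched array": one logical float64 column made of small
    // mini-batches (e.g. 1024 values each), where every mini-batch carries its
    // own encoding and parameters. None of this exists in arrow-rs.
    pub enum MiniBatch {
        Plain(Vec<f64>),
        /// ALP-style: integers plus per-batch decoding parameters.
        Alp { ints: Vec<i64>, exponent: i32, factor: i32 },
    }

    pub struct BatchedF64Array {
        pub batches: Vec<MiniBatch>, // each batch covers up to ~1024 logical values
    }

    impl BatchedF64Array {
        pub fn len(&self) -> usize {
            self.batches
                .iter()
                .map(|b| match b {
                    MiniBatch::Plain(v) => v.len(),
                    MiniBatch::Alp { ints, .. } => ints.len(),
                })
                .sum()
        }

        /// Densify into a flat vector: the step a consumer that only understands
        /// plain float64 arrays would have to perform.
        pub fn to_dense(&self) -> Vec<f64> {
            let mut out = Vec::with_capacity(self.len());
            for b in &self.batches {
                match b {
                    MiniBatch::Plain(v) => out.extend_from_slice(v),
                    MiniBatch::Alp { ints, exponent, factor } => {
                        // Simplified stand-in for the real ALP reconstruction.
                        let scale = 10f64.powi(*factor) / 10f64.powi(*exponent);
                        out.extend(ints.iter().map(|&i| i as f64 * scale));
                    }
                }
            }
            out
        }
    }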
> > > > > > One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> > > > >
> > > > > > Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.
> > > > >
> > > > > Pierre brings up another challenge in achieving this goal, which may be more significant. The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like FastLanes and Vortex). A column might use ALP for one page and then use PLAIN encoding for the next page. This makes it difficult to represent a stream of data with the typical Arrow schema we have today. I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation. Still, that can be an orthogonal discussion to FSST and ALP representation.
> > > > >
> > > > > > We could also experiment with Opaque vectors.
> > > > >
> > > > > This could be an interesting approach too. I don't know if they could be entirely opaque though. Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type. Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> > > > >
> > > > > On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.
> > > > > >
> > > > > > Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.
> > > > > >
> > > > > > As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.
> > > > > >
> > > > > > Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.
> > > > > >
> > > > > > This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Pierre
> > > > > >
> > > > > > On 2025/12/11 06:49:30 Micah Kornfield wrote:
> > > > > > > I think this is an interesting idea. Julien, do you have a proposal for scope? Is the intent to be 1:1 with any new encoding that is added to Parquet? For instance, would the intent be to also put cascading encodings in Arrow?
> > > > > > >
> > > > > > > > We could also experiment with Opaque vectors.
> > > > > > >
> > > > > > > Did you mean this as a new type? I think this would be necessary for ALP.
> > > > > > >
> > > > > > > It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.
> > > > > > >
> > > > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
> > > > > > > 1. There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
> > > > > > > 2. Each batch of values has a different metadata parameter set.
> > > > > > >
> > > > > > > So it seems there is no easy way out for the ALP encoding and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" Type, which would have a buffer that is only interpretable based on its extension type. An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out. The latter also has its obvious downsides.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Micah
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> > > > > > > [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
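A rough sketch of the extension-type idea from the email above: the symbol table would be serialized into the extension metadata (e.g. under the ARROW:extension:metadata key) and the compressed bytes stored in a plain binary column. The FsstSymbolTable type is hypothetical, though the escape-code behaviour follows the FSST design.

    /// FSST symbol table: codes 0..=254 map to short byte sequences, code 255
    /// escapes a single literal byte. Hypothetical type, not an existing
    /// canonical extension type.
    pub struct FsstSymbolTable {
        pub symbols: Vec<Vec<u8>>, // at most 255 entries
    }

    impl FsstSymbolTable {
        /// Decode one compressed value (assumes well-formed input).
        pub fn decode(&self, compressed: &[u8]) -> Vec<u8> {
            let mut out = Vec::new();
            let mut i = 0;
            while i < compressed.len() {
                match compressed[i] {
                    255 => {
                        out.push(compressed[i + 1]); // escaped literal byte
                        i += 2;
                    }
                    code => {
                        out.extend_from_slice(&self.symbols[code as usize]);
                        i += 1;
                    }
                }
            }
            out
        }
    }

    fn main() {
        // "ar" and "row" as symbols 0 and 1; 'X' is not in the table, so it is escaped.
        let table = FsstSymbolTable {
            symbols: vec![b"ar".to_vec(), b"row".to_vec()],
        };
        let compressed = [0u8, 1, 255, b'X'];
        assert_eq!(table.decode(&compressed), b"arrowX".to_vec());
    }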
> > > > > > >
> > > > > > > On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> > > > > > >
> > > > > > > > I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.
> > > > > > > >
> > > > > > > > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > > Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
> > > > > > > > > One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> > > > > > > > > There are several advantages to this:
> > > > > > > > > - For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
> > > > > > > > > - Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
> > > > > > > > > The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]. A third-party blog argues it's a problem as well [5].
> > > > > > > > >
> > > > > > > > > So I wanted to start a discussion to suggest we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme not too different from what we use for nullability.
> > > > > > > > > We could also experiment with Opaque vectors.
> > > > > > > > >
> > > > > > > > > For reference, similarly motivated improvements have been done in the past [6]
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > See:
> > > > > > > > > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > > > > > > > [2] https://github.com/apache/arrow/pull/48345
> > > > > > > > > [3] https://github.com/apache/arrow/pull/48232
> > > > > > > > > [4] https://docs.vortex.dev/#in-memory
> > > > > > > > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > > > > > > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
