Hi Vladimir,

What you are requesting is already supported in parquet-rs. In particular, if you request a UTF8 or Binary DictionaryArray for the column, it will decode the column preserving the dictionary encoding. You can override the embedded arrow schema, if any, using ArrowReaderOptions::with_schema [1]. Provided a RecordBatch does not span row groups, and therefore dictionaries (the async reader never produces such batches), this should never materialize the dictionary. FWIW the ViewArray decoders will also preserve the dictionary encoding; however, the dictionary-encoded nature is less explicit in the resulting arrays.
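
For example, a rough, untested sketch with the synchronous reader, assuming a file containing a single nullable utf8 column named "category" (the file name and column name are placeholders); the async ParquetRecordBatchStreamBuilder takes the same ArrowReaderOptions via new_with_options:

use std::fs::File;
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;

    // Ask for "category" as Dictionary<Int32, Utf8> rather than whatever
    // concrete string type the embedded arrow schema declares.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "category",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]));

    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;

    for batch in reader {
        // Each batch's column is a DictionaryArray whose values buffer comes
        // straight from the parquet dictionary page.
        println!("{:?}", batch?.column(0).data_type());
    }
    Ok(())
}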

As for using integer comparisons to optimise dictionary filtering, you should be able to construct an ArrowPredicate that computes the filter over the dictionary values, caches the result for future use (e.g. using ptr_eq to detect when the dictionary changes), and then filters based on the dictionary keys.
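
A rough, untested sketch of such a predicate, assuming its projection covers a single column that is read as Dictionary<Int32, Utf8> (DictEqPredicate and its fields are made-up names, not an existing API):

use arrow_array::cast::AsArray;
use arrow_array::types::Int32Type;
use arrow_array::{Array, ArrayRef, BooleanArray, RecordBatch, StringArray};
use arrow_schema::ArrowError;
use parquet::arrow::arrow_reader::ArrowPredicate;
use parquet::arrow::ProjectionMask;

struct DictEqPredicate {
    projection: ProjectionMask,
    target: String,
    // Values of the last dictionary seen, plus the filter result per key.
    cache: Option<(ArrayRef, Vec<bool>)>,
}

impl ArrowPredicate for DictEqPredicate {
    fn projection(&self) -> &ProjectionMask {
        &self.projection
    }

    fn evaluate(&mut self, batch: RecordBatch) -> Result<BooleanArray, ArrowError> {
        let dict = batch.column(0).as_dictionary::<Int32Type>();
        let values = dict.values();

        // Re-evaluate the comparison against the dictionary values only when
        // the dictionary changes, detected via buffer pointer equality.
        let stale = match &self.cache {
            Some((cached, _)) => !cached.to_data().ptr_eq(&values.to_data()),
            None => true,
        };
        if stale {
            let strings: &StringArray = values.as_string();
            let matches = (0..strings.len())
                .map(|i| strings.is_valid(i) && strings.value(i) == self.target)
                .collect();
            self.cache = Some((values.clone(), matches));
        }

        // Per row this is an integer key lookup, no string comparison.
        let matches = &self.cache.as_ref().unwrap().1;
        Ok(dict
            .keys()
            .iter()
            .map(|k| k.map(|k| matches[k as usize]))
            .collect())
    }
}

You would then register it on the reader builder with something like builder.with_row_filter(RowFilter::new(vec![Box::new(predicate)])), combined with the schema override above so the filter column is decoded as a dictionary.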

It's a bit dated, but this post from 2022 outlines some of the techniques supported by parquet-rs [2], including its support for late materialization.

Kind Regards,

Raphael

[2]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

On 30/12/2025 17:52, Vladimir Ozerov wrote:
Hi,

We are facing a somewhat similar issue at the moment in Rust. We have a very large Parquet data set where the application typically applies one or several equality filters on low-cardinality string columns. The data is organized such that late materialization (i.e. pushing a predicate directly into the Parquet reader to build a selection vector and prune individual pages) is important. We quickly hit a CPU wall, and profiling shows that most of the time is spent on filter evaluation, i.e. comparing strings. Since the filter columns have very low cardinality and are dictionary encoded, an obvious optimization is to map the filter value to a dictionary index and then compare ints instead of strings. But since we cannot extract the dictionary from Parquet in any form, this optimization is largely impossible without forking parquet-rs, which is highly undesirable.

Discussing how to expose dictionary data may lead to multiple overlapping considerations, long discussions, and perhaps format and API changes. So we hope there could be some loophole or small change that could unblock such an optimization without going into a large design/API space. For instance:

    1. Could we introduce a hint to ParquetReader that produces a
    DictionaryArray for a given column instead of a concrete array
    (StringViewArray in our case)?
    2. When doing late materialization, maybe we could extend ArrowPredicate
    so that it first tells the Parquet reader it wants the encoded
    dictionaries, and once they are supplied, returns another predicate
    that will be applied to the encoded data. E.g., "x = some_value"
    translates to "x_encoded = index".

These are just very rough ideas, but we need a solution for this pretty fast because the CPU overhead blocks our project and we cannot change the dataset. So most likely we will end up with some sort of fork. But if the community gives us some direction on what else we could try, we can share our experience and results in several weeks, and hopefully this will help the community build a better solution for this problem.

Regards,
Vladimir


On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]>
wrote:

Hi Pierre,

I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.

There are really two problems being discussed.

1. Arrow's representation of the data (either in memory or on disk).

I think the main complication brought up here is that Arrow's data stream model couples physical and logical type, making it hard to switch between different encodings across batches of data.  I think there are two main parts to resolving this:
     a.  Suggest a modelling in the flatbuffer schema (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and the c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) that disentangles the two of them.
     b.  Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.

  2. How implementations carry this through without breaking existing functionality around Arrow structures.

    a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings.  Rethinking the architecture here might be required.


My take is that "1a" is probably the easy part.  1b and 2a are the hard
parts because there have been a lot of core assumptions baked into
libraries about the structure of  Arrow array.  If we can convince
ourselves that there is a tractable path forward for hard parts I would
guess it would go a long way to convincing skeptics.  The easiest way of
proving or disproving something like this would be trying to prototype
something in at least one of the libraries to see what the complexity looks
like (e.g. in arrow-rs or arrow-cpp which have sizable compute
implementations).

In short, the spec changes themselves are probably not that large, but the downstream implications of them are.  As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.

Cheers,
Micah




On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:

I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, which is where I think this really matters in our case.


It seems that as Parquet adopts more sophisticated, non-data-dependent
encodings like ALP and FSST, there might be a risk of Arrow becoming a
bottleneck if the format requires immediate "densification" into standard
arrays.

I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself, essentially a pass-through or a reorganization, might not strictly require that expansion.

I (think) I understand the concerns with adding more types and late
materialisation concepts.

However, I'm curious if there is a path for Arrow to evolve that would allow for this kind of efficiency while avoiding said complexity, perhaps by:
   * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding, with the onus of compute left to the user.
   * Considering whether "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.


If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
I'd be very interested to hear whether this is a direction the community feels is viable, or whether these optimizations are viewed as being better handled strictly within the compute engines themselves.

I am (still) quite new to Arrow and still catching up on the nuances of
the specification, so please excuse any naive assumptions in my
reasoning.
Best!

On 2025/12/18 17:55:58 Weston Pace wrote:
the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type

Agreed.  My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDB) the implementations are still rather limited.  Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help.  Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.

The definition of any logical type can only be coupled to a compute system.

I guess I disagree here.  We did this once already in Substrait.  I think there is value in an interoperable agreement of what constitutes a logical type.  Maybe it can live in Substrait then, since there is also a growing collection there of compute function definitions and interpretations.

Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.

Why?  My semi-formal internal definition has been "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that for every function F and every value x we have F(x) = M(F(M(x)))".  Under this definition dictionary and REE are just encodings, while uint8 and uint16 are different types (since they have different results for functions like 100+200).

Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.

This is true regardless.

it will be unrealistic to expect more than one implementation of such system.

Agreed, implementing all the compute kernels is not an Arrow problem.  Though maybe those waters have been muddied in the past since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.

A semi-opaque format could be designed while allowing slicing

Ah, by leaving the buffers untouched but setting the offset?  That makes sense.

Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data

I think we can also add operations like filter and take to that list.

On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:

Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.

  I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.

This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type like "string", or have a smart enough function dispatcher that performs casts on-the-fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).

That is what makes introducing these ideas to Arrow so tricky. Arrow, being the data format that shines in interoperability, can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format, interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.

Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.

Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels). It's already challenging enough to get people to stop converting columnar data to row-based arrays. My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.

A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.

Do you think we would come up with a semi-opaque array that could be sliced?  Or that we would introduce the concept of an unsliceable array?

The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.

--
Felipe




On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]>
wrote:

I think this is a very interesting idea.  This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion.  Though it isn't without some challenges.

It seems FSSTStringVector/Array could potentially be modelled as an extension type
...
This would however require a fixed dictionary, so might not be desirable.
...
ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
...
Each batch of values has a different metadata parameter set.

I think these are basically the same problem.  From what I've seen in implementations, a format will typically introduce some kind of small batch concept (I think every 1024 values in the Fast Lanes paper IIRC).  So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow.  If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns, and I don't know if that's always true.

One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.

Pierre brings up another challenge in achieving this goal, which may be more significant.  The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like fast lanes and vortex).  A column might use ALP for one page and then use PLAIN encoding for the next page.  This makes it difficult to represent a stream of data with the typical Arrow schema we have today.  I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.  Still, that can be an orthogonal discussion to FSST and ALP representation.

  We could also experiment with Opaque vectors.

This could be an interesting approach too.  I don't know if they could be entirely opaque though.  Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.  Do you think we would come up with a semi-opaque array that could be sliced?  Or that we would introduce the concept of an unsliceable array?


On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]>
wrote:
Hi all,

I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.

Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.

As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.

Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.

This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.

Thanks !

Pierre

On 2025/12/11 06:49:30 Micah Kornfield wrote:
I think this is an interesting idea.  Julien, do you have a proposal for scope?  Is the intent to be 1:1 with any new encoding that is added to Parquet?  For instance, would the intent be to also put cascading encodings in Arrow?

We could also experiment with Opaque vectors.


Did you mean this as a new type? I think this would be necessary for ALP.
It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.

ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.

1.  There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
2.  Each batch of values has a different metadata parameter set.

So it seems there is no easy way out for the ALP encoding and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" Type, which would have a buffer that is only interpretable based on its extension type.  An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out.  The latter also has its obvious down-sides.

Cheers,
Micah

[1]
https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
[2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf

On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.

On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
Hello,
Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
There are several advantages to this:
- For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
- Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]; a third party blog argues it's a problem as well [5].

So I wanted to start a discussion to suggest we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit packed scheme not too different from what we use for nullability.
We could also experiment with Opaque vectors.

For reference, similarly motivated improvements have been done in the past [6].

Thoughts?

See:
[1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
[2] https://github.com/apache/arrow/pull/48345
[3] https://github.com/apache/arrow/pull/48232
[4] https://docs.vortex.dev/#in-memory
[5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
[6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
