Hi Vladimir,

What you are requesting is already supported in parquet-rs. In particular, if you request a UTF8 or Binary DictionaryArray for the column, it will decode the column preserving the dictionary encoding. You can override the embedded arrow schema, if any, using ArrowReaderOptions::with_schema [1]. Provided a RecordBatch does not span row groups, and therefore dictionaries (the async reader never produces such batches), this should never materialize the dictionary. FWIW the ViewArray decoders will also preserve the dictionary encoding; however, the dictionary-encoded nature is less explicit in the resulting arrays.
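
For example, a rough, untested sketch with the synchronous reader, assuming a file containing a single nullable utf8 column named "category" (the file name and column name are placeholders); the async ParquetRecordBatchStreamBuilder takes the same ArrowReaderOptions via new_with_options:

use std::fs::File;
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;

    // Ask for "category" as Dictionary<Int32, Utf8> rather than whatever
    // concrete string type the embedded arrow schema declares.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "category",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]));

    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;

    for batch in reader {
        // Each batch's column is a DictionaryArray whose values buffer comes
        // straight from the parquet dictionary page.
        println!("{:?}", batch?.column(0).data_type());
    }
    Ok(())
}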

As for using integer comparisons to optimise dictionary filtering, you should be able to construct an ArrowPredicate that computes the filter over the dictionary values, caches the result for future use (e.g. using ptr_eq to detect when the dictionary changes), and then filters based on the dictionary keys.
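
A rough, untested sketch of such a predicate, assuming its projection covers a single column that is read as Dictionary<Int32, Utf8> (DictEqPredicate and its fields are made-up names, not an existing API):

use arrow_array::cast::AsArray;
use arrow_array::types::Int32Type;
use arrow_array::{Array, ArrayRef, BooleanArray, RecordBatch, StringArray};
use arrow_schema::ArrowError;
use parquet::arrow::arrow_reader::ArrowPredicate;
use parquet::arrow::ProjectionMask;

struct DictEqPredicate {
    projection: ProjectionMask,
    target: String,
    // Values of the last dictionary seen, plus the filter result per key.
    cache: Option<(ArrayRef, Vec<bool>)>,
}

impl ArrowPredicate for DictEqPredicate {
    fn projection(&self) -> &ProjectionMask {
        &self.projection
    }

    fn evaluate(&mut self, batch: RecordBatch) -> Result<BooleanArray, ArrowError> {
        let dict = batch.column(0).as_dictionary::<Int32Type>();
        let values = dict.values();

        // Re-evaluate the comparison against the dictionary values only when
        // the dictionary changes, detected via buffer pointer equality.
        let stale = match &self.cache {
            Some((cached, _)) => !cached.to_data().ptr_eq(&values.to_data()),
            None => true,
        };
        if stale {
            let strings: &StringArray = values.as_string();
            let matches = (0..strings.len())
                .map(|i| strings.is_valid(i) && strings.value(i) == self.target)
                .collect();
            self.cache = Some((values.clone(), matches));
        }

        // Per row this is an integer key lookup, no string comparison.
        let matches = &self.cache.as_ref().unwrap().1;
        Ok(dict
            .keys()
            .iter()
            .map(|k| k.map(|k| matches[k as usize]))
            .collect())
    }
}

You would then register it on the reader builder with something like builder.with_row_filter(RowFilter::new(vec![Box::new(predicate)])), combined with the schema override above so the filter column is decoded as a dictionary.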

It's a bit dated, but this post from 2022 outlines some of the techniques supported by parquet-rs [2], including its support for late materialization.

Kind Regards,

Raphael

[2]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

On 30/12/2025 17:52, Vladimir Ozerov wrote:
Hi,

We are facing a somewhat similar issue at the moment in Rust. We have a very large Parquet data set where the application typically applies one or several equality filters on low-cardinality string columns. The data is organized such that late materialization (i.e. pushing a predicate directly into the Parquet reader to build a selection vector and prune individual pages) is important. We quickly hit a CPU wall, and profiling shows that most of the time is spent on filter evaluation, i.e. comparing strings. Since the filter columns have very low cardinality and are dictionary encoded, an obvious optimization is to map the filter value to a dictionary index and then compare ints instead of strings. But since we cannot extract the dictionary from Parquet in any form, this optimization is largely impossible without forking parquet-rs, which is highly undesirable.

Discussing how to expose dictionary data may lead to multiple overlapping considerations, long discussions, and perhaps format and API changes. So we hope there could be some loophole or small change that could unblock such an optimization without going into a large design/API space. For instance:

    1. Could we introduce a hint to ParquetReader that produces a
    DictionaryArray for a given column instead of a concrete array
    (StringViewArray in our case)?
    2. When doing late materialization, maybe we could extend ArrowPredicate
    so that it first tells the Parquet reader it wants the encoded
    dictionaries, and once they are supplied, returns another predicate
    that will be applied to the encoded data. E.g., "x = some_value"
    translates to "x_encoded = index".

These are just very rough ideas, but we need a solution for this pretty fast because the CPU overhead blocks our project and we cannot change the dataset. So most likely we will end up with some sort of fork. But if the community gives us some direction on what else we could try, we can share our experience and results in several weeks, and hopefully this will help the community build a better solution for this problem.

Regards,
Vladimir


On Fri, Dec 19, 2025 at 4:38 AM Micah Kornfield <[email protected]>
wrote:

Hi Pierre,

I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.

There are really two problems being discussed.

1. Arrow's representation of the data (either in memory or on disk).

I think the main complication brought up here is that Arrow's data stream model couples physical and logical type, making it hard to switch between different encodings across batches of data.  I think there are two main parts to resolving this:
     a.  Suggest a modelling in the flatbuffer schema (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and the c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) that disentangles the two of them.
     b.  Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.

  2. How implementations carry this through without breaking existing functionality around Arrow structures.

    a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings.  Rethinking the architecture here might be required.


My take is that "1a" is probably the easy part.  1b and 2a are the hard
parts because there have been a lot of core assumptions baked into
libraries about the structure of  Arrow array.  If we can convince
ourselves that there is a tractable path forward for hard parts I would
guess it would go a long way to convincing skeptics.  The easiest way of
proving or disproving something like this would be trying to prototype
something in at least one of the libraries to see what the complexity looks
like (e.g. in arrow-rs or arrow-cpp which have sizable compute
implementations).

In short, the spec changes themselves are probably not that large, but the downstream implications of them are.  As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.

Cheers,
Micah




On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:

I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, which is where I think this really matters in our case.


It seems that as Parquet adopts more sophisticated, non-data-dependent
encodings like ALP and FSST, there might be a risk of Arrow becoming a
bottleneck if the format requires immediate "densification" into standard
arrays.

I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself, essentially a pass-through or a reorganization, might not strictly require that expansion.

I (think) I understand the concerns with adding more types and late
materialisation concepts.

However, I'm curious if there is a path for Arrow to evolve that would allow for this kind of efficiency while avoiding said complexity, perhaps by:
   * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding, with the onus of compute left to the user.
   * Considering whether "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.


If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
I'd be very interested to hear whether this is a direction the community feels is viable, or whether these optimizations are viewed as being better handled strictly within the compute engines themselves.

I am (still) quite new to Arrow and still catching up on the nuances of
the specification, so please excuse any naive assumptions in my
reasoning.
Best!

On 2025/12/18 17:55:58 Weston Pace wrote:
the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type

Agreed.  My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDB) the implementations are still rather limited.  Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help.  Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.

The definition of any logical type can only be coupled to a compute system.

I guess I disagree here.  We did this once already in Substrait.  I think there is value in an interoperable agreement of what constitutes a logical type.  Maybe it can live in Substrait then, since there is also a growing collection there of compute function definitions and interpretations.

Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.

Why?  My semi-formal internal definition has been "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that for every function F and every value x we have F(x) = M(F(M(x)))".  Under this definition dictionary and REE are just encodings, while uint8 and uint16 are different types (since they have different results for functions like 100+200).

Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.

This is true regardless.

it will be unrealistic to expect more than one implementation of such system.

Agreed, implementing all the compute kernels is not an Arrow problem.  Though maybe those waters have been muddied in the past since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.

A semi-opaque format could be designed while allowing slicing

Ah, by leaving the buffers untouched but setting the offset?  That makes sense.

Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data

I think we can also add operations like filter and take to that list.

On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:

Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.

  I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.

This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type like "string", or have a smart enough function dispatcher that performs casts on-the-fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).

That is what makes introducing these ideas to Arrow so tricky. Arrow, being the data format that shines in interoperability, can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format, interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.

Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.

Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels). It's already challenging enough to get people to stop converting columnar data to row-based arrays. My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.

A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.

Do you think we would come up with a semi-opaque array that could be sliced?  Or that we would introduce the concept of an unsliceable array?

The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.

--
Felipe




On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]>
wrote:

I think this is a very interesting idea.  This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion.  Though it isn't without some challenges.

It seems FSSTStringVector/Array could potentially be modelled as an extension type
...
This would however require a fixed dictionary, so might not be desirable.
...
ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
...
Each batch of values has a different metadata parameter set.

I think these are basically the same problem.  From what I've seen in implementations, a format will typically introduce some kind of small batch concept (I think every 1024 values in the Fast Lanes paper IIRC).  So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow.  If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns, and I don't know if that's always true.

One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.

Pierre brings up another challenge in achieving this goal, which may be more significant.  The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like fast lanes and vortex).  A column might use ALP for one page and then use PLAIN encoding for the next page.  This makes it difficult to represent a stream of data with the typical Arrow schema we have today.  I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.  Still, that can be an orthogonal discussion to FSST and ALP representation.

  We could also experiment with Opaque vectors.

This could be an interesting approach too.  I don't know if they could be entirely opaque though.  Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.  Do you think we would come up with a semi-opaque array that could be sliced?  Or that we would introduce the concept of an unsliceable array?


On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]>
wrote:
Hi all,

I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.

Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.

As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.

Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.

This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.

Thanks !

Pierre

On 2025/12/11 06:49:30 Micah Kornfield wrote:
I think this is an interesting idea.  Julien, do you have a proposal for scope?  Is the intent to be 1:1 with any new encoding that is added to Parquet?  For instance, would the intent be to also put cascading encodings in Arrow?

We could also experiment with Opaque vectors.


Did you mean this as a new type? I think this would be necessary for ALP.
It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.

ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.

1.  There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
2.  Each batch of values has a different metadata parameter set.

So it seems there is no easy way out for the ALP encoding and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" Type, which would have a buffer that is only interpretable based on its extension type.  An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out.  The latter also has its obvious down-sides.

Cheers,
Micah

[1]
https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
[2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf

On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.

On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
Hello,
Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
There are several advantages to this:
- For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
- Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]; a third party blog argues it's a problem as well [5].

So I wanted to start a discussion to suggest we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit packed scheme not too different from what we use for nullability.
We could also experiment with Opaque vectors.

For reference, similarly motivated improvements have been done in the past [6].

Thoughts?

See:
[1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
[2] https://github.com/apache/arrow/pull/48345
[3] https://github.com/apache/arrow/pull/48232
[4] https://docs.vortex.dev/#in-memory
[5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
[6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
