> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts because there have been a lot of core assumptions baked into libraries about the structure of Arrow array.
I think these are impossible-to-solve problems, and they have the very real implication that Arrow becomes less effective as an interoperability format. This whole discussion assumes Arrow is a bottleneck to Parquet evolution, but there is no reason to constrain Parquet libraries to *always and only produce Arrow arrays*. Parquet libraries should provide lower-level APIs that produce some *Parquet-specific form of in-memory arrays decoupled from the Arrow spec.* The main API, the one that produces Arrow arrays, turns these into Arrow arrays more eagerly. Users of the lower-level API can deal directly with the Parquet-specific in-memory arrays.

--
Felipe

On Thu, Dec 18, 2025 at 10:38 PM Micah Kornfield <[email protected]> wrote:

> Hi Pierre,
>
> I'm somewhat expanding on Weston's and Felipe's points above, but wanted to make it explicit.
>
> There are really two problems being discussed.
>
> 1. Arrow's representation of the data (either in memory or on disk).
>
> I think the main complication brought up here is that Arrow's data stream model couples physical and logical types, making it hard to switch to different encodings between different batches of data. I think there are two main parts to resolving this:
> a. Suggest a modelling in the flatbuffer model (e.g. https://github.com/apache/arrow/blob/main/format/Schema.fbs) and c-ffi (https://arrow.apache.org/docs/format/CDataInterface.html) to disentangle the two of them.
> b. Update libraries to accommodate the decoupling, so data can be exchanged with the new modelling.
>
> 2. How implementations carry this through and don't end up breaking existing functionality around Arrow structures.
>
> a. As Felipe pointed out, I think at least a few implementations of operations don't scale well to adding multiple different encodings. Rethinking the architecture here might be required.
>
> My take is that "1a" is probably the easy part. 1b and 2a are the hard parts because there have been a lot of core assumptions baked into libraries about the structure of Arrow arrays. If we can convince ourselves that there is a tractable path forward for the hard parts, I would guess it would go a long way to convincing skeptics. The easiest way of proving or disproving something like this would be trying to prototype something in at least one of the libraries to see what the complexity looks like (e.g. in arrow-rs or arrow-cpp, which have sizable compute implementations).
>
> In short, the spec changes themselves are probably not super large, but the downstream implications of them are. As you and others have pointed out, I think it is worth trying to tackle this issue for the long-term viability of the project, but I'm concerned there might not be enough developer bandwidth to make it happen.
>
> Cheers,
> Micah
>
> On Thu, Dec 18, 2025 at 11:54 AM Pierre Lacave <[email protected]> wrote:
>
> > I've been following this thread with interest and wanted to share a more specific concern regarding the potential overhead of data expansion, where I think this really matters in our case.
> >
> > It seems that as Parquet adopts more sophisticated, non-data-dependent encodings like ALP and FSST, there might be a risk of Arrow becoming a bottleneck if the format requires immediate "densification" into standard arrays.
> >
> > I am particularly thinking about use cases like compaction or merging of multiple files. In these scenarios, it feels like we might be paying a significant CPU and memory tax to expand encoded Parquet data into a flat Arrow representation, even if the operation itself is essentially a pass-through or a reorganization and might not strictly require that expansion.
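To make the suggestion above concrete, here is a minimal sketch of what a lower-level, Parquet-specific reader API could look like for the compaction case Pierre describes. The names (EncodedPage, LowLevelColumnReader, LowLevelColumnWriter) are hypothetical and are not part of the parquet or arrow-rs crates.

    // Hypothetical sketch: none of these types exist in the `parquet` or `arrow`
    // crates. They only illustrate the shape of a lower-level API that hands out
    // decompressed-but-still-encoded pages instead of Arrow arrays.

    /// A decompressed but still-encoded Parquet page: an encoding tag plus the
    /// raw buffers and the per-page parameters needed to decode them later.
    pub struct EncodedPage {
        pub encoding: String,       // e.g. "PLAIN", "ALP", "FSST"
        pub metadata: Vec<u8>,      // per-page parameters (e.g. ALP exponent/factor)
        pub buffers: Vec<Vec<u8>>,  // encoded values, offsets, symbol tables, ...
        pub num_values: usize,
    }

    pub trait LowLevelColumnReader {
        /// Lower-level API: callers that understand the encodings can consume
        /// pages without densifying them into Arrow arrays.
        fn next_page(&mut self) -> Option<EncodedPage>;
    }

    pub trait LowLevelColumnWriter {
        fn write_page(&mut self, page: EncodedPage);
    }

    /// Compaction as a pass-through: pages flow from readers to the writer
    /// without being expanded into a flat Arrow representation.
    pub fn compact(
        inputs: Vec<Box<dyn LowLevelColumnReader>>,
        output: &mut dyn LowLevelColumnWriter,
    ) -> usize {
        let mut rows = 0;
        for mut reader in inputs {
            while let Some(page) = reader.next_page() {
                rows += page.num_values;
                output.write_page(page);
            }
        }
        rows
    }

The Arrow-producing API would sit on top of this and simply decode every page eagerly, so consumers that only speak Arrow never see EncodedPage.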
> > I (think) I understand the concerns with adding more types and late materialisation concepts.
> >
> > However I'm curious if there is a path for Arrow to evolve that would allow for this kind of efficiency while avoiding said complexity, perhaps by:
> > * Potentially exploring a negotiation mechanism where the consumer can opt out of full "densification" if it understands the underlying encoding - the onus of compute left to the user.
> > * Considering if "semi-opaque" or encoded vectors could be supported for operations that don't require full decoding, such as simple slices or merges.
> >
> > If the goal of Arrow is to remain the gold standard for interoperability, I wonder if it might eventually need to account for these "on-encoded data" patterns to keep pace with the efficiencies found in modern storage formats.
> >
> > I'd be very interested to hear if this is a direction the community feels is viable or if these optimizations are viewed as being better handled strictly within the compute engines themselves.
> >
> > I am (still) quite new to Arrow and still catching up on the nuances of the specification, so please excuse any naive assumptions in my reasoning.
> >
> > Best!
> >
> > On 2025/12/18 17:55:58 Weston Pace wrote:
> > > > the real challenge is having compute kernels that are complete
> > > > so they can be flexible enough to handle all the physical forms
> > > > of a logical type
> > >
> > > Agreed. My experience has also been that even when compute systems claim to solve this problem (e.g. DuckDb) the implementations are still rather limited. Part of the problem is also, I think, that only a small fraction of cases are going to be compute bound in a way that this sort of optimization will help. Beyond major vendors (e.g. Databricks, Snowflake, Meta, ...) the scale just isn't there to justify the cost of development and maintenance.
> > >
> > > > The definition of any logical type can only be coupled to a compute system.
> > >
> > > I guess I disagree here. We did this once already in Substrait. I think there is value in an interoperable agreement of what constitutes a logical type. Maybe it can live in Substrait then since there is also a growing collection there of compute function definitions and interpretations.
> > >
> > > > Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today.
> > >
> > > Why? My semi-formal internal definition has been "We say types T1 and T2 are the same logical type if there exists a bidirectional 1:1 mapping function M from T1 to T2 such that given every function F and every value x we have F(x) = M(F(M(x)))". Under this definition dictionary and REE are just encodings while uint8 and uint16 are different types (since they have different results for functions like 100+200).
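A toy, self-contained illustration of the equivalence rule Weston sketches above, with dictionary encoding playing the role of the mapping M. None of this is Arrow or Substrait API; it only shows why dictionary encoding passes the rule while uint8/uint16 do not.

    // Dictionary encoding as M, dict_decode as its inverse.
    fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<usize>) {
        let mut dict: Vec<String> = Vec::new();
        let keys: Vec<usize> = values
            .iter()
            .map(|v| match dict.iter().position(|d| d.as_str() == *v) {
                Some(i) => i,
                None => {
                    dict.push(v.to_string());
                    dict.len() - 1
                }
            })
            .collect();
        (dict, keys)
    }

    fn dict_decode(dict: &[String], keys: &[usize]) -> Vec<String> {
        keys.iter().map(|&k| dict[k].clone()).collect()
    }

    fn main() {
        let x = ["arrow", "parquet", "arrow"];

        // F = uppercase. Applying F to the encoded form (the dictionary alone)
        // and mapping back gives the same answer as applying F directly, so
        // dictionary encoding is "just an encoding" of the string type.
        let direct: Vec<String> = x.iter().map(|s| s.to_uppercase()).collect();
        let (dict, keys) = dict_encode(&x);
        let upper_dict: Vec<String> = dict.iter().map(|s| s.to_uppercase()).collect();
        let via_encoding = dict_decode(&upper_dict, &keys);
        assert_eq!(direct, via_encoding);

        // uint8 vs uint16: the "same" function gives different answers
        // (100 + 200 wraps for u8), so they are different logical types.
        assert_eq!(100u8.wrapping_add(200), 44);
        assert_eq!(100u16 + 200, 300);
    }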
> > > > Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet.
> > >
> > > This is true regardless
> > >
> > > > it will be unrealistic to expect more than one implementation of such a system.
> > >
> > > Agreed, implementing all the compute kernels is not an Arrow problem. Though maybe those waters have been muddied in the past since Arrow implementations like arrow-cpp and arrow-rs have historically come with a collection of compute functions.
> > >
> > > > A semi-opaque format could be designed while allowing slicing
> > >
> > > Ah, by leaving the buffers untouched but setting the offset? That makes sense.
> > >
> > > > Things get trickier when concatenating arrays ... and exporting arrays ... without leaking data
> > >
> > > I think we can also add operations like filter and take to that list.
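A rough sketch of the semi-opaque-but-sliceable idea discussed above; SemiOpaqueArray is a made-up type, not anything in arrow-rs or the Arrow spec.

    // Hypothetical "semi-opaque but sliceable" array: the buffers are treated
    // as a black box, but offset and length are first-class, so slicing works
    // much like it does for String/BinaryView (the buffers keep riding along).
    use std::sync::Arc;

    pub struct SemiOpaqueArray {
        pub encoding: &'static str,     // e.g. "FSST" or "ALP"
        pub buffers: Vec<Arc<Vec<u8>>>, // opaque, shared, never copied on slice
        pub offset: usize,              // logical start within the encoded data
        pub len: usize,                 // logical number of values
    }

    impl SemiOpaqueArray {
        /// Zero-copy slice: only offset/len change; the encoded buffers are
        /// untouched, so no knowledge of the encoding is required.
        pub fn slice(&self, offset: usize, len: usize) -> Self {
            assert!(offset + len <= self.len);
            Self {
                encoding: self.encoding,
                buffers: self.buffers.clone(), // Arc bump, no data copy
                offset: self.offset + offset,
                len,
            }
        }

        // Concatenation and C Data Interface export are where this breaks down:
        // both are expected to drop data outside the slice, which does require
        // understanding the encoding instead of treating the buffers as opaque.
    }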
> > > On Tue, Dec 16, 2025 at 6:59 PM Felipe Oliveira Carvalho <[email protected]> wrote:
> > >
> > > > Please don't interpret this as a harsh comment, I mean well and don't speak for the whole community.
> > > >
> > > > > I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation.
> > > >
> > > > This separation between logical and physical gets mentioned a lot, but even if the type-representation problem is solved, the real challenge is having compute kernels that are complete so they can be flexible enough to handle all the physical forms of a logical type like "string" or have a smart enough function dispatcher that performs casts on-the-fly as a non-ideal fallback (e.g. expanding FSST-encoded strings to a more basic and widely supported format when the compute kernel is more naive).
> > > >
> > > > That is what makes introducing these ideas to Arrow so tricky. Arrow being the data format that shines in interoperability can't have unbounded complexity. The definition of any logical type can only be coupled to a compute system. Datafusion could define all the encodings that form a logical type, but it's really hard to specify what a logical type is meant to be on every compute system using Arrow. Even saying that the "string" type should include REE-encoded and dictionary-encoded strings is a stretch today. Things start to break when you start exporting arrays to more naive compute systems that don't support these encodings yet. If you care less about multiple implementations of a format, interoperability, and can provide all compute functions in your closed system, then you can greatly expand the formats and encodings you support. But it will be unrealistic to expect more than one implementation of such a system.
> > > >
> > > > > Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type.
> > > >
> > > > Exactly. There are many simplifying assumptions that can be made when one gets an Arrow RecordBatch and wants to do something with it directly (i.e. without using a library of compute kernels). It's already challenging enough to get people to stop converting columnar data to row-based arrays.
> > > >
> > > > My recommendation is that we start thinking about proposing formats and logical types to "compute systems" and not to the "arrow data format". IMO "late materialization" doesn't make sense as an Arrow specification, unless it's a series of widely useful canonical extension types expressible through other storage types like binary or fixed-size-binary.
> > > >
> > > > A compute system (e.g. Datafusion) that intends to implement late materialization would have to expand operand representation a bit to take in non-materialized data handles in ways that aren't expressible in the open Arrow format.
> > > >
> > > > > Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> > > >
> > > > The String/BinaryView format, when sliced, doesn't necessarily stop carrying the data buffers. A semi-opaque format could be designed while allowing slicing. Things get trickier when concatenating arrays (which is also an operation that is meant to discard unnecessary data when it reallocates buffers) and exporting arrays through the C Data Interface without leaking data that is not necessary in a slice.
> > > >
> > > > --
> > > > Felipe
> > > >
> > > > On Thu, Dec 11, 2025 at 10:33 AM Weston Pace <[email protected]> wrote:
> > > >
> > > > > I think this is a very interesting idea. This could potentially open up the door for things like adding compute kernels for these compressed representations to Arrow or Datafusion. Though it isn't without some challenges.
> > > > >
> > > > > > It seems FSSTStringVector/Array could potentially be modelled
> > > > > > as an extension type
> > > > > > ...
> > > > > > This would however require a fixed dictionary, so might not
> > > > > > be desirable.
> > > > > > ...
> > > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging
> > > > > > to represent as extension types.
> > > > > > ...
> > > > > > Each batch of values has a different metadata parameter set.
> > > > >
> > > > > I think these are basically the same problem. From what I've seen in implementations a format will typically introduce some kind of small batch concept (I think every 1024 values in the Fast Lanes paper IIRC). So either we need individual record batches for each small batch (in which case the Arrow representation is more straightforward but batches are quite small) or we need some concept of a batched array in Arrow. If we want individual record batches for each small batch, that requires the batch sizes to be consistent (in number of rows) between columns and I don't know if that's always true.
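For what a "batched array" with per-small-batch parameters might look like, here is a hypothetical sketch (the types and the simplified ALP-style reconstruction are made up); it mainly shows why per-batch metadata does not fit a schema that fixes one physical layout for the whole stream.

    // Hypothetical "batched array": one logical float64 column made of small
    // mini-batches (e.g. 1024 values each), where every mini-batch carries its
    // own encoding and parameters. None of this exists in arrow-rs.
    pub enum MiniBatch {
        Plain(Vec<f64>),
        /// ALP-style: integers plus per-batch decoding parameters.
        Alp { ints: Vec<i64>, exponent: i32, factor: i32 },
    }

    pub struct BatchedF64Array {
        pub batches: Vec<MiniBatch>, // each batch covers up to ~1024 logical values
    }

    impl BatchedF64Array {
        pub fn len(&self) -> usize {
            self.batches
                .iter()
                .map(|b| match b {
                    MiniBatch::Plain(v) => v.len(),
                    MiniBatch::Alp { ints, .. } => ints.len(),
                })
                .sum()
        }

        /// Densify into a flat vector: the step a consumer that only understands
        /// plain float64 arrays would have to perform.
        pub fn to_dense(&self) -> Vec<f64> {
            let mut out = Vec::with_capacity(self.len());
            for b in &self.batches {
                match b {
                    MiniBatch::Plain(v) => out.extend_from_slice(v),
                    MiniBatch::Alp { ints, exponent, factor } => {
                        // Simplified stand-in for the real ALP reconstruction.
                        let scale = 10f64.powi(*factor) / 10f64.powi(*exponent);
                        out.extend(ints.iter().map(|&i| i as f64 * scale));
                    }
                }
            }
            out
        }
    }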
> > > > > > One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> > > > >
> > > > > > Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding.
> > > > >
> > > > > Pierre brings up another challenge in achieving this goal, which may be more significant. The compression and encoding techniques typically vary from page to page within Parquet (this is even more true in formats like FastLanes and Vortex). A column might use ALP for one page and then use PLAIN encoding for the next page. This makes it difficult to represent a stream of data with the typical Arrow schema we have today. I think we would need a "semantic schema" or "logical schema" which indicates the logical type but not the physical representation. Still, that can be an orthogonal discussion to FSST and ALP representation.
> > > > >
> > > > > > We could also experiment with Opaque vectors.
> > > > >
> > > > > This could be an interesting approach too. I don't know if they could be entirely opaque though. Arrow users typically expect to be able to perform operations like "slice" and "take" which require some knowledge of the underlying type. Do you think we would come up with a semi-opaque array that could be sliced? Or that we would introduce the concept of an unsliceable array?
> > > > >
> > > > > On Thu, Dec 11, 2025 at 5:27 AM Pierre Lacave <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am relatively new to this space, so I apologize if I am missing some context or history here. I wanted to share some observations from what I see happening with projects like Vortex.
> > > > > >
> > > > > > Vortex seems to show that it is possible to support advanced encodings (like ALP, FSST, or others) by separating the logical type from the physical encoding. If the consumer engine supports the advanced encoding, it stays compressed and fast. If not, the data is "canonicalized" to standard Arrow arrays at the edge.
> > > > > >
> > > > > > As Parquet adopts these novel encodings, the current Arrow approach forces us to "densify" or decompress data immediately, even if the engine could have operated on the encoded data.
> > > > > >
> > > > > > Is there a world where Arrow could offer some sort of negotiation mechanism? The goal would be to guarantee the data can always be read as standard "safe" physical types (paying a cost only at the boundary), while allowing systems that understand the advanced encoding to let the data flow through efficiently.
> > > > > >
> > > > > > This sounds like it keeps the safety of the interoperability - Arrow making sure new encodings have a canonical representation - and it leaves the onus of implementing the efficient flow to the consumer - decoupling efficiency from interoperability.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Pierre
> > > > > >
> > > > > > On 2025/12/11 06:49:30 Micah Kornfield wrote:
> > > > > > > I think this is an interesting idea. Julien, do you have a proposal for scope? Is the intent to be 1:1 with any new encoding that is added to Parquet? For instance, would the intent be to also put cascading encodings in Arrow?
> > > > > > >
> > > > > > > > We could also experiment with Opaque vectors.
> > > > > > >
> > > > > > > Did you mean this as a new type? I think this would be necessary for ALP.
> > > > > > >
> > > > > > > It seems FSSTStringVector/Array could potentially be modelled as an extension type (dictionary stored as part of the type metadata?) on top of a byte array. This would however require a fixed dictionary, so might not be desirable.
> > > > > > >
> > > > > > > ALPFloatingPointVector and bit-packed vectors/arrays are more challenging to represent as extension types.
> > > > > > > 1. There is no natural alignment with any of the existing types (and the bit-packing width can effectively vary by batch).
> > > > > > > 2. Each batch of values has a different metadata parameter set.
> > > > > > >
> > > > > > > So it seems there is no easy way out for the ALP encoding and we either need to pay the cost of adding a new type (which is not necessarily trivial) or we would have to do some work to literally make a new opaque "Custom" Type, which would have a buffer that is only interpretable based on its extension type. An easy way of shoe-horning this in would be to add a ParquetScalar extension type, which simply contains the decompressed but encoded Parquet page with repetition and definition levels stripped out. The latter also has its obvious downsides.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Micah
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L160
> > > > > > > [2] https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf
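A rough sketch of the extension-type idea from the email above: the symbol table would be serialized into the extension metadata (e.g. under the ARROW:extension:metadata key) and the compressed bytes stored in a plain binary column. The FsstSymbolTable type is hypothetical, though the escape-code behaviour follows the FSST design.

    /// FSST symbol table: codes 0..=254 map to short byte sequences, code 255
    /// escapes a single literal byte. Hypothetical type, not an existing
    /// canonical extension type.
    pub struct FsstSymbolTable {
        pub symbols: Vec<Vec<u8>>, // at most 255 entries
    }

    impl FsstSymbolTable {
        /// Decode one compressed value (assumes well-formed input).
        pub fn decode(&self, compressed: &[u8]) -> Vec<u8> {
            let mut out = Vec::new();
            let mut i = 0;
            while i < compressed.len() {
                match compressed[i] {
                    255 => {
                        out.push(compressed[i + 1]); // escaped literal byte
                        i += 2;
                    }
                    code => {
                        out.extend_from_slice(&self.symbols[code as usize]);
                        i += 1;
                    }
                }
            }
            out
        }
    }

    fn main() {
        // "ar" and "row" as symbols 0 and 1; 'X' is not in the table, so it is escaped.
        let table = FsstSymbolTable {
            symbols: vec![b"ar".to_vec(), b"row".to_vec()],
        };
        let compressed = [0u8, 1, 255, b'X'];
        assert_eq!(table.decode(&compressed), b"arrowX".to_vec());
    }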
> > > > > > >
> > > > > > > On Wed, Dec 10, 2025 at 5:44 PM Julien Le Dem <[email protected]> wrote:
> > > > > > >
> > > > > > > > I forgot to mention that those encodings have the particularity of allowing random access without decoding previous values.
> > > > > > > >
> > > > > > > > On Wed, Dec 10, 2025 at 5:40 PM Julien Le Dem <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > > Parquet is in the process of adopting new encodings [1] (currently in POC stage), specifically ALP [2] and FSST [3].
> > > > > > > > > One of the discussion items is to allow late materialization: to allow keeping data in encoded format beyond the filter stage (for example in Datafusion).
> > > > > > > > > There are several advantages to this:
> > > > > > > > > - For example, if I summarize FSST as a variation of dictionary encoding on substrings in the values, one can evaluate some operations on encoded values without decoding them, saving memory and CPU.
> > > > > > > > > - Similarly, simplifying for brevity, ALP converts floating point values to small integers that are then bitpacked.
> > > > > > > > > The Vortex project argues that keeping encoded values in in-memory vectors opens up opportunities for performance improvements [4]. A third-party blog argues it's a problem as well [5].
> > > > > > > > >
> > > > > > > > > So I wanted to start a discussion to suggest we might consider adding some additional vectors to support such encoded values, like an FSSTStringVector for example. This would not be too different from the dictionary encoding, or an ALPFloatingPointVector with a bit-packed scheme not too different from what we use for nullability.
> > > > > > > > > We could also experiment with Opaque vectors.
> > > > > > > > >
> > > > > > > > > For reference, similarly motivated improvements have been done in the past [6]
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > See:
> > > > > > > > > [1] https://github.com/apache/parquet-format/tree/master/proposals#active-proposals
> > > > > > > > > [2] https://github.com/apache/arrow/pull/48345
> > > > > > > > > [3] https://github.com/apache/arrow/pull/48232
> > > > > > > > > [4] https://docs.vortex.dev/#in-memory
> > > > > > > > > [5] https://www.polarsignals.com/blog/posts/2025/11/25/interface-parquet-vortex
> > > > > > > > > [6] https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
