Re: [PR] Optimize Dictionary groupings [datafusion]

via GitHub Sun, 14 Jun 2026 19:35:00 -0700


Rich-T-kid commented on PR #21765:
URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4704021677

> > This provides a performance increase in every case where dictionary
> > arrays are expected to be used: low to medium cardinality. If the data is
> > extremely high cardinality, dictionary arrays are the wrong data type. I don't
> > think it's possible to get the best of both worlds in this case.
>
> Is there any way to reduce the overhead so the difference is not as much?

There are spots that can be further optimized. For example, storing a single
byte buffer instead of a vector of vectors would avoid a layer of indirection
and the n-element shifts that occur
[here](https://github.com/apache/datafusion/pull/21765/changes#diff-4515126fc8522ddcc05a5673023bf83367b0c21e31f2933e792d42da8f2d6a63R402).
I've tracked this in a separate
[issue](https://github.com/apache/datafusion/issues/22078), which provides a
bit more detail. That said, this is an optimization on top of the current
approach and won't change the fundamental characteristic of being slower with
high-cardinality data. Do you have ideas on how to further close the gap with
`GroupValueRows`?
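
To make the single-buffer idea concrete, here is a rough sketch of what I have in mind. It is not the code in this PR: `FlatValues` and its methods are hypothetical names used only for illustration, and it assumes values are only ever appended. All values share one contiguous byte buffer, and an offsets vector marks where each value starts and ends, so a lookup is a slice into one allocation rather than a hop through an inner `Vec`.

```rust
/// Hypothetical flat store (illustration only, not the PR's type).
/// Instead of `Vec<Vec<u8>>`, every value lives in one shared byte buffer.
struct FlatValues {
    /// Concatenated bytes of all stored values.
    bytes: Vec<u8>,
    /// Value `i` occupies `bytes[offsets[i]..offsets[i + 1]]`.
    /// Always holds one more entry than the number of values.
    offsets: Vec<usize>,
}

impl FlatValues {
    fn new() -> Self {
        Self {
            bytes: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Appends a value and returns its index.
    /// Existing values are never moved, only the buffer tail grows.
    fn push(&mut self, value: &[u8]) -> usize {
        self.bytes.extend_from_slice(value);
        self.offsets.push(self.bytes.len());
        self.offsets.len() - 2
    }

    /// Returns the bytes of value `idx` as a slice into the shared buffer.
    fn get(&self, idx: usize) -> &[u8] {
        &self.bytes[self.offsets[idx]..self.offsets[idx + 1]]
    }

    /// Number of stored values.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

The trade-off in this sketch is that values cannot be resized or removed in place without rewriting the tail of the buffer, which is fine as long as group values are append-only.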

@alamb
cc @kumarUjjawal

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

