Rich-T-kid commented on PR #21765: URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4704021677
> > This provides a performance increase in every case where dictionary arrays are expected to be used: low to medium cardinality. If the data is extremely high cardinality, dictionary arrays are the wrong data type. I don't think it's possible to get the best of both worlds in this case.
>
> Is there any way to reduce the overhead so the difference is not as much?

There are spots that can be further optimized, for example by storing a single byte buffer instead of a vector of vectors. This avoids a layer of indirection and the n-element shifts that occur [here](https://github.com/apache/datafusion/pull/21765/changes#diff-4515126fc8522ddcc05a5673023bf83367b0c21e31f2933e792d42da8f2d6a63R402). I've tracked this in a separate [issue](https://github.com/apache/datafusion/issues/22078), which provides a bit more detail.

That said, this is an optimization on top of the current approach and won't change the fundamental characteristic of being slower with high-cardinality data. Do you have ideas on how to further close the gap with `GroupValueRows`? @alamb

cc @kumarUjjawal
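
To illustrate the single-byte-buffer idea mentioned above, here is a minimal sketch. It assumes values are only appended and then looked up by index. `FlatValues` and its methods are hypothetical names for illustration, not code from the PR or from DataFusion. Every value lives in one contiguous `Vec<u8>`, and an offsets vector marks where each value starts and ends, much like Arrow's variable-length binary layout.

```rust
/// Hypothetical flat store for variable-length values.
/// Illustration only: this is not the type used in the PR.
pub struct FlatValues {
    /// All values concatenated back to back in one allocation.
    bytes: Vec<u8>,
    /// `offsets[i]..offsets[i + 1]` is the byte range of value `i`.
    /// Always holds one more entry than there are values, starting with 0.
    offsets: Vec<usize>,
}

impl FlatValues {
    pub fn new() -> Self {
        Self {
            bytes: Vec::new(),
            offsets: vec![0],
        }
    }

    /// Appends `value` at the end of the buffer and returns its index.
    /// Existing values are never moved relative to each other.
    pub fn push(&mut self, value: &[u8]) -> usize {
        self.bytes.extend_from_slice(value);
        self.offsets.push(self.bytes.len());
        self.offsets.len() - 2
    }

    /// Returns the bytes of value `idx`, or `None` if `idx` is out of range.
    /// One slice into `bytes`, with no per-value heap allocation to chase.
    pub fn get(&self, idx: usize) -> Option<&[u8]> {
        let start = *self.offsets.get(idx)?;
        let end = *self.offsets.get(idx + 1)?;
        Some(&self.bytes[start..end])
    }

    /// Number of stored values.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

impl Default for FlatValues {
    fn default() -> Self {
        Self::new()
    }
}
```

Compared with a `Vec<Vec<u8>>`, this layout needs two allocations in total rather than one per value, and a lookup is a bounds-checked slice of a single buffer. Whether it removes the shifts in the linked code depends on how that code inserts values, which this sketch does not model.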
