Rich-T-kid commented on PR #21765:
URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4704021677

   > > This provides a performance increase in every case where dictionary 
arrays are expected to be used — low to medium cardinality. If the data is 
extremely high cardinality, dictionary arrays are the wrong data type. I don't 
think it's possible to get the best of both worlds in this case.
   > 
   > Is there any way to reduce the overhead so the difference is not as much?
   
   There are spots that can be optimized further, for example by storing a single 
byte buffer instead of a vector of vectors. This avoids a layer of indirection 
and the n-element shifts that occur 
[here](https://github.com/apache/datafusion/pull/21765/changes#diff-4515126fc8522ddcc05a5673023bf83367b0c21e31f2933e792d42da8f2d6a63R402).
 I've tracked this in a separate 
[issue](https://github.com/apache/datafusion/issues/22078), which provides a 
bit more detail. That said, this is an optimization on top of the current 
approach and won't change the fundamental characteristic of being slower with 
high-cardinality data. Do you have ideas on how to further close the gap with 
`GroupValueRows`?
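
To make the single-buffer idea concrete, here is a minimal sketch of the layout I have in mind. This is not the code in the PR: `FlatBytes` and its methods are hypothetical names used only for illustration, and the real change would have to fit whatever the existing group-values storage needs. All values are appended to one contiguous `Vec<u8>`, with a second vector of offsets marking where each value ends.

```rust
/// Hypothetical flat store (illustrative only, not the PR's type):
/// every value lives in one shared byte buffer instead of its own Vec<u8>.
struct FlatBytes {
    /// Concatenated bytes of all values, in insertion order.
    data: Vec<u8>,
    /// offsets[i]..offsets[i + 1] is the byte range of value i.
    /// Always holds one more entry than the number of values.
    offsets: Vec<usize>,
}

impl FlatBytes {
    fn new() -> Self {
        Self { data: Vec::new(), offsets: vec![0] }
    }

    /// Appends a value to the end of the buffer and returns its index.
    fn push(&mut self, value: &[u8]) -> usize {
        self.data.extend_from_slice(value);
        self.offsets.push(self.data.len());
        self.offsets.len() - 2
    }

    /// Returns the bytes of value `idx` as a slice into the shared buffer.
    fn get(&self, idx: usize) -> &[u8] {
        &self.data[self.offsets[idx]..self.offsets[idx + 1]]
    }

    /// Number of values stored.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

With a `Vec<Vec<u8>>`, each value is a separate heap allocation and every read follows an extra pointer. In the sketch above a lookup is two offset reads plus a slice of one buffer, and appending never moves existing values.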
   
   @alamb 
   cc @kumarUjjawal 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

