[PR] Set aggregation hash seed [datafusion]

via GitHub Fri, 23 May 2025 05:12:40 -0700


ctsk opened a new pull request, #16165:
URL: https://github.com/apache/datafusion/pull/16165


   This PR hard-codes the seed used for hash aggregation. Compared with the 
previous seed, which was determined at runtime, the main benefit is that 
partial aggregation and final aggregation now share the same hash function.
   
   I haven't measured it, but in theory this should make the final aggregation 
step more efficient: the partial aggregation emits the group values in an order 
that is clustered with respect to the final aggregation hash table, which gives 
a beneficial memory access pattern when building the final aggregation.
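
The clustering argument can be illustrated with a small standalone sketch. It is not DataFusion code: `seeded_hash` is a hypothetical SplitMix64-style mixer standing in for the real hasher, and `emit_order` assumes, as a simplification, that the partial stage emits keys in bucket order of a power-of-two table. `backward_jumps` counts how often consecutive inserts into the final table move to a lower bucket index, as a rough proxy for non-sequential memory access.

```rust
/// Hypothetical seeded hash (SplitMix64-style mixer).
/// A stand-in for illustration only, not the hasher DataFusion uses.
fn seeded_hash(seed: u64, key: u64) -> u64 {
    let mut z = (key ^ seed).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Bucket index in a table whose capacity is a power of two.
fn bucket(seed: u64, key: u64, capacity: u64) -> u64 {
    seeded_hash(seed, key) & (capacity - 1)
}

/// Order in which a partial-aggregation table hashed with `seed` would emit
/// its keys, ASSUMING it emits them in bucket order (a simplification).
fn emit_order(seed: u64, keys: &[u64], capacity: u64) -> Vec<u64> {
    let mut out = keys.to_vec();
    // Stable sort by bucket index under the partial stage's seed.
    out.sort_by_key(|&k| bucket(seed, k, capacity));
    out
}

/// Number of consecutive inserts into the final table (hashed with
/// `final_seed`) that land in a lower bucket than the previous insert.
fn backward_jumps(final_seed: u64, emitted: &[u64], capacity: u64) -> usize {
    emitted
        .windows(2)
        .filter(|w| bucket(final_seed, w[1], capacity) < bucket(final_seed, w[0], capacity))
        .count()
}
```

Under these assumptions, `backward_jumps(s, &emit_order(s, &keys, cap), cap)` is zero when both stages use the same seed `s`, because the final table is then filled in non-decreasing bucket order. With a different seed in the final stage the visit order is effectively random. Whether the real operator sees a measurable gain depends on how it actually emits and re-inserts group values, which this sketch does not model.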
   
   I expect it to slightly speed up high-cardinality aggregations that don't 
trigger skipping of the partial aggregation step.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

