[PR] perf: combine_hashes for primitive multi-column rehash [datafusion]

via GitHub Mon, 18 May 2026 22:31:02 -0700


Dandandan opened a new pull request, #22350:
URL: https://github.com/apache/datafusion/pull/22350


   ## Which issue does this PR close?
   
   <!-- No issue yet; happy to file one if preferred. -->
   
   - Closes #.
   
   ## Rationale for this change
   
   When multiple group keys participate in a grouped aggregate, `create_hashes` 
walks the columns left-to-right and rehashes each row by combining the current 
value's hash with the previously accumulated row hash. The generic 
variable-width path (`hash_array`) and the dictionary path 
(`hash_dictionary_array`) already do this with 
`combine_hashes(value.hash_one(random_state), *hash)`. The primitive path 
(`hash_array_primitive`) instead constructs a fresh 
`foldhash::fast::SeedableRandomState` per row, seeded from the previous hash, 
and runs the value through that. That approach is heavier on the hot loop and 
isn't necessary for correctness — the same hash distribution is achievable with 
`combine_hashes`.
   
   The motivating workload is ClickBench q32 on the partitioned dataset:
   
   ```sql
   SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), 
AVG("ResolutionWidth")
   FROM hits
   GROUP BY "WatchID", "ClientIP"
   ORDER BY c DESC
   LIMIT 10
   ```
   
   The two group keys are both primitive (`Int64`/`Int32`), and the partial 
aggregate spends most of its time in group-id calculation. On a current-main 
baseline of **3988.27 ms** (`dfbench clickbench --query 32 --iterations 3`), a 
10-iteration same-machine run with this change settled around **1.7 s**, with a 
warm tail of **1.4-1.5 s** per iteration.
   
   Guardrail single-query runs of other primitive multi-key aggregates (q30, 
q31, q35) and the q18 path that benefits from primitive rehashing as a second 
column did not regress.
   
   ## What changes are included in this PR?
   
   Replaces the two `seeded_state(...).build_hasher() / value.hash_write(...) / 
hasher.finish()` blocks in `hash_array_primitive`'s rehash branch (no-null and 
nullable variants) with the same `combine_hashes(value.hash_one(random_state), 
*hash)` pattern already used by the surrounding generic and dictionary paths. 
`seeded_state` itself is unchanged and still used by the byte-view hash paths.
   
   Doc comment on `hash_array_primitive` updated to match the new behavior.
   
   ## Are these changes tested?
   
   Covered by existing tests in `datafusion-common::hash_utils`, including the 
multi-column hash tests:
   
   - `cargo test -p datafusion-common hash_utils`
   - `cargo clippy -p datafusion-common --all-targets --all-features -- -D 
warnings`
   
   ## Are there any user-facing changes?
   
   No. This is an internal performance change with no API or behavioral impact 
— hash distribution properties are unchanged because `combine_hashes` is 
already the standard pattern in this file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] perf: combine_hashes for primitive multi-column rehash [datafusion]

Reply via email to