Dandandan opened a new pull request, #22350:
URL: https://github.com/apache/datafusion/pull/22350
## Which issue does this PR close?
<!-- No issue yet; happy to file one if preferred. -->
- Closes #.
## Rationale for this change
When multiple group keys participate in a grouped aggregate, `create_hashes`
walks the columns left-to-right and rehashes each row by combining the current
value's hash with the previously accumulated row hash. The generic
variable-width path (`hash_array`) and the dictionary path
(`hash_dictionary_array`) already do this with
`combine_hashes(value.hash_one(random_state), *hash)`. The primitive path
(`hash_array_primitive`) instead constructs a fresh
`foldhash::fast::SeedableRandomState` per row, seeded from the previous hash,
and runs the value through that. That approach is heavier on the hot loop and
isn't necessary for correctness — the same hash distribution is achievable with
`combine_hashes`.
The motivating workload is ClickBench q32 on the partitioned dataset:
```sql
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"),
AVG("ResolutionWidth")
FROM hits
GROUP BY "WatchID", "ClientIP"
ORDER BY c DESC
LIMIT 10
```
The two group keys are both primitive (`Int64`/`Int32`), and the partial
aggregate spends most of its time in group-id calculation. On a current-main
baseline of **3988.27 ms** (`dfbench clickbench --query 32 --iterations 3`), a
10-iteration same-machine run with this change settled around **1.7 s**, with a
warm tail of **1.4-1.5 s** per iteration.
Guardrail single-query runs of other primitive multi-key aggregates (q30,
q31, q35) and the q18 path that benefits from primitive rehashing as a second
column did not regress.
## What changes are included in this PR?
Replaces the two `seeded_state(...).build_hasher() / value.hash_write(...) /
hasher.finish()` blocks in `hash_array_primitive`'s rehash branch (no-null and
nullable variants) with the same `combine_hashes(value.hash_one(random_state),
*hash)` pattern already used by the surrounding generic and dictionary paths.
`seeded_state` itself is unchanged and still used by the byte-view hash paths.
Doc comment on `hash_array_primitive` updated to match the new behavior.
## Are these changes tested?
Covered by existing tests in `datafusion-common::hash_utils`, including the
multi-column hash tests:
- `cargo test -p datafusion-common hash_utils`
- `cargo clippy -p datafusion-common --all-targets --all-features -- -D
warnings`
## Are there any user-facing changes?
No. This is an internal performance change with no API or behavioral impact
— hash distribution properties are unchanged because `combine_hashes` is
already the standard pattern in this file.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]