Dandandan opened a new pull request, #22352: URL: https://github.com/apache/datafusion/pull/22352
## Which issue does this PR close? <!-- No issue yet; happy to file one if preferred. --> - Closes #. ## Rationale for this change When multiple group keys participate in a grouped aggregate, `create_hashes` walks the columns left-to-right and rehashes each row by combining the current value's hash with the previously accumulated row hash. The generic variable-width path (`hash_array`) and the dictionary path (`hash_dictionary_array`) already do this with `combine_hashes(value.hash_one(random_state), *hash)`. The byte-view paths (`hash_string_view_array_inner` and the no-buffers rehash arm of `hash_generic_byte_view_array`) instead construct a fresh `foldhash::fast::SeedableRandomState` per row, seeded from the previous hash, and run the value through that. That approach is heavier on the hot loop and isn't necessary for correctness — the same hash distribution is achievable with `combine_hashes`. The motivating workloads are ClickBench q16 and q17, which both group by `UserID, SearchPhrase`. `SearchPhrase` is a `Utf8View`, so the second column rehashes through `hash_string_view_array_inner` for every row. On current-main baselines of **1106.69 ms / 1127.64 ms** (`dfbench clickbench --query 16 --iterations 3` / `--query 17 --iterations 3`, partitioned dataset), 3-iteration runs with this change settled around: | Query | Current main avg ms | Patched avg ms | Delta | | --- | ---: | ---: | ---: | | q16 | 1106.69 | 986.76 | **-10.8%** | | q17 | 1127.64 | 1002.29 | **-11.1%** | Guardrail single-query runs of other byte-view second-column rehash workloads (q14, q18, q39) did not regress. This change is independent of, and stackable with, https://github.com/apache/datafusion/pull/22350 (the same idiom applied to `hash_array_primitive`). After both land, `seeded_state` becomes unused and can be removed in a small follow-up. ## What changes are included in this PR? Replaces the three `seeded_state(...).build_hasher() / value.hash_write(...) / hasher.finish()` blocks in the byte-view rehash paths: - `hash_string_view_array_inner`: inlined-view rehash branch and out-of-line bytes rehash branch - `hash_generic_byte_view_array`: `(no nulls, no buffers, rehash)` specialization Each is replaced with `combine_hashes(value.hash_one(random_state), *hash)`, matching the surrounding generic and dictionary paths. `seeded_state` itself and `hash_array_primitive` are unchanged. ## Are these changes tested? Covered by existing tests in `datafusion-common::hash_utils`, including the multi-column / string-view tests: - `cargo test -p datafusion-common hash_utils` - `cargo clippy -p datafusion-common --all-targets --all-features -- -D warnings` ## Are there any user-facing changes? No. This is an internal performance change with no API or behavioral impact — hash distribution properties are unchanged because `combine_hashes` is already the standard pattern in this file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
