[PR] perf: combine_hashes for byte-view multi-column rehash [datafusion]

via GitHub Mon, 18 May 2026 22:51:54 -0700


Dandandan opened a new pull request, #22352:
URL: https://github.com/apache/datafusion/pull/22352


   ## Which issue does this PR close?
   
   <!-- No issue yet; happy to file one if preferred. -->
   
   - Closes #.
   
   ## Rationale for this change
   
   When multiple group keys participate in a grouped aggregate, `create_hashes` 
walks the columns left-to-right and rehashes each row by combining the current 
value's hash with the previously accumulated row hash. The generic 
variable-width path (`hash_array`) and the dictionary path 
(`hash_dictionary_array`) already do this with 
`combine_hashes(value.hash_one(random_state), *hash)`. The byte-view paths 
(`hash_string_view_array_inner` and the no-buffers rehash arm of 
`hash_generic_byte_view_array`) instead construct a fresh 
`foldhash::fast::SeedableRandomState` per row, seeded from the previous hash, 
and run the value through that. That approach is heavier on the hot loop and 
isn't necessary for correctness — the same hash distribution is achievable with 
`combine_hashes`.
   
   The motivating workloads are ClickBench q16 and q17, which both group by 
`UserID, SearchPhrase`. `SearchPhrase` is a `Utf8View`, so the second column 
rehashes through `hash_string_view_array_inner` for every row. On current-main 
baselines of **1106.69 ms / 1127.64 ms** (`dfbench clickbench --query 16 
--iterations 3` / `--query 17 --iterations 3`, partitioned dataset), 
3-iteration runs with this change settled around:
   
   | Query | Current main avg ms | Patched avg ms | Delta |
   | --- | ---: | ---: | ---: |
   | q16 | 1106.69 | 986.76 | **-10.8%** |
   | q17 | 1127.64 | 1002.29 | **-11.1%** |
   
   Guardrail single-query runs of other byte-view second-column rehash 
workloads (q14, q18, q39) did not regress.
   
   This change is independent of, and stackable with, 
https://github.com/apache/datafusion/pull/22350 (the same idiom applied to 
`hash_array_primitive`). After both land, `seeded_state` becomes unused and can 
be removed in a small follow-up.
   
   ## What changes are included in this PR?
   
   Replaces the three `seeded_state(...).build_hasher() / value.hash_write(...) 
/ hasher.finish()` blocks in the byte-view rehash paths:
   
   - `hash_string_view_array_inner`: inlined-view rehash branch and out-of-line 
bytes rehash branch
   - `hash_generic_byte_view_array`: `(no nulls, no buffers, rehash)` 
specialization
   
   Each is replaced with `combine_hashes(value.hash_one(random_state), *hash)`, 
matching the surrounding generic and dictionary paths. `seeded_state` itself 
and `hash_array_primitive` are unchanged.
   
   ## Are these changes tested?
   
   Covered by existing tests in `datafusion-common::hash_utils`, including the 
multi-column / string-view tests:
   
   - `cargo test -p datafusion-common hash_utils`
   - `cargo clippy -p datafusion-common --all-targets --all-features -- -D 
warnings`
   
   ## Are there any user-facing changes?
   
   No. This is an internal performance change with no API or behavioral impact 
— hash distribution properties are unchanged because `combine_hashes` is 
already the standard pattern in this file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] perf: combine_hashes for byte-view multi-column rehash [datafusion]

Reply via email to