neilconway commented on PR #21064:
URL: https://github.com/apache/datafusion/pull/21064#issuecomment-4092608736
This version only uses the short-string optimization if all of the strings
in the StringViewArray are short. I tried a variant where we apply the
optimization on a per-string basis:
```rust
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
let array: &StringViewArray = downcast_value!(values[0],
StringViewArray);
// When strings are stored inline in the StringView (≤ 12 bytes),
// hash the raw u128 view directly instead of materializing a &str.
if array.data_buffers().is_empty() {
// All strings are inline — skip per-element length check
for (i, &view) in array.views().iter().enumerate() {
if !array.is_null(i) {
self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
}
}
} else {
for (i, &view) in array.views().iter().enumerate() {
if array.is_null(i) {
continue;
}
if (view as u32) <= 12 {
self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
} else {
// SAFETY: i is within bounds, checked non-null above
let s = unsafe { array.value_unchecked(i) };
self.hll.add(s);
}
}
}
Ok(())
}
```
But this version regresses long strings by ~5, which was stable over
multiple runs:
```
┌──────────────────────┬──────────┬───────────┬───────────────┐
│ Benchmark │ Baseline │ Optimized │ Change │
├──────────────────────┼──────────┼───────────┼───────────────┤
│ utf8view short 80% │ 12.16 µs │ 6.98 µs │ -42.5% │
├──────────────────────┼──────────┼───────────┼───────────────┤
│ utf8view short 99% │ 12.10 µs │ 6.96 µs │ -42.5% │
├──────────────────────┼──────────┼───────────┼───────────────┤
│ utf8view long 80% │ 20.19 µs │ 21.11 µs │ +5.3% │
├──────────────────────┼──────────┼───────────┼───────────────┤
│ utf8view long 99% │ 20.17 µs │ 21.05 µs │ +4.3% │
├──────────────────────┼──────────┼───────────┼───────────────┤
```
Whereas the version in the PR avoids the regression on long strings, at the
price of only applying the optimization if all strings in the StringViewArray
are short:
```
┌────────────────────┬──────────┬───────────┬───────────┐
│ Benchmark │ Baseline │ Optimized │ Change │
├────────────────────┼──────────┼───────────┼───────────┤
│ utf8view short 80% │ 12.16 µs │ 7.14 µs │ -41.3% │
├────────────────────┼──────────┼───────────┼───────────┤
│ utf8view short 99% │ 12.10 µs │ 7.11 µs │ -41.3% │
├────────────────────┼──────────┼───────────┼───────────┤
│ utf8view long 80% │ 20.19 µs │ 20.09 µs │ no change │
├────────────────────┼──────────┼───────────┼───────────┤
│ utf8view long 99% │ 20.17 µs │ 19.88 µs │ no change │
└────────────────────┴──────────┴───────────┴───────────┘
```
Lmk if you have a preference, or if you can see a way to get the best of
both worlds here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]