Re: [PR] perf: Optimize `approx_distinct` for inline Utf8View [datafusion]

via GitHub Thu, 19 Mar 2026 12:03:03 -0700


neilconway commented on PR #21064:
URL: https://github.com/apache/datafusion/pull/21064#issuecomment-4092608736


   This version only uses the short-string optimization if all of the strings 
in the StringViewArray are short. I tried a variant where we apply the 
optimization on a per-string basis:
   
   ```rust
     fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
         let array: &StringViewArray = downcast_value!(values[0], 
StringViewArray);
   
         // When strings are stored inline in the StringView (≤ 12 bytes),
         // hash the raw u128 view directly instead of materializing a &str.
         if array.data_buffers().is_empty() {
             // All strings are inline — skip per-element length check
             for (i, &view) in array.views().iter().enumerate() {
                 if !array.is_null(i) {
                     self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
                 }
             }
         } else {
             for (i, &view) in array.views().iter().enumerate() {
                 if array.is_null(i) {
                     continue;
                 }
                 if (view as u32) <= 12 {
                     self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
                 } else {
                     // SAFETY: i is within bounds, checked non-null above
                     let s = unsafe { array.value_unchecked(i) };
                     self.hll.add(s);
                 }
             }
         }
   
         Ok(())
     }
   ```
   
   But this version regresses long strings by ~5, which was stable over 
multiple runs:
   
   ```
     ┌──────────────────────┬──────────┬───────────┬───────────────┐
     │      Benchmark       │ Baseline │ Optimized │    Change     │
     ├──────────────────────┼──────────┼───────────┼───────────────┤
     │ utf8view short 80%   │ 12.16 µs │ 6.98 µs   │ -42.5%        │
     ├──────────────────────┼──────────┼───────────┼───────────────┤
     │ utf8view short 99%   │ 12.10 µs │ 6.96 µs   │ -42.5%        │
     ├──────────────────────┼──────────┼───────────┼───────────────┤
     │ utf8view long 80%    │ 20.19 µs │ 21.11 µs  │ +5.3%         │
     ├──────────────────────┼──────────┼───────────┼───────────────┤
     │ utf8view long 99%    │ 20.17 µs │ 21.05 µs  │ +4.3%         │
     ├──────────────────────┼──────────┼───────────┼───────────────┤
     ```
   
   Whereas the version in the PR avoids the regression on long strings, at the 
price of only applying the optimization if all strings in the StringViewArray 
are short:
   
   ```
     ┌────────────────────┬──────────┬───────────┬───────────┐
     │     Benchmark      │ Baseline │ Optimized │  Change   │
     ├────────────────────┼──────────┼───────────┼───────────┤
     │ utf8view short 80% │ 12.16 µs │ 7.14 µs   │ -41.3%    │
     ├────────────────────┼──────────┼───────────┼───────────┤
     │ utf8view short 99% │ 12.10 µs │ 7.11 µs   │ -41.3%    │
     ├────────────────────┼──────────┼───────────┼───────────┤
     │ utf8view long 80%  │ 20.19 µs │ 20.09 µs  │ no change │
     ├────────────────────┼──────────┼───────────┼───────────┤
     │ utf8view long 99%  │ 20.17 µs │ 19.88 µs  │ no change │
     └────────────────────┴──────────┴───────────┴───────────┘
   ```
   
   Lmk if you have a preference, or if you can see a way to get the best of 
both worlds here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: Optimize `approx_distinct` for inline Utf8View [datafusion]

Reply via email to