Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

via GitHub Sat, 12 Apr 2025 08:18:25 -0700


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798871646


   > > [@acking-you](https://github.com/acking-you) the code needs to be 
extended to support nulls (you can take a look at the true_count implementation 
in arrow-rs to do this efficiently).
   > 
   > I have an idea for an early exit losing performance, and I'm trying it 
out. If it works, I'll post the code
   
   @Dandandan You're absolutely right. At this point, we should no longer focus 
on optimizing the early return for the `true_count` calculation. Instead, we 
should consider extending support to new optimization scenarios, such as 
handling `nulls` and other cases. Here’s why:
   
   ## My Attempt
   I experimented with optimizing the `true_count` calculation using SIMD 
instructions, performing 256-bit comparisons and computations while 
incorporating early returns. This significantly reduced the overhead of 
checking boolean arrays in non-short-circuit scenarios (with a performance 
boost of up to 6x in extreme cases). However, in short-circuit scenarios like 
`all false` or `all true`, it still performed about 20% slower than the 
existing `false_count/true_count` implementation.
   
   When I applied this optimization to the execution of `BinaryExpr` and 
benchmarked the entire `BinaryExpr`, I observed a performance regression. The 
reasons are as follows:
   1. The computation of `true_count` for boolean arrays is only on the order 
of 60ns, and even with early exit optimizations, it remains at the 10ns level 
(in extreme cases).
   2. When `BinaryExpr` does not trigger short-circuiting, its execution time 
is on the order of 400us (depending on the complexity of the expression). 
Optimizing the `true_count` calculation for boolean arrays becomes negligible 
in this context.
   3. When short-circuiting occurs, the time complexity of `BinaryExpr` is on 
par with the computation of the boolean array. In this case, calling 
`true_count` remains the optimal solution.
   
   ## Other Optimization Scenarios
   - Extending short-circuit optimizations to handle `nulls`.
   - The early execution of selection that I previously mentioned: 
[https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437](https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437).
   
   I realized that my earlier conclusions about the benefits of early selection 
execution were incorrect. After testing with the latest code changes (on a 
machine with an AMD 9950x CPU), the results are as follows: all scenarios that 
benefited from this optimization saw a reduction in execution time from `450us` 
to `170us`. This is a significant improvement, and it did not slow down any 
other cases!
   
   ```sh
   short_circuit/and/one_true_first
                           time:   [188.36 µs 188.96 µs 189.55 µs]
                           change: [-58.870% -58.652% -58.420%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   short_circuit/and/one_true_last
                           time:   [170.10 µs 171.04 µs 171.97 µs]
                           change: [-62.788% -62.520% -62.258%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   short_circuit/and/one_true_middle
                           time:   [165.38 µs 165.97 µs 166.57 µs]
                           change: [-64.176% -63.968% -63.784%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   short_circuit/and/one_true_middle_left
                           time:   [165.02 µs 165.48 µs 165.96 µs]
                           change: [-63.671% -63.458% -63.204%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   short_circuit/and/one_true_middle_right
                           time:   [169.09 µs 169.61 µs 170.13 µs]
                           change: [-63.298% -63.089% -62.868%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   short_circuit/and/one_true_middle_right #2
                           time:   [166.43 µs 167.12 µs 167.85 µs]
                           change: [-64.092% -63.916% -63.751%] (p = 0.00 < 
0.05)
                           Performance has improved.
   ```
   
   I further optimized the work done by @kosiew. Relevant PR: 
https://github.com/apache/datafusion/pull/15694 (currently, there are still 
some issues with `cargo test`, and I'm working on fixing them).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

Reply via email to