acking-you commented on PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2782983563

   I sincerely apologize for the delay in updating this PR. I have now designed 
a detailed comparative test for the `bool_or/bool_and` issue described by 
@Dandandan, and the results are below. What do you think of them: should we 
continue using `true_count/false_count`, or look for a better solution? 
@alamb @Dandandan
   
   
   1. I designed six distribution scenarios 
([code](https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5eR37-R120)) 
to compare the performance of `true_count/false_count` against 
`bool_or/bool_and` on a boolean array. The final results are as follows:
   
   | Test Case              | true_count/false_count | bool_or/bool_and            | Change        | Performance Impact |
   | ---------------------- | ---------------------- | --------------------------- | ------------- | ------------------ |
   | all_false              | 1.6593 µs - 1.6695 µs  | 4.1013 µs - 4.1774 µs       | +150.91%      | 🚫 Regressed       |
   | one_true_first         | 1.6726 µs - 1.6771 µs  | **1.6885 ns - 1.7430 ns**   | **-99.898%**  | ✅ Improved         |
   | one_true_last          | 1.6663 µs - 1.6714 µs  | 4.1096 µs - 4.1554 µs       | +147.14%      | 🚫 Regressed       |
   | one_true_middle        | 1.6723 µs - 1.6819 µs  | 2.1505 µs - 2.2117 µs       | +30.180%      | 🚫 Regressed       |
   | one_true_middle_left   | 1.6672 µs - 1.6727 µs  | 1.1088 µs - 1.1483 µs       | -32.995%      | ✅ Improved         |
   | one_true_middle_right  | 1.6689 µs - 1.6741 µs  | 3.1562 µs - 3.2521 µs       | +93.762%      | 🚫 Regressed       |
   | all_true               | 1.6711 µs - 1.6747 µs  | 4.2779 µs - 4.4291 µs       | +155.35%      | 🚫 Regressed       |
   | one_false_first        | 1.6722 µs - 1.6782 µs  | **1.6278 ns - 1.6360 ns**   | **-99.903%**  | ✅ Improved         |
   | one_false_last         | 1.6818 µs - 1.7512 µs  | 4.2175 µs - 4.2930 µs       | +153.50%      | 🚫 Regressed       |
   | one_false_middle       | 1.8437 µs - 1.9665 µs  | 2.0575 µs - 2.0871 µs       | +11.931%      | 🚫 Regressed       |
   | one_false_middle_left  | 2.0004 µs - 2.3194 µs  | 1.0243 µs - 1.0398 µs       | -57.059%      | ✅ Improved         |
   | one_false_middle_right | 2.0770 µs - 2.2721 µs  | 3.0275 µs - 3.0582 µs       | +47.668%      | 🚫 Regressed       |
   
   The results show that `bool_or/bool_and` has a clear advantage only when the 
single `true`/`false` value sits to the left of the middle of the array, and it 
is roughly 10^3 times faster when that value is in the first position.  
   
   In all other cases `true_count/false_count` is faster, and its runtime stays 
roughly constant (about 1.7 µs to 2.3 µs) across the scenarios.
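
One possible reading of this pattern, sketched under assumptions: a counting approach visits every word of the bitmap with one popcount each, so its cost does not depend on where the value sits, while a short-circuit approach inspects values one at a time and stops at the first match. The helpers below (`make_bits`, `any_true_by_count`, `any_true_short_circuit`) and the array length are hypothetical and are not the arrow-rs or DataFusion implementations; the sketch only illustrates the amount of work in each scenario.

```rust
/// Hypothetical stand-in for a bit-packed boolean buffer: 64 values per word.
/// `true_at` optionally sets a single bit; all other bits are false.
fn make_bits(len: usize, true_at: Option<usize>) -> Vec<u64> {
    let mut words = vec![0u64; (len + 63) / 64];
    if let Some(i) = true_at {
        words[i / 64] |= 1u64 << (i % 64);
    }
    words
}

/// Counting approach: always visits every word, one popcount per word.
fn any_true_by_count(words: &[u64]) -> bool {
    let true_count: u32 = words.iter().map(|w| w.count_ones()).sum();
    true_count > 0
}

/// Short-circuit approach: tests one bit at a time and stops at the first `true`.
/// Returns the result and the number of bits inspected.
fn any_true_short_circuit(words: &[u64], len: usize) -> (bool, usize) {
    for i in 0..len {
        if (words[i / 64] >> (i % 64)) & 1 == 1 {
            return (true, i + 1);
        }
    }
    (false, len)
}

fn main() {
    let len = 8192; // assumed array length, not taken from the benchmark
    let scenarios = [
        ("one_true_first", Some(0)),
        ("one_true_middle", Some(len / 2)),
        ("one_true_last", Some(len - 1)),
        ("all_false", None),
    ];
    for (name, pos) in scenarios {
        let words = make_bits(len, pos);
        let (found, inspected) = any_true_short_circuit(&words, len);
        // Both approaches must agree on the answer; only the work differs.
        assert_eq!(found, any_true_by_count(&words));
        println!(
            "{name}: found={found}, bits inspected={inspected}, words counted={}",
            words.len()
        );
    }
}
```

In this sketch the counting path always touches the same number of words, whereas the short-circuit path does work proportional to the position of the first match, which is consistent with the crossover appearing somewhere left of the middle in the table above.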
   
   2. The tests can be reproduced with the following commands:
   ```sh
   # test true_count/false_count 
   TEST_BOOL_COUNT=1 cargo bench --bench boolean_op
   # test bool_or/bool_and 
   cargo bench --bench boolean_op
   ```
   Detailed benchmark code: 
https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5e


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

