acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2782983563
I sincerely apologize for the delay in updating this PR. I have now designed a detailed comparative test for the `bool_or/bool_and` issue described by @Dandandan. The results are below. I'm not sure what you think about them: should we continue using `true_count/false_count`, or look for a better solution? @alamb @Dandandan

1. I designed six distribution scenarios ([code](https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5eR37-R120)) each for the `true` and `false` cases, and benchmarked `true_count/false_count` against `bool_or/bool_and` on the boolean array. The "Change" column is `bool_or/bool_and` relative to `true_count/false_count`. The final results are as follows:

| Test Case              | true_count/false_count | bool_or/bool_and            | Change           | Performance Impact |
| ---------------------- | ---------------------- | --------------------------- | ---------------- | ------------------ |
| all_false              | 1.6593 µs - 1.6695 µs  | 4.1013 µs - 4.1774 µs       | +150.91%         | 🚫 Regressed       |
| one_true_first         | 1.6726 µs - 1.6771 µs  | **1.6885 ns - 1.7430 ns**   | **-99.898%**     | ✅ Improved        |
| one_true_last          | 1.6663 µs - 1.6714 µs  | 4.1096 µs - 4.1554 µs       | +147.14%         | 🚫 Regressed       |
| one_true_middle        | 1.6723 µs - 1.6819 µs  | 2.1505 µs - 2.2117 µs       | +30.180%         | 🚫 Regressed       |
| one_true_middle_left   | 1.6672 µs - 1.6727 µs  | 1.1088 µs - 1.1483 µs       | -32.995%         | ✅ Improved        |
| one_true_middle_right  | 1.6689 µs - 1.6741 µs  | 3.1562 µs - 3.2521 µs       | +93.762%         | 🚫 Regressed       |
| all_true               | 1.6711 µs - 1.6747 µs  | 4.2779 µs - 4.4291 µs       | +155.35%         | 🚫 Regressed       |
| one_false_first        | 1.6722 µs - 1.6782 µs  | **1.6278 ns - 1.6360 ns**   | **-99.903%**     | ✅ Improved        |
| one_false_last         | 1.6818 µs - 1.7512 µs  | 4.2175 µs - 4.2930 µs       | +153.50%         | 🚫 Regressed       |
| one_false_middle       | 1.8437 µs - 1.9665 µs  | 2.0575 µs - 2.0871 µs       | +11.931%         | 🚫 Regressed       |
| one_false_middle_left  | 2.0004 µs - 2.3194 µs  | 1.0243 µs - 1.0398 µs       | -57.059%         | ✅ Improved        |
| one_false_middle_right | 2.0770 µs - 2.2721 µs  | 3.0275 µs - 3.0582 µs       | +47.668%         | 🚫 Regressed       |

It can be seen that `bool_or/bool_and` has a significant advantage only when the single `true`/`false` value is in the first position or to the left of the middle, with a lead of roughly 10^3 times in the first position. In all other cases `true_count/false_count` performs better, and its timings stay relatively stable across the scenarios.

2. The test can be reproduced with the following commands:

```sh
# test true_count/false_count
TEST_BOOL_COUNT=1 cargo bench --bench boolean_op
# test bool_or/bool_and
cargo bench --bench boolean_op
```

Detailed benchmark code: https://github.com/apache/datafusion/pull/15462/files#diff-8710f6b44dd74240d19e6fcdfbf971c034f59cd48c022e51b07ed876a8cc7c5e
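
To illustrate why the position of the first `true` matters so much in the results above, here is a minimal sketch of the two strategies on a plain bit-packed buffer. It is not the arrow-rs or DataFusion implementation: `pack_bits`, `any_true_by_count`, and `any_true_short_circuit` are hypothetical names made up for this example, and the sketch only shows how much data each strategy has to visit, not the timings measured by the benchmark.

```rust
// Hypothetical helpers for illustration only; these are not arrow-rs or DataFusion APIs.

/// Packs a slice of bools into 64-bit words, least significant bit first.
fn pack_bits(values: &[bool]) -> Vec<u64> {
    let mut words = vec![0u64; (values.len() + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        if v {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Counting strategy (in the spirit of `true_count`): every word is visited,
/// so the amount of work does not depend on where the set bits are.
fn any_true_by_count(words: &[u64]) -> bool {
    let true_count: usize = words.iter().map(|w| w.count_ones() as usize).sum();
    true_count > 0
}

/// Short-circuit strategy (in the spirit of `bool_or`): stops at the first
/// non-zero word. Returns the answer and the number of words visited.
fn any_true_short_circuit(words: &[u64]) -> (bool, usize) {
    for (idx, &w) in words.iter().enumerate() {
        if w != 0 {
            return (true, idx + 1);
        }
    }
    (false, words.len())
}
```

With a single `true` in the first position the short-circuit version visits one word, while with a single `true` in the last position (or no `true` at all) it visits every word, just like the counting version. That matches the shape of the table: the early-exit path wins only when the deciding value appears early, and the counting path has a flat cost.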