acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437
I have an idea that might improve the effectiveness of the short-circuit optimization, and it seems necessary to use `false_count` for the evaluation.

The current issue with how DataFusion executes a `BinaryExpr`: after computing the `left` side, the result is not immediately used to filter the batch, which would reduce the input size for the `right` side.

Example:

```
Current:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false]
batch:[1,2,3,4] -> execute right -> bool array: [true,true,false,false]

Might be better:
batch:[1,2,3,4] -> execute left -> bool array: [true,false,true,false] -> batch:[1,3]
batch:[1,3] -> execute right -> bool array: [true,false] -> batch:[1]
```

I tried implementing this using [evaluate_selection](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_selection), but performance regressed in many cases because its internal implementation has to copy data to create a new `RecordBatch`.

However, perhaps we could decide heuristically whether to pre-filter the `RecordBatch` based on `false_count`, for example when `false_count / array_len > 0.8`.

By the way, I recently looked into ClickHouse's execution logic for `BinaryOp`. It immediately uses the result of each expression to filter the input and then executes the next expression. This also involves copying, but ClickHouse accelerates the checking and copying with SIMD instructions. I also noticed that arrow-rs, which DataFusion uses, has a very efficient approach for this process: [IterationStrategy](https://github.com/apache/arrow-rs/blob/9322547590ab32efeff8c0486e4a3a2cb5887a26/arrow-select/src/filter.rs#L296-L318).

Do you think this is a good idea? @alamb @Dandandan
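The pre-filtering heuristic described in the comment can be sketched in plain Rust. This is only an illustration under stated assumptions: it uses `Vec<i32>` and `Vec<bool>` in place of Arrow arrays and a `RecordBatch`, and the function name `eval_and` and its `threshold` parameter are hypothetical, not DataFusion APIs. The second return value counts how many rows the right-hand side was evaluated on, to make the saving visible.

```rust
/// Evaluate `left AND right` over a batch of values.
/// Hypothetical helper for illustration only; not part of DataFusion.
/// Returns the result mask and the number of rows `right` was evaluated on.
fn eval_and(
    batch: &[i32],
    left: impl Fn(i32) -> bool,
    right: impl Fn(i32) -> bool,
    threshold: f64, // assumed knob, e.g. 0.8 as suggested in the comment
) -> (Vec<bool>, usize) {
    // Evaluate the left side on the full batch and count its false results.
    let left_mask: Vec<bool> = batch.iter().map(|&v| left(v)).collect();
    let false_count = left_mask.iter().filter(|&&b| !b).count();
    let ratio = if batch.is_empty() {
        0.0
    } else {
        false_count as f64 / batch.len() as f64
    };

    if ratio <= threshold {
        // Not selective enough: evaluate right on the whole batch, then AND.
        let right_mask: Vec<bool> = batch.iter().map(|&v| right(v)).collect();
        let mask: Vec<bool> = left_mask
            .iter()
            .zip(&right_mask)
            .map(|(&l, &r)| l && r)
            .collect();
        return (mask, batch.len());
    }

    // Selective enough: copy only the rows that passed the left side.
    // This copy is the cost that the heuristic tries to amortize.
    let selected: Vec<i32> = batch
        .iter()
        .zip(&left_mask)
        .filter_map(|(&v, &keep)| keep.then_some(v))
        .collect();
    let right_small: Vec<bool> = selected.iter().map(|&v| right(v)).collect();

    // Scatter the shorter right-hand result back to the original row positions.
    let mut right_iter = right_small.into_iter();
    let mask: Vec<bool> = left_mask
        .iter()
        .map(|&l| if l { right_iter.next().unwrap_or(false) } else { false })
        .collect();
    (mask, selected.len())
}
```

For the batch `[1,2,3,4]` from the example, with `left` as "is odd" and `right` as "at most 2" (predicates picked here only to reproduce the masks shown above), half of the left results are false. A threshold of 0.8 therefore leaves the batch unfiltered and `right` runs on 4 rows, while a threshold of 0.4 triggers the pre-filter and `right` runs on 2 rows. The resulting mask is the same in both cases.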
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org