alamb commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741
########## benchmarks/queries/clickbench/README.md: ########## @@ -120,13 +122,42 @@ LIMIT 10; ``` Results look like - +``` +-------------+---------------------+---+------+------+------+ | ClientIP | WatchID | c | tmin | tp95 | tmax | +-------------+---------------------+---+------+------+------+ | 1611957945 | 6655575552203051303 | 2 | 0 | 0 | 0 | | -1402644643 | 8566928176839891583 | 2 | 0 | 0 | 0 | +-------------+---------------------+---+------+------+------+ +``` + +### Q6: How many social shares meet complex multi-stage filtering criteria? +**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters? +**Important Query Properties**: Simple filter with high-selectivity, Costly string matching, A large number of filters with high overhead are positioned relatively later in the process Review Comment: 👍 ########## datafusion/physical-expr/src/expressions/binary.rs: ########## @@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { use arrow::compute::kernels::numeric::*; + fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { + let data_type = arg.data_type(); + match (data_type, op) { + (DataType::Boolean, Operator::And) => { + match arg { + ColumnarValue::Array(array) => { + if let Ok(array) = as_boolean_array(&array) { + return array.false_count() == array.len(); + } + } + ColumnarValue::Scalar(scalar) => { + if let ScalarValue::Boolean(Some(value)) = scalar { + return !value; + } + } + } + false + } + (DataType::Boolean, Operator::Or) => { + match arg { + ColumnarValue::Array(array) => { + if let Ok(array) = as_boolean_array(&array) { + return array.true_count() == array.len(); + } + } + ColumnarValue::Scalar(scalar) => { + if let ScalarValue::Boolean(Some(value)) = scalar { + return *value; + } + } + } + false + } + _ => false, + } + } Review Comment: Could you please also add some basic sqllogictests to ensure these code paths are hit? Or show that they are covered by existing tests? ########## datafusion/physical-expr/src/expressions/binary.rs: ########## @@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { use arrow::compute::kernels::numeric::*; + fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { Review Comment: If we find that this slows down some other performance we could also add some sort of heuristic check to calling `false_count` / `true_count` -- like for example if the rhs arg is "complex" (not a Column for example) ########## datafusion/physical-expr/src/expressions/binary.rs: ########## @@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { use arrow::compute::kernels::numeric::*; + fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool { Review Comment: Thanks @acking-you -- this looks great Is there any reason to have this function defined in the `evaluate` method? I think we could just make it a normal function and reduce the nesting level -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org