berkaysynnada commented on code in PR #13805: URL: https://github.com/apache/datafusion/pull/13805#discussion_r1888413179
########## datafusion/physical-expr/src/equivalence/properties.rs: ########## @@ -3651,4 +3661,38 @@ mod tests { sort_expr } + + #[test] + fn test_union_constant_value_preservation() -> Result<()> { + let schema = Arc::new(Schema::new(vec![ + Field::new("a", DataType::Int32, true), + Field::new("b", DataType::Int32, true), + ])); + + let col_a = col("a", &schema)?; + let literal_10 = ScalarValue::Int32(Some(10)); + + // Create first input with a=10 + let const_expr1 = + ConstExpr::new(Arc::clone(&col_a)).with_value(literal_10.clone()); + let input1 = EquivalenceProperties::new(Arc::clone(&schema)) + .with_constants(vec![const_expr1]); + + // Create second input with a=10 + let const_expr2 = + ConstExpr::new(Arc::clone(&col_a)).with_value(literal_10.clone()); + let input2 = EquivalenceProperties::new(Arc::clone(&schema)) + .with_constants(vec![const_expr2]); + + // Calculate union properties + let union_props = calculate_union(vec![input1, input2], schema)?; + + // Verify column 'a' remains constant with value 10 + let const_a = &union_props.constants()[0]; + assert!(const_a.expr().eq(&col_a)); + assert!(const_a.across_partitions()); + assert_eq!(const_a.value(), Some(&literal_10)); + + Ok(()) + } } Review Comment: Hey @alamb, I tried it but after thinking more, we actually need one more step in planner to experience an end-to-end difference. Now we have the knowledge, but we are not using it. 2 possible optimizations are which comes to my mind now: Let's assume we have: ``` # Constant value tracking across union query TT explain SELECT * FROM( ( SELECT * FROM aggregate_test_100 WHERE c1='a' ) UNION ALL ( SELECT * FROM aggregate_test_100 WHERE c1='a' )) ORDER BY c1 ---- + physical_plan + 01)SortPreservingMergeExec: [c1@0 ASC NULLS LAST] + 02)--UnionExec + 03)----CoalesceBatchesExec: target_batch_size=2 + 04)------FilterExec: c1@0 = a + 05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 + 06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true + 07)----CoalesceBatchesExec: target_batch_size=2 + 08)------FilterExec: c1@0 = a + 09)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 + 10)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true ``` 1) At the top of the plan, we see an SPM. However, it can have a CoalescePartitionsExec instead. That would improve the performance for sure. 2) For the same query without an order by but with another outer filter, we will see another filter. However, we can actually remove that. This is another optimization, but can be observed pretty rarely rather than 1st one. 2nd one could be not really realistic, but the first one could be implemented without much effort with a few changes in replace_with_order_preserving_variants scope. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org