Re: [PR] Preserve constant values across union operations [datafusion]

via GitHub Tue, 17 Dec 2024 04:12:01 -0800


berkaysynnada commented on code in PR #13805:
URL: https://github.com/apache/datafusion/pull/13805#discussion_r1888413179



##########
datafusion/physical-expr/src/equivalence/properties.rs:
##########
@@ -3651,4 +3661,38 @@ mod tests {
 
         sort_expr
     }
+
+    #[test]
+    fn test_union_constant_value_preservation() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Int32, true),
+            Field::new("b", DataType::Int32, true),
+        ]));
+
+        let col_a = col("a", &schema)?;
+        let literal_10 = ScalarValue::Int32(Some(10));
+
+        // Create first input with a=10
+        let const_expr1 =
+            ConstExpr::new(Arc::clone(&col_a)).with_value(literal_10.clone());
+        let input1 = EquivalenceProperties::new(Arc::clone(&schema))
+            .with_constants(vec![const_expr1]);
+
+        // Create second input with a=10
+        let const_expr2 =
+            ConstExpr::new(Arc::clone(&col_a)).with_value(literal_10.clone());
+        let input2 = EquivalenceProperties::new(Arc::clone(&schema))
+            .with_constants(vec![const_expr2]);
+
+        // Calculate union properties
+        let union_props = calculate_union(vec![input1, input2], schema)?;
+
+        // Verify column 'a' remains constant with value 10
+        let const_a = &union_props.constants()[0];
+        assert!(const_a.expr().eq(&col_a));
+        assert!(const_a.across_partitions());
+        assert_eq!(const_a.value(), Some(&literal_10));
+
+        Ok(())
+    }
 }

Review Comment:
   Hey @alamb, I tried it but after thinking more, we actually need one more 
step in planner to experience an end-to-end difference. Now we have the 
knowledge, but we are not using it. 2 possible optimizations are which comes to 
my mind now:
   Let's assume we have:
   ```
   # Constant value tracking across union
   query TT
   explain
   SELECT * FROM(
   (
       SELECT * FROM aggregate_test_100 WHERE c1='a'
   )
   UNION ALL
   (
       SELECT * FROM aggregate_test_100 WHERE c1='a'
   ))
   ORDER BY c1
   ----
   +   physical_plan
   +   01)SortPreservingMergeExec: [c1@0 ASC NULLS LAST]
   +   02)--UnionExec
   +   03)----CoalesceBatchesExec: target_batch_size=2
   +   04)------FilterExec: c1@0 = a
   +   05)--------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   +   06)----------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, 
c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
   +   07)----CoalesceBatchesExec: target_batch_size=2
   +   08)------FilterExec: c1@0 = a
   +   09)--------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   +   10)----------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, 
c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13], has_header=true
   ```
   1) At the top of the plan, we see an SPM. However, it can have a 
CoalescePartitionsExec instead. That would improve the performance for sure.
   2) For the same query without an order by but with another outer filter, we 
will see another filter. However, we can actually remove that. This is another 
optimization, but can be observed pretty rarely rather than 1st one.
   
   2nd one could be not really realistic, but the first one could be 
implemented without much effort with a few changes in 
replace_with_order_preserving_variants scope.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Preserve constant values across union operations [datafusion]

Reply via email to