Re: [PR] Fix UNION field nullability tracking [datafusion]

via GitHub Fri, 31 Jan 2025 03:20:08 -0800


findepi commented on code in PR #14356:
URL: https://github.com/apache/datafusion/pull/14356#discussion_r1937072353



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2645,6 +2643,106 @@ pub struct Union {
     pub schema: DFSchemaRef,
 }
 
+impl Union {
+    /// Constructs new Union instance deriving schema from inputs.
+    fn try_new(inputs: Vec<Arc<LogicalPlan>>) -> Result<Self> {
+        let schema = Self::derive_schema_from_inputs(&inputs, false)?;
+        Ok(Union { inputs, schema })
+    }
+
+    /// Constructs new Union instance deriving schema from inputs.
+    /// Inputs do not have to have matching types and produced schema will
+    /// take type from the first input.
+    pub fn try_new_with_loose_types(inputs: Vec<Arc<LogicalPlan>>) -> Result<Self> {
+        let schema = Self::derive_schema_from_inputs(&inputs, true)?;
+        Ok(Union { inputs, schema })
+    }
+
+    /// Constructs new Union instance deriving schema from inputs.
+    ///
+    /// `loose_types` if true, inputs do not have to have matching types and produced schema will
+    /// take type from the first input. TODO this is not necessarily reasonable behavior.
+    fn derive_schema_from_inputs(
+        inputs: &[Arc<LogicalPlan>],
+        loose_types: bool,
+    ) -> Result<DFSchemaRef> {
+        if inputs.len() < 2 {

Review Comment:
   > If I want to construct a UNION logical plan with different types that are 
coercible (be it by current builtin rules or future user-defined rules), then I 
would use the `Union::try_new_with_loose_types` and have the analyzer pass 
handle coercion. Is this right?
   
   Correct.
   Note: this is not my design. It was exactly the same before the PR; the 
code just moved around.
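
A rough sketch of the strict/loose split discussed here, using toy types rather than DataFusion's `LogicalPlan` and `DFSchema`. `ToyType`, `ToyField` and `derive_union_schema` are hypothetical names made up for illustration. The rule that an output column is nullable when any input column is nullable is an assumption based on the PR title, not something spelled out in this thread.

```rust
// Toy stand-ins for illustration only; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
    nullable: bool,
}

/// Derives a union schema from the input schemas.
///
/// Strict mode (`loose_types == false`) rejects mismatched column types.
/// Loose mode keeps the first input's type and leaves reconciliation
/// to a later coercion pass.
fn derive_union_schema(
    inputs: &[Vec<ToyField>],
    loose_types: bool,
) -> Result<Vec<ToyField>, String> {
    if inputs.len() < 2 {
        return Err("UNION requires at least two inputs".to_string());
    }
    let first = &inputs[0];
    if inputs.iter().any(|schema| schema.len() != first.len()) {
        return Err("UNION inputs must have the same number of columns".to_string());
    }

    let mut out = Vec::with_capacity(first.len());
    for (i, base) in first.iter().enumerate() {
        let mut nullable = base.nullable;
        for schema in &inputs[1..] {
            let field = &schema[i];
            if !loose_types && field.data_type != base.data_type {
                return Err(format!(
                    "column {}: expected {:?}, found {:?}",
                    i, base.data_type, field.data_type
                ));
            }
            // Assumption: any nullable input makes the output column nullable.
            nullable |= field.nullable;
        }
        // Name and type always come from the first input.
        out.push(ToyField {
            name: base.name.clone(),
            data_type: base.data_type.clone(),
            nullable,
        });
    }
    Ok(out)
}
```

With an `Int32` column in the first input and an `Int64` column in the second, the strict call returns an error, while the loose call yields an `Int32` column and relies on a later pass to insert casts.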
   
   > Then what exactly is the use case for the Union::try_new? Since it's used 
in the schema recompute which can occur after the analyzer type coercion.
   
   "schema recompute" is an overloaded term
   If it runs after the analyzer, it doesn't have to do any type coercion. In fact, 
it MUST NOT do any type coercion 
(https://github.com/apache/datafusion/issues/14296#issuecomment-2625773366), 
and indeed `try_new` does not do any coercion. It's still needed for 
column pruning.
   IMO we should remove "schema recompute" from the optimizer altogether: 
https://github.com/apache/datafusion/issues/14357. For column pruning we should 
explicitly prune the inputs of the union and the union itself using the same set of 
"required columns/indices". There is no need for a generic "recompute schema".
   
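A minimal sketch of what that explicit pruning could look like, again with toy types instead of DataFusion's plan nodes. `ToyInput`, `ToyUnion`, `project` and `prune_union` are hypothetical names. The only point is that the same list of required indices is applied to every input and to the union's own schema, so nothing has to be re-derived afterwards.

```rust
// Toy stand-ins for illustration only; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
struct ToyInput {
    columns: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyUnion {
    inputs: Vec<ToyInput>,
    schema: Vec<String>,
}

/// Keeps only the columns at the `required` positions, in that order.
/// Returns `None` if any index is out of range.
fn project(columns: &[String], required: &[usize]) -> Option<Vec<String>> {
    required.iter().map(|&i| columns.get(i).cloned()).collect()
}

/// Prunes every input and the union schema with the same index list.
fn prune_union(union: &ToyUnion, required: &[usize]) -> Option<ToyUnion> {
    let inputs = union
        .inputs
        .iter()
        .map(|input| project(&input.columns, required).map(|columns| ToyInput { columns }))
        .collect::<Option<Vec<_>>>()?;
    let schema = project(&union.schema, required)?;
    Some(ToyUnion { inputs, schema })
}
```

Pruning a three-column union with indices `[0, 2]` keeps the first and third column of each input and of the union schema. The types are never touched, so no coercion can sneak in at this stage.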



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

