Blizzara commented on code in PR #17299:
URL: https://github.com/apache/datafusion/pull/17299#discussion_r2330926518


##########
datafusion/substrait/src/logical_plan/consumer/rel/project_rel.rs:
##########
@@ -62,7 +62,17 @@ pub async fn from_project_rel(
                 // to transform it into a column reference
                 window_exprs.insert(e.clone());
             }
-            explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
+            // Since substrait removes aliases, we need to assign literals with a UUID alias to avoid
+            // ambiguous names when the same literal is used before and after a join.
+            // The name tracker will ensure that two literals in the same project would have
+            // unique names but, it does not ensure that if a literal column exists in a previous
+            // project say before a join that it is deduplicated with respect to those columns.

Review Comment:
   Fine by me.
   
   FWIW, I looked a bit at what it'd take to fix the tracker. I think the core of 
the issue is that DF checks name ambiguity in two ways: there's the 
AmbiguousColumn error you're running into, and then there is a 
`validate_unique_names()` function which gets called on the creation of the 
Project. The former needs unique non-qualified names, while the latter needs 
unique schema names (which _can_ be qualified). 
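
   As a rough illustration of those two checks, here is a simplified, self-contained model. `ExprName` and both functions are hypothetical names made up for this sketch; they are not DataFusion APIs, and the real checks are more involved.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the naming info of one projected expression.
#[derive(Debug, Clone)]
struct ExprName {
    /// Name part of the qualified name, e.g. "x" for column `left.x`.
    name: String,
    /// Schema name, e.g. "left.x" for column `left.x`.
    schema_name: String,
}

impl ExprName {
    /// Assumption for this sketch: the schema name is "qualifier.name"
    /// when a qualifier is present, and just the name otherwise.
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        let schema_name = match qualifier {
            Some(q) => format!("{q}.{name}"),
            None => name.to_string(),
        };
        Self { name: name.to_string(), schema_name }
    }
}

/// Models the first check: two expressions share a non-qualified name.
fn has_ambiguous_unqualified_name(exprs: &[ExprName]) -> bool {
    let mut seen = HashSet::new();
    exprs.iter().any(|e| !seen.insert(e.name.as_str()))
}

/// Models the second check: two expressions share a schema name.
fn has_duplicate_schema_name(exprs: &[ExprName]) -> bool {
    let mut seen = HashSet::new();
    exprs.iter().any(|e| !seen.insert(e.schema_name.as_str()))
}
```

   In this model, `left.x` next to `right.x` trips only the first check, so passing one of the checks says nothing about the other.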
   
   An easy fix for the former would be to change `name_for_alias()` into 
`qualified_name().1` here 
https://github.com/apache/datafusion/blob/1d9e13845021c2e82a012c2e83938c2a7661f295/datafusion/substrait/src/logical_plan/consumer/utils.rs#L398.
 However, that then regresses the latter check (including in the test case for 
this PR), since there will then be a project node with an expr `CAST(B.C as 
Utf8)` with a qualified name ([no qualifier], "B.C") and a schema name "B.C", 
as well as a reference to the original column `B.C` with a qualified name ("B", 
"C") and also schema name "B.C". As the name parts of the two qualified names 
differ ("B.C" vs. "C"), the expr wouldn't be renamed (after the change I 
propose), and the Project would then fail the `validate_unique_names()` check. 
So maybe for a proper fix, NameTracker would need to track **both the schema 
name and the name-part of the qualified name**, and rename until both are unique.
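
   To make the "track both" idea concrete, here is a minimal sketch. `DualNameTracker` and its `track` method are hypothetical and are not the existing `NameTracker` API. It also assumes that an aliased expr has no qualifier, so its name part and its schema name are both the alias, and the `__temp__` suffix scheme is only illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical tracker that remembers both kinds of names seen so far.
#[derive(Default)]
struct DualNameTracker {
    /// Name parts of qualified names, e.g. "C" for column `B.C`.
    seen_names: HashSet<String>,
    /// Schema names, e.g. "B.C" for column `B.C`.
    seen_schema_names: HashSet<String>,
}

impl DualNameTracker {
    /// Returns `None` if the expr can keep its name, or `Some(alias)`
    /// if it has to be renamed so that both names stay unique.
    fn track(&mut self, name: &str, schema_name: &str) -> Option<String> {
        if !self.seen_names.contains(name)
            && !self.seen_schema_names.contains(schema_name)
        {
            self.seen_names.insert(name.to_string());
            self.seen_schema_names.insert(schema_name.to_string());
            return None;
        }
        // Assumption: after aliasing, the name part and the schema name are
        // both the alias, so the candidate must be free in both sets.
        let mut counter = 1u64;
        loop {
            let candidate = format!("{name}__temp__{counter}");
            if !self.seen_names.contains(&candidate)
                && !self.seen_schema_names.contains(&candidate)
            {
                self.seen_names.insert(candidate.clone());
                self.seen_schema_names.insert(candidate.clone());
                return Some(candidate);
            }
            counter += 1;
        }
    }
}
```

   With this sketch, tracking the column `B.C` as `("C", "B.C")` and then the cast as `("B.C", "B.C")` renames the cast, because its schema name is already taken even though its name part is new.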
   
   (A simple example of the behavior of the CAST and validate_unique_names() is 
that `SELECT data.a, CAST(data.a as string) from data;` also fails in 
datafusion-cli.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

