Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]

via GitHub Wed, 26 Mar 2025 05:57:23 -0700


LiaCastaneda commented on code in PR #15438:
URL: https://github.com/apache/datafusion/pull/15438#discussion_r2014087709



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1470,17 +1470,26 @@ impl ValuesFields {
 
 pub fn change_redundant_column(fields: &Fields) -> Vec<Field> {
     let mut name_map = HashMap::new();
+    let mut seen: HashSet<String> = HashSet::new();
+
     fields
         .into_iter()
         .map(|field| {
-            let counter = 
name_map.entry(field.name().to_string()).or_insert(0);
-            *counter += 1;
-            if *counter > 1 {
-                let new_name = format!("{}:{}", field.name(), *counter - 1);
-                Field::new(new_name, field.data_type().clone(), 
field.is_nullable())
-            } else {
-                field.as_ref().clone()

Review Comment:
   This seems to be the root cause of the issue: when doing joins, there is a 
function 
[requalify_sides_if_needed](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1795)
 to handle aliasing the columns so the resulting schema of a join : `let 
in_join_schema = left.schema().join(right.schema())?;` can be created . 
However, if we had a query like:
   
   ```
   select
    *
   from
     first_agg
     LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id
     LEFT JOIN second_agg  ON first_agg.id = second_agg.id
     LEFT JOIN third_agg  ON first_agg.id = third_agg.id
   ```
   The first JOIN to be converted to a logical plan: `LEFT JOIN third_agg  ON 
first_agg.id = third_agg.id` will work, the  join schema col names will stay as 
they are with an alias , however on the subsequent JOINs it will fail since the 
consumer does the following steps for each JOIN:
   1. After handling the innermost join the resulting join schema is `[left.id] 
[right.id]`  ✅ 
   2. For the second join it we ["carry" the previous 
schema](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1262),
 so in `requalify_sides_if_needed` we would have  `[id, left.id] [id, 
right.id]` so we would have to alias again -> `[left.id, left.id] [right.id, 
right.id]` and because of this function we would end up having: `[left.id:1 , 
left.id] [right.id:1 , right.id]` ✅ 
   3. On the outermost and final join the process would be repeated: `[id, 
left.id:1 , left.id] [id, right.id:1 , right.id]` -> 
   `[left.id:1, left.id:1 , left.id] [right.id:1, right.id:1 , right.id]` and 
because of id:1 being repeated with the current `change_redundant_column` 
algorithm, the query will fail with `Schema contains duplicate unqualified 
field name "id:1"` 🟥 
   
   Moreover we can observe that if we do just two levels of joins we would get 
no error:
   
   ```
   select
   *
     from
   first_agg
     LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id
     LEFT JOIN second_agg  ON first_agg.id = second_agg.id
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]

Reply via email to