LiaCastaneda commented on code in PR #15438: URL: https://github.com/apache/datafusion/pull/15438#discussion_r2014087709
########## datafusion/expr/src/logical_plan/builder.rs: ########## @@ -1470,17 +1470,26 @@ impl ValuesFields { pub fn change_redundant_column(fields: &Fields) -> Vec<Field> { let mut name_map = HashMap::new(); + let mut seen: HashSet<String> = HashSet::new(); + fields .into_iter() .map(|field| { - let counter = name_map.entry(field.name().to_string()).or_insert(0); - *counter += 1; - if *counter > 1 { - let new_name = format!("{}:{}", field.name(), *counter - 1); - Field::new(new_name, field.data_type().clone(), field.is_nullable()) - } else { - field.as_ref().clone() Review Comment: This seems to be the root cause of the issue: when doing joins, there is a function [requalify_sides_if_needed](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1795) to handle aliasing the columns so the resulting schema of a join : `let in_join_schema = left.schema().join(right.schema())?;` can be created . However, if we had a query like: ``` select * from first_agg LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id LEFT JOIN second_agg ON first_agg.id = second_agg.id LEFT JOIN third_agg ON first_agg.id = third_agg.id ``` The first JOIN to be converted to a logical plan: `LEFT JOIN third_agg ON first_agg.id = third_agg.id` will work, the join schema col names will stay as they are with an alias , however on the subsequent JOINs it will fail since the consumer does the following steps for each JOIN: 1. After handling the innermost join the resulting join schema is `[left.id] [right.id]` ✅ 2. For the second join it we ["carry" the previous schema](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1262), so in `requalify_sides_if_needed` we would have `[id, left.id] [id, right.id]` so we would have to alias again -> `[left.id, left.id] [right.id, right.id]` and because of this function we would end up having: `[left.id:1 , left.id] [right.id:1 , right.id]` ✅ 3. On the outermost and final join the process would be repeated: `[id, left.id:1 , left.id] [id, right.id:1 , right.id]` -> `[left.id:1, left.id:1 , left.id] [right.id:1, right.id:1 , right.id]` and because of id:1 being repeated with the current `change_redundant_column` algorithm, the query will fail with `Schema contains duplicate unqualified field name "id:1"` 🟥 Moreover we can observe that if we do just two levels of joins we would get no error: ``` select * from first_agg LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id LEFT JOIN second_agg ON first_agg.id = second_agg.id ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org