LiaCastaneda commented on code in PR #15438:
URL: https://github.com/apache/datafusion/pull/15438#discussion_r2014087709
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1470,17 +1470,26 @@ impl ValuesFields {
pub fn change_redundant_column(fields: &Fields) -> Vec<Field> {
let mut name_map = HashMap::new();
+ let mut seen: HashSet<String> = HashSet::new();
+
fields
.into_iter()
.map(|field| {
- let counter =
name_map.entry(field.name().to_string()).or_insert(0);
- *counter += 1;
- if *counter > 1 {
- let new_name = format!("{}:{}", field.name(), *counter - 1);
- Field::new(new_name, field.data_type().clone(),
field.is_nullable())
- } else {
- field.as_ref().clone()
Review Comment:
This seems to be the root cause of the issue: when doing joins, there is a
function
[requalify_sides_if_needed](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1795)
to handle aliasing the columns so the resulting schema of a join : `let
in_join_schema = left.schema().join(right.schema())?;` can be created .
However, if we had a query like:
```
select
*
from
first_agg
LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id
LEFT JOIN second_agg ON first_agg.id = second_agg.id
LEFT JOIN third_agg ON first_agg.id = third_agg.id
```
The first JOIN to be converted to a logical plan: `LEFT JOIN third_agg ON
first_agg.id = third_agg.id` will work, the join schema col names will stay as
they are with an alias , however on the subsequent JOINs it will fail since the
consumer does the following steps for each JOIN:
1. After handling the innermost join the resulting join schema is `[left.id]
[right.id]` ✅
2. For the second join it we ["carry" the previous
schema](https://github.com/DataDog/datafusion/blob/7299d0e566caa1e10f47a74b8ae817b6fb146fdf/datafusion/substrait/src/logical_plan/consumer.rs#L1262),
so in `requalify_sides_if_needed` we would have `[id, left.id] [id,
right.id]` so we would have to alias again -> `[left.id, left.id] [right.id,
right.id]` and because of this function we would end up having: `[left.id:1 ,
left.id] [right.id:1 , right.id]` ✅
3. On the outermost and final join the process would be repeated: `[id,
left.id:1 , left.id] [id, right.id:1 , right.id]` ->
`[left.id:1, left.id:1 , left.id] [right.id:1, right.id:1 , right.id]` and
because of id:1 being repeated with the current `change_redundant_column`
algorithm, the query will fail with `Schema contains duplicate unqualified
field name "id:1"` 🟥
Moreover we can observe that if we do just two levels of joins we would get
no error:
```
select
*
from
first_agg
LEFT JOIN fourth_random_table ON first_agg.id = fourth_random_table.id
LEFT JOIN second_agg ON first_agg.id = second_agg.id
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]