[I] Duplicate Unqualified Field Name [datafusion]

via GitHub Thu, 20 Feb 2025 10:10:47 -0800


LiaCastaneda opened a new issue, #14799:
URL: https://github.com/apache/datafusion/issues/14799


   ### Describe the bug
   
   I have encountered an error while doing a query with a join: 
   
   `failed to optimize plan: Internal error: Input field name count(Int64(1)) 
does not match with the projection expression count(Int64(1)):1`
   
   This occurs when I do a join and the resulting schema has more than 2 fields 
with a count(*) (or any aggregator), I understand that datafusion internally 
tries to rename the column names to be able to create a DFSchema, since arrow 
does not support duplicate names. While debugging I saw that what it does 
internally is:
   
   1. The name tracker sees `count(Int64(1))`
   2. The name traccker sees `count(Int64(1))`again and renames it to 
`count(Int64(1)):1`
   3. The name traccker sees `count(Int64(1))`again and renames it to 
`count(Int64(1)):2` or `count(Int64(1)):1` again?
   
   I have another related error that arises when trying to create the logical 
plan that is:
   `failed to translate modified plan to DataFusion: Schema error: Schema 
contains duplicate unqualified field name "count(Int64(1)):1"`
   
   On my local clone I managed to make datafusion generate a logical plan by 
tweaking [this function 
](https://github.com/apache/datafusion/blob/84232d806070f4b177e35148255c501a5f6e740b/datafusion/expr/src/logical_plan/builder.rs#L1392)
 but then when trying to build the physical plan I get the error I mentioned 
first.
   
   ### To Reproduce
   
   ```
   CREATE TABLE left_table (
       id INT PRIMARY KEY,
       category TEXT,
       timestamp TIMESTAMP
   );
   
   CREATE TABLE right_table (
       id INT PRIMARY KEY,
       category TEXT,
       timestamp TIMESTAMP
   );
   
   INSERT INTO left_table (id, category, timestamp) VALUES
   (1, 'business_logic', '2024-02-15 10:00:00'),
   (2, 'attack_attempt', '2024-02-15 10:05:00');
   
   INSERT INTO right_table (id, category, timestamp) VALUES
   (1, 'info', '2024-02-15 10:10:00'),
   (2, 'low', '2024-02-15 10:15:00');
   
   WITH first_agg AS (
       SELECT id, COUNT(*) AS count_first FROM left_table GROUP BY id
   ),
   second_agg AS (
       SELECT id, COUNT(*) AS count_second FROM right_table GROUP BY id
   ),
   
   third_agg AS (
       SELECT id, COUNT(*) AS count_third FROM right_table GROUP BY id
   ),
   fourth_random_table AS (
       SELECT id, category FROM right_table GROUP BY id, category
   )
   
   select count_first, category, count_second, count_third from first_agg 
     LEFT JOIN fourth_random_table using (id) 
     LEFT JOIN  second_agg using (id)
     LEFT JOIN third_agg using  (id) `
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   I'm using version 43.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Duplicate Unqualified Field Name [datafusion]

Reply via email to