fullstart opened a new issue, #14147:
URL: https://github.com/apache/datafusion/issues/14147

   ### Describe the bug
   
   Joining dataframes with duplicate column names fails if they are generated from a file read (I tried CSV and Parquet).
   Dataframes produced from a Python dict join without a problem.
   
   I did my testing with the latest version of DataFusion on Windows.
   
   ### To Reproduce
   
   The join works with dataframes created from a dict:
   ```
   from datafusion import SessionContext
   ctx = SessionContext()
   x1 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
   x2 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
   x1.join(x2, on="id1")
   Out[16]:
   DataFrame()
   +-----+------+------+-----+------+------+
   | id1 | col2 | col3 | id1 | col2 | col3 |
   +-----+------+------+-----+------+------+
   | 1   | 3    | 3    | 1   | 3    | 5    |
   | 2   | 4    | 4    | 2   | 4    | 6    |
   | 4   | 3    | 1    | 4   | 3    | 7    |
   | 5   | 5    | 2    | 5   | 5    | 8    |
   | 6   | 2    | 3    | 6   | 2    | 9    |
   +-----+------+------+-----+------+------+
   ```
   
   The join fails after writing the same dataframes to CSV and reading them back:
   ```
   x1.write_csv("df1.csv")
   x2.write_csv("df2.csv")
   
   x1_f = ctx.read_csv("df1.csv")
   x2_f = ctx.read_csv("df2.csv")
   
   x1_f.join(x2_f, on="id1")
   ---------------------------------------------------------------------------
   Exception                                 Traceback (most recent call last)
   Cell In[21], line 1
   ----> 1 x1_f.join(x2_f, on="id1")
   
   File ~\prj\datafusion_test\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
       465 if isinstance(right_on, str):
       466     right_on = [right_on]
   --> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))
   
   Exception: Schema error: No field named id1. Valid fields are "?table?"."1", "?table?"."3".
   ```
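
One possible reading of the error, not confirmed anywhere in this report: the valid field names "1" and "3" match the first data row of `x1` (1, 3, 3). That would be consistent with `df1.csv` having been written without a header row and then read back with its first row treated as the header. The sketch below uses only Python's standard `csv` module, not DataFusion, to illustrate that mechanism. The CSV contents and the helper name `inferred_columns` are assumptions made for illustration.

```python
import csv
import io

# Assumption: df1.csv was written WITHOUT a header row.
# The values are the rows of x1 from the report above.
headerless = "1,3,3\n2,4,4\n4,3,1\n5,5,2\n6,2,3\n"

# The same data with an explicit header row, for comparison.
with_header = "id1,col2,col3\n" + headerless


def inferred_columns(text):
    """Return the names a reader would infer if it treats row 1 as the header."""
    return next(csv.reader(io.StringIO(text)))


print(inferred_columns(headerless))   # ['1', '3', '3']
print(inferred_columns(with_header))  # ['id1', 'col2', 'col3']
```

If this is what happened, inspecting the schema of `x1_f` before the join, or opening `df1.csv` directly, should show whether the `id1` column name survived the round trip.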
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

