lesam opened a new issue, #15754:
URL: https://github.com/apache/datafusion/issues/15754

   ### Describe the bug
   
   I believe this is a regression: the reproducer below works on `datafusion-42.0.0` but fails on `datafusion-46.0.0`.
   
   Physical planning fails, apparently because of a schema mismatch raised by the `join_selection` physical optimizer rule (there seems to be a related epic).
   
   ### To Reproduce
   
   Code:
   
   ```
   import pandas as pd
   from datafusion import SessionContext
   
   long = pd.DataFrame({"value_1": [0.1, 0.2, 0.3, 0.4], "key_1":[0,0,1,1]})
   short = pd.DataFrame({"value_2": ["a", "b"], "key_2":[0,1]})
   context = SessionContext()
   context.from_pandas(long, name="long")
   context.from_pandas(short, name="short")
   
   context.sql("select * from long").show()
   context.sql("select * from short").show()
   
   context.sql("select value_1, value_2 from long left join short on key_1 == key_2").show()
   ```
   
   Output:
   
   ```
   DataFrame()
   +---------+-------+
   | value_1 | key_1 |
   +---------+-------+
   | 0.1     | 0     |
   | 0.2     | 0     |
   | 0.3     | 1     |
   | 0.4     | 1     |
   +---------+-------+
   DataFrame()
   +---------+-------+
   | value_2 | key_2 |
   +---------+-------+
   | a       | 0     |
   | b       | 1     |
   +---------+-------+
   Traceback (most recent call last):
     File "/opt/project/test.py", line 13, in <module>
       context.sql("select value_1, value_2 from long left join short on key_1 == key_2").show()
     File "/usr/local/lib/python3.10/site-packages/datafusion/dataframe.py", line 440, in show
       self.df.show(num)
   Exception: DataFusion error: Internal("PhysicalOptimizer rule 
'join_selection' failed. Schema mismatch. Expected original schema: Schema { 
fields: [Field { name: \"value_1\", data_type: Float64, nullable: true, 
dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"value_2\", 
data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: 
{}], metadata: {\"pandas\": \"{\\\"index_columns\\\": [{\\\"kind\\\": 
\\\"range\\\", \\\"name\\\": null, \\\"start\\\": 0, \\\"stop\\\": 2, 
\\\"step\\\": 1}], \\\"column_indexes\\\": [{\\\"name\\\": null, 
\\\"field_name\\\": null, \\\"pandas_type\\\": \\\"unicode\\\", 
\\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": {\\\"encoding\\\": 
\\\"UTF-8\\\"}}], \\\"columns\\\": [{\\\"name\\\": \\\"value_2\\\", 
\\\"field_name\\\": \\\"value_2\\\", \\\"pandas_type\\\": \\\"unicode\\\", 
\\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": null}, {\\\"name\\\": 
\\\"key_2\\\", \\\"field_name\\\": \\\"key_2\\\", \\\"pandas_type\\\": 
\\\"int64\\\", \\\"numpy_type\\\": \\\"int64\\\", \\\"metadata\\\": 
null}], \\\"creator\\\": {\\\"library\\\": \\\"pyarrow\\\", \\\"version\\\": 
\\\"19.0.1\\\"}, \\\"pandas_version\\\": \\\"2.2.2\\\"}\"} }, got new schema: 
Schema { fields: [Field { name: \"value_1\", data_type: Float64, nullable: 
true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: 
\"value_2\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: 
false, metadata: {} }], metadata: {\"pandas\": \"{\\\"index_columns\\\": 
[{\\\"kind\\\": \\\"range\\\", \\\"name\\\": null, \\\"start\\\": 0, 
\\\"stop\\\": 4, \\\"step\\\": 1}], \\\"column_indexes\\\": [{\\\"name\\\": 
null, \\\"field_name\\\": null, \\\"pandas_type\\\": \\\"unicode\\\", 
\\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": {\\\"encoding\\\": 
\\\"UTF-8\\\"}}], \\\"columns\\\": [{\\\"name\\\": \\\"value_1\\\", 
\\\"field_name\\\": \\\"value_1\\\", \\\"pandas_type\\\": \\\"float64\\\", 
\\\"numpy_type\\\": \\\"float64\\\", \\\"metadata\\\": null}, 
{\\\"name\\\": \\\"key_1\\\", \\\"field_name\\\": 
\\\"key_1\\\", \\\"pandas_type\\\": \\\"int64\\\", \\\"numpy_type\\\": 
\\\"int64\\\", \\\"metadata\\\": null}], \\\"creator\\\": {\\\"library\\\": 
\\\"pyarrow\\\", \\\"version\\\": \\\"19.0.1\\\"}, \\\"pandas_version\\\": 
\\\"2.2.2\\\"}\"} }")
   ```
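
For reference, the field lists of the two schemas in the message are identical; only the schema-level `pandas` metadata differs. Below is a minimal sketch in plain Python that compares the two metadata blobs, which I transcribed by hand from the error above with the escaping removed. The helper name `differing_keys` is made up for illustration and is not part of DataFusion or pandas.

```python
import json

# "pandas" schema metadata from the "Expected original schema" part of the
# error message (unescaped by hand; it matches the `short` table).
EXPECTED_PANDAS_META = """
{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 2, "step": 1}],
 "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode",
                     "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}],
 "columns": [{"name": "value_2", "field_name": "value_2", "pandas_type": "unicode",
              "numpy_type": "object", "metadata": null},
             {"name": "key_2", "field_name": "key_2", "pandas_type": "int64",
              "numpy_type": "int64", "metadata": null}],
 "creator": {"library": "pyarrow", "version": "19.0.1"},
 "pandas_version": "2.2.2"}
"""

# "pandas" schema metadata from the "got new schema" part of the error
# message (it matches the `long` table).
GOT_PANDAS_META = """
{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 4, "step": 1}],
 "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode",
                     "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}],
 "columns": [{"name": "value_1", "field_name": "value_1", "pandas_type": "float64",
              "numpy_type": "float64", "metadata": null},
             {"name": "key_1", "field_name": "key_1", "pandas_type": "int64",
              "numpy_type": "int64", "metadata": null}],
 "creator": {"library": "pyarrow", "version": "19.0.1"},
 "pandas_version": "2.2.2"}
"""


def differing_keys(left_json: str, right_json: str) -> list[str]:
    """Return the sorted top-level keys whose values differ between two JSON objects."""
    left, right = json.loads(left_json), json.loads(right_json)
    return sorted(k for k in left.keys() | right.keys() if left.get(k) != right.get(k))


print(differing_keys(EXPECTED_PANDAS_META, GOT_PANDAS_META))
# → ['columns', 'index_columns']
```

So the expected schema carries the metadata of `short` and the new schema carries the metadata of `long`. My guess, not confirmed against the DataFusion source, is that the rule changes which input's metadata ends up on the join output.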
   
   ### Expected behavior
   
   ```
   DataFrame()
   +---------+-------+
   | value_1 | key_1 |
   +---------+-------+
   | 0.1     | 0     |
   | 0.2     | 0     |
   | 0.3     | 1     |
   | 0.4     | 1     |
   +---------+-------+
   DataFrame()
   +---------+-------+
   | value_2 | key_2 |
   +---------+-------+
   | a       | 0     |
   | b       | 1     |
   +---------+-------+
   DataFrame()
   +---------+---------+
   | value_1 | value_2 |
   +---------+---------+
   | 0.1     | a       |
   | 0.2     | a       |
   | 0.3     | b       |
   | 0.4     | b       |
   +---------+---------+
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

