lesam opened a new issue, #15754: URL: https://github.com/apache/datafusion/issues/15754
### Describe the bug

I believe this is a regression as the reproducer case works on `datafusion-42.0.0` but fails on `datafusion-46.0.0`. Physical planning fails, I think due to schema mismatch (looks like there's a related epic)?

### To Reproduce

Code:
```
import pandas as pd
from datafusion import SessionContext

long = pd.DataFrame({"value_1": [0.1, 0.2, 0.3, 0.4], "key_1":[0,0,1,1]})
short = pd.DataFrame({"value_2": ["a", "b"], "key_2":[0,1]})

context = SessionContext()
context.from_pandas(long, name="long")
context.from_pandas(short, name="short")

context.sql("select * from long").show()
context.sql("select * from short").show()
context.sql("select value_1, value_2 from long left join short on key_1 == key_2").show()
```

Output:
```
DataFrame()
+---------+-------+
| value_1 | key_1 |
+---------+-------+
| 0.1     | 0     |
| 0.2     | 0     |
| 0.3     | 1     |
| 0.4     | 1     |
+---------+-------+
DataFrame()
+---------+-------+
| value_2 | key_2 |
+---------+-------+
| a       | 0     |
| b       | 1     |
+---------+-------+
Traceback (most recent call last):
  File "/opt/project/test.py", line 13, in <module>
    context.sql("select value_1, value_2 from long left join short on key_1 == key_2").show()
  File "/usr/local/lib/python3.10/site-packages/datafusion/dataframe.py", line 440, in show
    self.df.show(num)
Exception: DataFusion error: Internal("PhysicalOptimizer rule 'join_selection' failed. Schema mismatch. 
Expected original schema: Schema { fields: [Field { name: \"value_1\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"value_2\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {\"pandas\": \"{\\\"index_columns\\\": [{\\\"kind\\\": \\\"range\\\", \\\"name\\\": null, \\\"start\\\": 0, \\\"stop\\\": 2, \\\"step\\\": 1}], \\\"column_indexes\\\": [{\\\"name\\\": null, \\\"field_name\\\": null, \\\"pandas_type\\\": \\\"unicode\\\", \\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": {\\\"encoding\\\": \\\"UTF-8\\\"}}], \\\"columns\\\": [{\\\"name\\\": \\\"value_2\\\", \\\"field_name\\\": \\\"value_2\\\", \\\"pandas_type\\\": \\\"unicode\\\", \\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": null}, {\\\"name\\\": \\\"key_2\\\", \\\"field_name\\\": \\\"key_2\\\", \\\"pandas_type\\\": \\\"int64\\\", \\\"numpy_type\\\": \\\"int64\\\", \\\"metadata\\\": null}], \\\"creator\\\": {\\\"library\\\": \\\"pyarrow\\\", \\\"version\\\": \\\"19.0.1\\\"}, \\\"pandas_version\\\": \\\"2.2.2\\\"}\"} }, got new schema: Schema { fields: [Field { name: \"value_1\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"value_2\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {\"pandas\": \"{\\\"index_columns\\\": [{\\\"kind\\\": \\\"range\\\", \\\"name\\\": null, \\\"start\\\": 0, \\\"stop\\\": 4, \\\"step\\\": 1}], \\\"column_indexes\\\": [{\\\"name\\\": null, \\\"field_name\\\": null, \\\"pandas_type\\\": \\\"unicode\\\", \\\"numpy_type\\\": \\\"object\\\", \\\"metadata\\\": {\\\"encoding\\\": \\\"UTF-8\\\"}}], \\\"columns\\\": [{\\\"name\\\": \\\"value_1\\\", \\\"field_name\\\": \\\"value_1\\\", \\\"pandas_type\\\": \\\"float64\\\", \\\"numpy_type\\\": \\\"float64\\\", \\\"metadata\\\": null}, {\\\"name\\\": \\\"key_1\\\", \\\"field_name\\\": \\\"key_1\\\", \\\"pandas_type\\\": 
\\\"int64\\\", \\\"numpy_type\\\": \\\"int64\\\", \\\"metadata\\\": null}], \\\"creator\\\": {\\\"library\\\": \\\"pyarrow\\\", \\\"version\\\": \\\"19.0.1\\\"}, \\\"pandas_version\\\": \\\"2.2.2\\\"}\"} }") ``` ### Expected behavior ``` DataFrame() +---------+-------+ | value_1 | key_1 | +---------+-------+ | 0.1 | 0 | | 0.2 | 0 | | 0.3 | 1 | | 0.4 | 1 | +---------+-------+ DataFrame() +---------+-------+ | value_2 | key_2 | +---------+-------+ | a | 0 | | b | 1 | +---------+-------+ DataFrame() +---------+---------+ | value_1 | value_2 | +---------+---------+ | 0.1 | a | | 0.2 | a | | 0.3 | b | | 0.4 | b | +---------+---------+ ``` ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org