g3blv opened a new issue, #15136:
URL: https://github.com/apache/datafusion/issues/15136
### Describe the bug
I'm trying to perform a simple JOIN between two tables in DataFusion, but I
keep hitting an error during physical optimization. I have two RecordBatch
objects that I've registered in the context, but when I try to join them, I get
a schema mismatch error even though the fields look compatible.
## What I did
I created two RecordBatch objects with the following schemas:
```rust
// First table (sources)
Schema {
    fields: [
        Field { name: "id", data_type: Utf8, nullable: true, ... },
        Field { name: "created", data_type: Utf8, nullable: true, ... },
        Field { name: "title", data_type: Utf8, nullable: true, ... },
        Field { name: "uri", data_type: Utf8, nullable: true, ... },
    ],
    metadata: { "table_name": "sources" },
}

// Second table (media)
Schema {
    fields: [
        Field { name: "id", data_type: Int64, nullable: true, ... },
        Field { name: "created", data_type: Utf8, nullable: true, ... },
        Field { name: "published", data_type: Utf8, nullable: true, ... },
        Field { name: "title", data_type: Utf8, nullable: true, ... },
        Field { name: "description", data_type: Utf8, nullable: true, ... },
        Field { name: "mime_type", data_type: Utf8, nullable: true, ... },
        Field { name: "file", data_type: Utf8, nullable: true, ... },
        Field { name: "episode", data_type: Int64, nullable: true, ... },
        Field { name: "uri", data_type: Utf8, nullable: true, ... },
        Field { name: "source_id", data_type: Utf8, nullable: true, ... },
    ],
    metadata: { "table_name": "media" },
}
```
I registered both tables in the context and confirmed they were there:
```rust
// Register each batch under the table name stored in its schema metadata.
for record_batch in record_batches {
    let table_name = record_batch
        .schema()
        .metadata()
        .get("table_name")
        .unwrap_or(&"unknown_table".to_string())
        .clone();
    ctx.register_batch(&table_name, record_batch)?;
}

// List the tables in the default catalog and schema to confirm registration.
let catalog = ctx.catalog("datafusion").unwrap();
let schema = catalog.schema("public").unwrap();
let tables = schema.table_names();
println!("Registered tables: {:?}", tables);
// This prints: Registered tables: ["sources", "media"]
```
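The name lookup in the loop above can be exercised on its own without DataFusion. This is a minimal std-only sketch: `table_name_from` is a hypothetical helper (not part of DataFusion or Arrow), and a plain `HashMap` stands in for the schema metadata.

```rust
use std::collections::HashMap;

/// Hypothetical helper: read the table name from schema-level metadata,
/// falling back to "unknown_table" when the key is absent.
fn table_name_from(metadata: &HashMap<String, String>) -> String {
    metadata
        .get("table_name")
        .cloned()
        .unwrap_or_else(|| "unknown_table".to_string())
}

fn main() {
    // Metadata shaped like the `sources` schema above.
    let mut metadata = HashMap::new();
    metadata.insert("table_name".to_string(), "sources".to_string());

    println!("{}", table_name_from(&metadata));
    // A batch without the key falls back to the default name.
    println!("{}", table_name_from(&HashMap::new()));
}
```

Using `cloned()` with `unwrap_or_else` avoids borrowing a temporary `String` and only builds the fallback when it is needed.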
Then I tried to execute a simple JOIN query:
```rust
ctx.sql("SELECT sources.id, media.title FROM sources JOIN media ON sources.id = media.source_id").await?
```
## The error I'm getting
```
DataFusion error: Internal error: PhysicalOptimizer rule 'join_selection' failed. Schema mismatch.
Expected original schema: Schema {
    fields: [
        Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
        Field { name: "title", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
    ],
    metadata: {"table_name": "media"}
},
got new schema: Schema {
    fields: [
        Field { name: "id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
        Field { name: "title", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
    ],
    metadata: {"table_name": "sources"}
}.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
```
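Reading the two schemas in the error side by side, the field lists are identical and only the schema-level `table_name` metadata differs (`"media"` vs. `"sources"`). The sketch below models that situation with hypothetical stand-in types, not Arrow's real `Schema` and `Field`. It assumes the comparison includes metadata, which is what the message suggests.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only; not Arrow's actual types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: HashMap<String, String>,
}

/// Build a nullable Utf8 field, mirroring the fields in the error output.
fn utf8(name: &str) -> Field {
    Field { name: name.to_string(), data_type: "Utf8".to_string(), nullable: true }
}

/// Build the two-column projected schema tagged with a `table_name` entry.
fn schema_with_table_name(table: &str) -> Schema {
    Schema {
        fields: vec![utf8("id"), utf8("title")],
        metadata: HashMap::from([("table_name".to_string(), table.to_string())]),
    }
}

fn main() {
    let expected = schema_with_table_name("media");
    let got = schema_with_table_name("sources");

    // The fields match exactly...
    assert_eq!(expected.fields, got.fields);
    // ...but whole-schema equality fails because the metadata differs.
    assert_ne!(expected, got);

    // With the metadata dropped, the two schemas compare equal.
    let strip = |s: &Schema| Schema { fields: s.fields.clone(), metadata: HashMap::new() };
    assert_eq!(strip(&expected), strip(&got));
    println!("fields equal, schemas differ only in metadata");
}
```

If this reading is right, the mismatch comes from the per-table metadata and not from the column types. This has not been confirmed against DataFusion's own check.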
I'm relatively new to DataFusion, so I might be doing something wrong, but
based on the error message it seems like a potential bug.
## My environment
- DataFusion version: 46.0
- Rust version: rustc 1.82.0
- OS: Fedora Workstation 41
Let me know if you need any additional info to help diagnose this. Thanks!
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_