[I] Possible hidden schema mismatch for HashJoin in ProjectionExec [datafusion]

via GitHub Thu, 05 Dec 2024 16:48:18 -0800


jayzhan211 opened a new issue, #13673:
URL: https://github.com/apache/datafusion/issues/13673


   ### Describe the bug
   
   The input schema and schema for projection are different in HashJoin cases 
because we might add additional alias project for hash join. If the schema has 
no *additional projected column* that the input schema has, it is possible to 
cause schema mismatch. In `stats_projection()`, because we ignore the error so 
we don't encounter such issue but I think we should not eat the error but fix 
the schema and column index instead.
   
   
https://github.com/apache/datafusion/blob/2464703c84c400a09cc59277018813f0e797bb4e/datafusion/physical-plan/src/projection.rs#L278-L283
   
   
   ### To Reproduce
   
   Take this query for example,
   
   ```
   statement count 0
   create table t1(name varchar, id int) as values ('df', 1), ('rs', 2), ('po', 
3);
   
   statement ok
   create table t2(product varchar, sid bigint) as values ('er', 1), ('sd', 2), 
('ge', 3);
   
   query TT
   select t1.name, t2.product from t1 inner join t2 on t1.id = t2.sid;
   ----
   df er
   rs sd
   po ge
   ```
   
   If we add the code to `ProjectExec::try_new()`
   ```rust
           // Construct a map from the input expressions to the output 
expression of the Projection
           let projection_mapping = ProjectionMapping::try_new(&expr, 
&input_schema)?;
           let cache =
               Self::compute_properties(&input, &projection_mapping, 
Arc::clone(&schema))?;
           
           println!("schema: {:?}", schema);
           println!("input schema: {:?}", input_schema);
           for (e, _) in expr.iter() {
               // panic here
               let _dt = e.data_type(&schema).unwrap();
           }
   
           Ok(Self {
               expr,
               schema,
               input,
               metrics: ExecutionPlanMetricsSet::new(),
               cache,
           })
   ```
   
   we can see that `input_schema` has additional column `CAST(t1.id AS Int64)`
   ```
   Schema {
     fields: [
       Field {
         name: "name",
         data_type: Utf8,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "id",
         data_type: Int32,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "product",
         data_type: Utf8,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "sid",
         data_type: Int64,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       }
     ],
     metadata: {}
   }
   
   Input Schema {
     fields: [
       Field {
         name: "name",
         data_type: Utf8,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "id",
         data_type: Int32,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "CAST(t1.id AS Int64)",
         data_type: Int64,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "product",
         data_type: Utf8,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       },
       Field {
         name: "sid",
         data_type: Int64,
         nullable: true,
         dict_id: 0,
         dict_is_ordered: false,
         metadata: {}
       }
     ],
     metadata: {}
   }
   
   ```
   
   If we use the `schema` for projection, we have the column mismatch index 
that could cause Column's `bounds_check`.
   
   The error is because of the out of bound index. `sid` is created as index 4 
for `input_schema` but the schema we used in projection has only 4 columns.
   ```
   thread 'tokio-runtime-worker' panicked at 
datafusion/physical-plan/src/projection.rs:102:44:
   called `Result::unwrap()` on an `Err` value: Internal("PhysicalExpr Column 
references column 'sid' at index 4 (zero-based) but input schema only has 4 
columns: [\"name\", \"id\", \"product\", \"sid\"]")
   ```
   
   ### Expected behavior
   
   I think we might either recreate expressions with the new index that match 
the `schema` or use the `input_schema` for projection. I'm not sure which is 
the right choice yet.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Possible hidden schema mismatch for HashJoin in ProjectionExec [datafusion]

Reply via email to