killme2008 opened a new issue, #20544:
URL: https://github.com/apache/datafusion/issues/20544

   ### Describe the bug
   
   When an `INSERT INTO ... VALUES` statement contains a mix of bare 
placeholders (`$1`, `$3`) and function-wrapped placeholders (`func($2)`), the 
inferred types for placeholders can be incorrectly assigned due to an index 
corruption bug in 
[insert_to_plan](https://github.com/apache/datafusion/blob/e937cadbcceff6a42bee2c5fc8d03068fa0eb30c/datafusion/sql/src/statement.rs#L2257).
   
   It scans the `VALUES` clause and collects placeholder-to-column-type 
mappings into a `BTreeMap<usize, FieldRef>`. Only bare placeholders are 
matched; placeholders nested inside function calls are skipped. The `BTreeMap` 
is then converted to a `Vec` via `into_values().collect()`, which discards the 
keys and compresses the sparse entries into a dense `Vec`.
   
   For example, given:
   
   ```sql
     INSERT INTO t (id, name, age) VALUES ($1, upper($2), $3)
     -- columns: id(UInt32), name(Utf8), age(Int32)
   ```
   
   1. The scan produces `BTreeMap { 0: UInt32, 2: Int32 }` (index 1 is skipped 
because `$2` is inside `upper()`)
   2. `into_values().collect()` produces `Vec [UInt32, Int32]`, so indices 0 and 
2 collapse to 0 and 1
   3. When `create_placeholder_expr` resolves `$2`, it computes `idx = 2 - 1 = 
1` and looks up `Vec[1] = Int32`
   4. `$2` is incorrectly inferred as `Int32` (the type of `age`) instead of 
being left as unknown (`None`)
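
   The collapse in steps 1–2 can be reproduced with a standalone sketch (using 
`&str` stand-ins for the actual `FieldRef` types):

   ```rust
   use std::collections::BTreeMap;

   fn main() {
       // Sparse map: placeholder index -> inferred type. Index 1 is missing
       // because $2 is nested inside a function call and was skipped.
       let mut types: BTreeMap<usize, &str> = BTreeMap::new();
       types.insert(0, "UInt32");
       types.insert(2, "Int32");

       // into_values().collect() drops the keys and packs the values densely.
       let dense: Vec<&str> = types.into_values().collect();
       assert_eq!(dense, vec!["UInt32", "Int32"]);

       // Resolving $2 as dense[2 - 1] now yields "Int32" (the type of `age`)
       // instead of the intended "unknown".
       assert_eq!(dense[2 - 1], "Int32");
   }
   ```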
   
   This causes downstream type-checking failures. In real-world usage (e.g., 
[GreptimeTeam/greptimedb#7610](https://github.com/GreptimeTeam/greptimedb/issues/7610)), 
`INSERT INTO t VALUES ($1, parse_json($2), $3)` fails because `$2` is wrongly 
inferred as `Timestamp` instead of `String`, making `parse_json(Timestamp)` 
fail type validation.
   
   ### Fix
   
   Replace `into_values().collect()` with a conversion that preserves index 
positions, filling gaps with `None`:
   
   ```rust
     // Before (broken): keys are discarded, sparse entries collapse.
     let prepare_param_data_types = prepare_param_data_types.into_values().collect();
   
     // After (correct): preserve each placeholder's index, filling gaps
     // with None. (The map must be bound mutably for `remove` to work.)
     let prepare_param_data_types: Vec<Option<FieldRef>> =
         if let Some(&max_idx) = prepare_param_data_types.keys().last() {
             (0..=max_idx)
                 .map(|i| prepare_param_data_types.remove(&i))
                 .collect()
         } else {
             vec![]
         };
   ```
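
   A runnable sketch of this gap-preserving conversion (again using `&str` 
stand-ins for `FieldRef`) shows the missing index resolving to `None`:

   ```rust
   use std::collections::BTreeMap;

   fn main() {
       // Same sparse map as in the bug description (index 1 missing).
       let mut map: BTreeMap<usize, &str> = BTreeMap::new();
       map.insert(0, "UInt32");
       map.insert(2, "Int32");

       // Walk 0..=max_idx so every placeholder position keeps its slot;
       // positions with no inferred type become None.
       let fixed: Vec<Option<&str>> = if let Some(&max_idx) = map.keys().last() {
           (0..=max_idx).map(|i| map.remove(&i)).collect()
       } else {
           vec![]
       };

       assert_eq!(fixed, vec![Some("UInt32"), None, Some("Int32")]);
       // $2 (index 1) now correctly resolves to None, i.e. "type unknown".
       assert_eq!(fixed[1], None);
   }
   ```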
   
   This requires changing the `prepare_param_data_types` type from 
`Vec<FieldRef>` to `Vec<Option<FieldRef>>` throughout the `PlannerContext` API 
to properly represent "type unknown for this placeholder position".
   
   I'd love to create a PR for this fix if you agree.
   
   Thanks!
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

