raghav-reglobe commented on PR #22988:
URL: https://github.com/apache/datafusion/pull/22988#issuecomment-4878506556

   Took this branch for a spin against a provider-side consumer (we're building 
an Iceberg SCD2 upsert on this hook), and I think there's one blocking issue 
the current tests don't reach: **any target-column reference in the `ON` or 
`WHEN` expressions fails analysis**, so a realistic MERGE can't execute through 
`SessionContext::sql()`.
   
   Minimal repro (custom provider registered as `t`, MemTable as `batch`):
   
   ```sql
   MERGE INTO t USING batch s ON t.id = s.id
   WHEN MATCHED THEN UPDATE SET val = s.val
   ```
   ```
   Error: Context("type_coercion", SchemaError(FieldNotFound { field: Column { 
relation: Some(Bare { table: "t" }), name: "id" }, valid_fields: [s.id, s.val, 
…] }))
   ```
   
   What's happening: the `Dml` node's only input is the planned USING source, 
but `MergeIntoOp`'s expressions reference the combined target×source namespace. 
Since this PR wires those expressions into the generic tree-node traversal 
(`tree_node.rs`), the analyzer's `TypeCoercion` pass rewrites them against the 
node's input schema — which is source-only — and errors on the first target 
reference. As a probe, `ON s.id = s.id` (source-only) sails through analysis 
and reaches `merge_into()` fine, which pins the root cause. Everything else 
held up nicely in testing, for what it's worth — a window-function subquery as 
the USING source (`ROW_NUMBER`/`LEAD`), multi-condition `ON`, clause 
predicates, and the INSERT validation all plan exactly as expected.
   
   Two directions I can see:
   
   1. **Minimal**: don't expose `MergeIntoOp` exprs through the generic 
traversal (keep `exprs()`/`with_new_exprs` as inherent methods only). The 
analyzer/optimizer then leave the node alone, and providers receive un-coerced 
expressions — acceptable for providers that do their own resolution, and it 
ships a working v1.
   2. **Fuller**: record the target's resolution qualifier on `MergeIntoOp` 
(the alias when present, else the table reference — today the alias is dropped 
after planning), and teach `TypeCoercion` to resolve MergeInto exprs against 
target-schema ⋈ input-schema. Coercion on these expressions genuinely matters 
(e.g. an `Int32` target key against an `Int64` window output in `ON`), and the 
recorded qualifier also gives providers an explicit way to tell target 
references from source references instead of inferring by absence.
   
   I'd lean toward 2, but 1 seems fine to unblock if you'd rather keep this PR 
small. Happy to share the test module I used — a `CaptureMergeProvider` in the 
style of `dml_planning.rs` with basic + SCD2-shaped (window-subquery USING, 
triple-condition ON, clause predicate) cases — if it's useful as a starting 
point for end-to-end coverage here. Also happy to put up a patch for either 
direction if that helps.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to