Re: [PR] fix: Set Substrait output types for expressions [datafusion]

via GitHub Tue, 19 May 2026 10:26:44 -0700


westonpace commented on PR #20597:
URL: https://github.com/apache/datafusion/pull/20597#issuecomment-4490311248


   Here's some feedback from Claude:
   
   ```
     P1 — from_join behavioral change is observable on the wire
   
     The old code combined on and filter as AND(reduce_left(on_conditions), 
filter). The new code uses conjunction(on_conditions.chain(filter)), which 
folds right in DataFusion's conjunction
     helper (it uses Iterator::reduce and Expr::and, but the associativity of 
the resulting tree may differ). Semantically equivalent, but the emitted 
Substrait byte representation will change
     for joins with 2+ ON predicates plus a filter. Worth:
     - Calling this out in the PR description.
     - Adding a join-with-filter test (or sqllogictest) that exercises both on 
and filter simultaneously and asserts the produced plan, so future refactors 
don't silently change wire format
     again.
   
     P2 — Test coverage is thin
   
     Only one new test, covering the simplest case (int64 + int64). Given the 
scope (binary, unary, scalar, higher-order, join refactor, between refactor), 
consider adding:
     - Unary case (Expr::Not(...) → Boolean, nullable matches input).
     - A nullable-vs-non-nullable case to confirm is_nullable() propagates into 
the Substrait type's nullability.
     - A from_between test that round-trips both BETWEEN and NOT BETWEEN — the 
refactor here is non-trivial and lacks direct coverage.
     - A from_join test asserting the join expression structure after the 
refactor.
   
     P3 — Expr::to_field(schema)? cost
   
     Expr::to_field recursively re-derives types from the schema. With this 
change, every from_binary_expr call now does a to_field over the whole 
subexpression and recurses into each child,
     which itself re-runs to_field on its subtree. That makes typing O(depth × 
nodes) over the expression tree. For typical SQL this is unnoticeable, but for 
generated/synthetic deep
     expressions it can be a quadratic blow-up. Not a blocker, but worth 
knowing — caching (data_type, nullable) on the way down (or accepting an 
already-computed type from the caller) would
     fix it.
   
     P4 — Minor cleanups
   
     - Expr::ScalarFunction(fun.clone()).to_field(schema)? and 
Expr::BinaryExpr(expr.clone()).to_field(schema)? clone the expression solely to 
call a trait method on &Expr. If ExprSchemable
     exposed equivalents on the inner types (ScalarFunction, BinaryExpr), the 
clones could be avoided. Not introduced by this PR, but the new code makes them 
more prevalent.
     - In from_between, the *expr.clone() pattern occurs four times in the 
BETWEEN-not-negated branch:
     Expr::and(
         Expr::lt_eq(*low.clone(), *expr.clone()),
         Expr::lt_eq(*expr.clone(), *high.clone()),
     )
     - A let expr_value = (*expr).clone(); once at the top would tighten this.
     - The new test is #[tokio::test] but the body is fully synchronous — 
#[test] is sufficient.
     - to_substrait_type(producer, output_field.data_type(), 
output_field.is_nullable()) is called 3 times. The existing 
to_substrait_type_from_field would be a slightly cleaner call (after
     wrapping Field → FieldRef). Pure nit.
   ```
   
   I don't think I'm two worried about P1.  The plans are still roughly 
equivalent.  Might be worth just calling out in the PR description.
   
   I agree with P2, in particular, we should have a test for the join case.
   
   I think P3 is inevitable and somewhat accepted at the moment.
   
   Agree that the things in P4 are minor, take them or leave them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Set Substrait output types for expressions [datafusion]

Reply via email to