westonpace commented on PR #20597:
URL: https://github.com/apache/datafusion/pull/20597#issuecomment-4490311248
Here's some feedback from Claude:
```
P1 — from_join behavioral change is observable on the wire
The old code combined on and filter as AND(reduce_left(on_conditions),
filter). The new code uses conjunction(on_conditions.chain(filter)), which
folds right in DataFusion's conjunction
helper (it uses Iterator::reduce and Expr::and, but the associativity of
the resulting tree may differ). Semantically equivalent, but the emitted
Substrait byte representation will change
for joins with 2+ ON predicates plus a filter. Worth:
- Calling this out in the PR description.
- Adding a join-with-filter test (or sqllogictest) that exercises both on
and filter simultaneously and asserts the produced plan, so future refactors
don't silently change wire format
again.
P2 — Test coverage is thin
Only one new test, covering the simplest case (int64 + int64). Given the
scope (binary, unary, scalar, higher-order, join refactor, between refactor),
consider adding:
- Unary case (Expr::Not(...) → Boolean, nullable matches input).
- A nullable-vs-non-nullable case to confirm is_nullable() propagates into
the Substrait type's nullability.
- A from_between test that round-trips both BETWEEN and NOT BETWEEN — the
refactor here is non-trivial and lacks direct coverage.
- A from_join test asserting the join expression structure after the
refactor.
P3 — Expr::to_field(schema)? cost
Expr::to_field recursively re-derives types from the schema. With this
change, every from_binary_expr call now does a to_field over the whole
subexpression and recurses into each child,
which itself re-runs to_field on its subtree. That makes typing O(depth ×
nodes) over the expression tree. For typical SQL this is unnoticeable, but for
generated/synthetic deep
expressions it can be a quadratic blow-up. Not a blocker, but worth
knowing — caching (data_type, nullable) on the way down (or accepting an
already-computed type from the caller) would
fix it.
P4 — Minor cleanups
- Expr::ScalarFunction(fun.clone()).to_field(schema)? and
Expr::BinaryExpr(expr.clone()).to_field(schema)? clone the expression solely to
call a trait method on &Expr. If ExprSchemable
exposed equivalents on the inner types (ScalarFunction, BinaryExpr), the
clones could be avoided. Not introduced by this PR, but the new code makes them
more prevalent.
- In from_between, the *expr.clone() pattern occurs four times in the
BETWEEN-not-negated branch:
Expr::and(
Expr::lt_eq(*low.clone(), *expr.clone()),
Expr::lt_eq(*expr.clone(), *high.clone()),
)
- A let expr_value = (*expr).clone(); once at the top would tighten this.
- The new test is #[tokio::test] but the body is fully synchronous —
#[test] is sufficient.
- to_substrait_type(producer, output_field.data_type(),
output_field.is_nullable()) is called 3 times. The existing
to_substrait_type_from_field would be a slightly cleaner call (after
wrapping Field → FieldRef). Pure nit.
```
I don't think I'm two worried about P1. The plans are still roughly
equivalent. Might be worth just calling out in the PR description.
I agree with P2, in particular, we should have a test for the join case.
I think P3 is inevitable and somewhat accepted at the moment.
Agree that the things in P4 are minor, take them or leave them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]