kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3003420880
hi @adriangb > could you take a look at https://github.com/apache/datafusion/commit/32725dd621ec5e96caf1970433f3549dca977a80? 👍👍👍 The new tests in `PhysicalExprSchemaRewriter`’s suite—including your “`test_adapt_batches`” example—*do* demonstrate that we can: 1. **Rewrite** a projection (or filter) against a physical `RecordBatch`, 2. **Evaluate** those rewritten `PhysicalExpr`s on the old batch to 3. **Produce** a brand-new `RecordBatch` that (a) injects nulls for missing top-level columns, (b) applies casts, and (c) drops extra columns—all in one go. So yes, for **flat** schemas and simple projections/filters, we can entirely sidestep a separate `map_batch` / `cast_struct_column` step by: - Generating an expression per target column (either `col(...)` if present or `lit(NULL)` if absent), - Letting the engine “evaluate” those expressions into new arrays, and - Bundling them into a fresh `RecordBatch`. ✅ **Where the rewrite-only approach shines** - **Simplicity** for top-level columns. We only need `col` + `lit(NULL)` + `CastExpr`. - **Unified code path**: predicates *and* projections both go through the same rewriter + evaluator. - **Less bespoke iterator logic**: no custom `StructArray` walks, no recursive field-matching loops. --- ⚠️ **Where the schema-adapter approach still wins** 1. **Deeply nested structs** - There’s *no* built-in “struct constructor” expression in DataFusion’s evaluator that I know of Our rewrite + `batch_project` hack only handles top-level arrays. We can’t easily say *“build a `StructArray` whose fields are (`col("a.b")`, `lit(NULL)`, `cast(col("a.c")`, …))”* purely in expression form 2. **Performance** - Expression evaluation involves building `ArrayRef`s by walking millions of rows through the `PhysicalExpr` vtable. - The adapter’s `cast_struct_column` does one recursive scan through each `StructArray`'s memory, which is far more cache-friendly for bulk columnar operations. 3. **Full schema fidelity** - The rewrite test only demonstrates: - *Drop* “extra” columns, - *Inject null* for missing *top-level* columns, - *Cast* primitive types. - It doesn’t cover: - **Adding** a new nested struct (we’d need to build that `StructArray` via expressions we don’t have), - **Recursively** updating sub-children, - **Preserving** null bit-maps across nested levels. ``` > Complex handling for deeply nested types. I do think this is a concern, I'm not sure how hard it would be to actually implement, but it's theoretically very possible ``` ### Why it’s possible but a lot of work (and a performance risk) 1. **Engineering effort** - We’ll be growing the rewriter from a column/cast replacer into a full recursive schema-walker + struct-node constructor. - We’ll have to handle every corner case: non-nullable nested fields, mixed present+missing children, ordering of fields, metadata, etc. 2. **Runtime performance** - Every invocation in the evaluator will loop over *rows* to build child arrays, then pack them into a `StructArray`. - That’s orders of magnitude slower than a tight `cast_struct_column` implementation that does one bulk pass through the existing `StructArray` buffers. I hope I don't sound like I am dead against the rewrite approach. It is more like you have shown me a puzzle that I don't know how to solve. ### What I would love to hear Here's a simpler and faster approach ..... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 