kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-3003420880
hi @adriangb > could you take a look at https://github.com/apache/datafusion/commit/32725dd621ec5e96caf1970433f3549dca977a80? 👍👍👍 The new tests in `PhysicalExprSchemaRewriter`’s suite—including your “`test_adapt_batches`” example—*do* demonstrate that we can: 1. **Rewrite** a projection (or filter) against a physical `RecordBatch`, 2. **Evaluate** those rewritten `PhysicalExpr`s on the old batch to 3. **Produce** a brand-new `RecordBatch` that (a) injects nulls for missing top-level columns, (b) applies casts, and (c) drops extra columns—all in one go. So yes, for **flat** schemas and simple projections/filters, we can entirely sidestep a separate `map_batch` / `cast_struct_column` step by: - Generating an expression per target column (either `col(...)` if present or `lit(NULL)` if absent), - Letting the engine “evaluate” those expressions into new arrays, and - Bundling them into a fresh `RecordBatch`. ✅ **Where the rewrite-only approach shines** - **Simplicity** for top-level columns. We only need `col` + `lit(NULL)` + `CastExpr`. - **Unified code path**: predicates *and* projections both go through the same rewriter + evaluator. - **Less bespoke iterator logic**: no custom `StructArray` walks, no recursive field-matching loops. --- ⚠️ **Where the schema-adapter approach still wins** 1. **Deeply nested structs** - There’s *no* built-in “struct constructor” expression in DataFusion’s evaluator that I know of Our rewrite + `batch_project` hack only handles top-level arrays. We can’t easily say *“build a `StructArray` whose fields are (`col("a.b")`, `lit(NULL)`, `cast(col("a.c")`, …))”* purely in expression form 2. **Performance** - Expression evaluation involves building `ArrayRef`s by walking millions of rows through the `PhysicalExpr` vtable. - The adapter’s `cast_struct_column` does one recursive scan through each `StructArray`'s memory, which is far more cache-friendly for bulk columnar operations. 3. **Full schema fidelity** - The rewrite test only demonstrates: - *Drop* “extra” columns, - *Inject null* for missing *top-level* columns, - *Cast* primitive types. - It doesn’t cover: - **Adding** a new nested struct (we’d need to build that `StructArray` via expressions we don’t have), - **Recursively** updating sub-children, - **Preserving** null bit-maps across nested levels. ``` > Complex handling for deeply nested types. I do think this is a concern, I'm not sure how hard it would be to actually implement, but it's theoretically very possible ``` ### Why it’s possible but a lot of work (and a performance risk) 1. **Engineering effort** - We’ll be growing the rewriter from a column/cast replacer into a full recursive schema-walker + struct-node constructor. - We’ll have to handle every corner case: non-nullable nested fields, mixed present+missing children, ordering of fields, metadata, etc. 2. **Runtime performance** - Every invocation in the evaluator will loop over *rows* to build child arrays, then pack them into a `StructArray`. - That’s orders of magnitude slower than a tight `cast_struct_column` implementation that does one bulk pass through the existing `StructArray` buffers. I hope I don't sound like I am dead against the rewrite approach. It is more like you have shown me a puzzle that I don't know how to solve. ### What I would love to hear Here's a simpler and faster approach ..... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 