kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994965537
To spell out how I understand pushdown and data adaptation:
1. Pushdown: “Which rows or pages should I read?”
   - Input: your original predicate (e.g. col("foo.b") > 5) and the physical
     Parquet schema.
   - What the rewriter does:
     - Sees that foo.b doesn’t exist on disk → replaces col("foo.b") > 5 with
       lit(NULL) > 5.
     - Or if foo.a is stored as Int32 but the table expects Int64, it wraps
       col("foo.a") in a cast.
   - Result: you get a “safe” predicate that Parquet can evaluate against
     row-group statistics or pages without error.
   - Outcome: you prune away unneeded row groups, or skip pages, based on that
     rewritten expression.

At the end of this step, no data has actually been materialized; you’ve only
modified the expression you use to decide what to read.
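
To make the rewrite step concrete, here is a minimal sketch in plain Rust. The Expr and DataType enums and the rewrite_for_file function are hypothetical stand-ins invented for this illustration; they are not DataFusion's or Arrow's types, and the real rewriter handles many more expression kinds. The sketch only shows the two cases described above: a missing column becomes a NULL literal, and a column whose on-disk type differs from the table type is wrapped in a cast.

```rust
use std::collections::HashMap;

// Toy data types: hypothetical stand-ins, not Arrow's DataType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

// Toy expression tree: a hypothetical stand-in for a physical expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(Option<i64>), // None models lit(NULL)
    Cast(Box<Expr>, DataType),
    Gt(Box<Expr>, Box<Expr>),
}

// Rewrite `expr`, written against the table schema, so that it only
// refers to columns (and types) that exist in the file schema.
fn rewrite_for_file(
    expr: &Expr,
    table: &HashMap<String, DataType>,
    file: &HashMap<String, DataType>,
) -> Expr {
    match expr {
        Expr::Column(name) => match (file.get(name), table.get(name)) {
            // Column is absent from the file: substitute a NULL literal.
            (None, _) => Expr::Literal(None),
            // Column exists with a different on-disk type: cast to the table type.
            (Some(on_disk), Some(expected)) if on_disk != expected => {
                Expr::Cast(Box::new(expr.clone()), *expected)
            }
            // Types already agree: keep the column reference unchanged.
            _ => expr.clone(),
        },
        // Recurse into children so nested references are rewritten too.
        Expr::Gt(left, right) => Expr::Gt(
            Box::new(rewrite_for_file(left, table, file)),
            Box::new(rewrite_for_file(right, table, file)),
        ),
        Expr::Cast(inner, ty) => {
            Expr::Cast(Box::new(rewrite_for_file(inner, table, file)), *ty)
        }
        Expr::Literal(_) => expr.clone(),
    }
}
```

With a table schema of {foo.a: Int64, foo.b: Int64} and a file schema of {foo.a: Int32}, rewriting col("foo.b") > 5 yields lit(NULL) > 5, and col("foo.a") becomes a cast of col("foo.a") to Int64. No rows are touched; only the expression changes.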
2. Data adaptation: “How do I shape the in-memory batch to match the
   logical schema?”
   - Input: a RecordBatch (or StructArray) that you read directly from Parquet.
     - This batch is laid out exactly as on disk: it only has the columns that
       existed in that file’s schema, and nested structs only contain the old
       fields.
   - What the adapter does (map_batch / cast_struct_column):
     - Field matching: for each field in your logical (table) schema, look it
       up by name in the batch’s arrays.
     - Missing fields → insert a new_null_array(...) of the right datatype and
       row count.
     - Extra fields (present on disk but dropped in the table) → ignore them.
     - Nested structs → recurse into child struct arrays, doing the same
       match/fill/ignore/cast logic at each level.
   - Result: a brand-new StructArray (and overall RecordBatch) whose columns
     exactly line up with your table schema, even for deeply nested new fields.
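
The match/fill/ignore/cast recursion can be sketched the same way. Everything below (Array, DataType, Field, null_array, adapt) is a hypothetical toy model written for this illustration, not the Arrow or DataFusion API. In particular, the toy struct has no validity bitmap, so a "null" struct is modeled as a struct whose children are all null.

```rust
// Toy schema types: hypothetical stand-ins, not Arrow's.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

// Toy column data: each child of a struct is stored with its field name.
#[derive(Debug, Clone, PartialEq)]
enum Array {
    Int32(Vec<Option<i32>>),
    Int64(Vec<Option<i64>>),
    Struct { len: usize, children: Vec<(String, Array)> },
}

// Build an all-null array of the requested type and row count.
fn null_array(data_type: &DataType, len: usize) -> Array {
    match data_type {
        DataType::Int32 => Array::Int32(vec![None; len]),
        DataType::Int64 => Array::Int64(vec![None; len]),
        DataType::Struct(fields) => Array::Struct {
            len,
            children: fields
                .iter()
                .map(|f| (f.name.clone(), null_array(&f.data_type, len)))
                .collect(),
        },
    }
}

// Reshape `source` (laid out as on disk) to match `target` (the table type).
fn adapt(source: &Array, target: &DataType, len: usize) -> Result<Array, String> {
    match (source, target) {
        (Array::Struct { children, .. }, DataType::Struct(fields)) => {
            let mut out = Vec::with_capacity(fields.len());
            // Drive the loop from the target fields, so extra source
            // children are never visited and are dropped implicitly.
            for field in fields {
                let adapted = match children.iter().find(|(name, _)| name == &field.name) {
                    // Match: recurse, which also handles nested structs and casts.
                    Some((_, child)) => adapt(child, &field.data_type, len)?,
                    // Fill: the field is missing on disk, so use nulls.
                    None => null_array(&field.data_type, len),
                };
                out.push((field.name.clone(), adapted));
            }
            Ok(Array::Struct { len, children: out })
        }
        // Cast: widen Int32 values stored on disk to the Int64 the table expects.
        (Array::Int32(values), DataType::Int64) => Ok(Array::Int64(
            values.iter().copied().map(|v| v.map(i64::from)).collect(),
        )),
        // Types already agree: reuse the data unchanged.
        (Array::Int32(_), DataType::Int32) | (Array::Int64(_), DataType::Int64) => {
            Ok(source.clone())
        }
        (other, wanted) => Err(format!("cannot adapt {:?} to {:?}", other, wanted)),
    }
}
```

Iterating over the target fields rather than the source children is what makes the output line up with the table schema by construction: the result has exactly the target's fields, in the target's order, regardless of what the file contained.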