kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994965537
Putting more words to how I understand pushdown and data adaptation:

1. Pushdown: “Which rows or pages should I read?”
   - Input: your original predicate (e.g. `col("foo.b") > 5`) and the physical Parquet schema.
   - What the rewriter does:
     - Sees that `foo.b` doesn’t exist on disk → replaces `col("foo.b") > 5` with `lit(NULL) > 5`.
     - Or, if `foo.a` is stored as Int32 but the table expects Int64, it wraps `col("foo.a")` in a cast.
   - Result: you get a “safe” predicate that Parquet can evaluate against row-group statistics or pages without error.
   - Outcome: you prune away unneeded row groups, or skip pages, based on that rewritten expression.

   At the end of this step, no data has actually been materialized; you’ve only modified the expression you use to decide what to read.

2. Data adaptation: “How do I shape the in-memory batch to match the logical schema?”
   - Input: a RecordBatch (or StructArray) that you read directly from Parquet.
     - This batch is laid out exactly as on disk: it only has the columns that existed in that file’s schema, and nested structs only contain the old fields.
   - What the adapter does (`map_batch` / `cast_struct_column`):
     - Field matching: for each field in your logical (table) schema, look it up by name in the batch’s arrays.
     - Missing fields → insert a `new_null_array(...)` of the right datatype and row count.
     - Extra fields (present on disk but dropped in the table) → ignore them.
     - Nested structs → recurse into child struct arrays, doing the same match/fill/ignore/cast logic at each level.
   - Result: a brand-new StructArray (and overall RecordBatch) whose columns exactly line up with your table schema, even for deeply nested new fields.
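To make the two steps concrete, here is a minimal std-only sketch of both. It is a toy model, not the PR's code: `DataType`, `Expr`, `Column`, `field_type`, `rewrite`, `null_column` and `adapt` are hypothetical stand-ins for the Arrow/DataFusion types and functions, the predicate side treats `foo.b` as a flat column name, and only an Int32 to Int64 cast is modelled.

```rust
// Toy stand-ins for Arrow/DataFusion types (hypothetical, std-only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Struct(Vec<(String, DataType)>),
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Lit(i64),
    Null,
    Cast(Box<Expr>, DataType),
    Gt(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int32(Vec<Option<i32>>),
    Int64(Vec<Option<i64>>),
    // Named child columns plus an explicit row count.
    Struct(Vec<(String, Column)>, usize),
}

fn field_type(schema: &[(String, DataType)], name: &str) -> Option<DataType> {
    schema.iter().find(|(n, _)| n == name).map(|(_, t)| t.clone())
}

/// Step 1 (pushdown): make a table-schema predicate safe for one file's schema.
/// Only the expression changes; no data is read here.
fn rewrite(expr: &Expr, file: &[(String, DataType)], table: &[(String, DataType)]) -> Expr {
    match expr {
        Expr::Column(name) => match (field_type(file, name), field_type(table, name)) {
            // Column missing on disk: substitute a NULL literal.
            (None, _) => Expr::Null,
            // Stored type differs from the logical type: wrap in a cast.
            (Some(disk), Some(logical)) if disk != logical => {
                Expr::Cast(Box::new(expr.clone()), logical)
            }
            _ => expr.clone(),
        },
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(rewrite(l, file, table)),
            Box::new(rewrite(r, file, table)),
        ),
        Expr::Cast(inner, ty) => Expr::Cast(Box::new(rewrite(inner, file, table)), ty.clone()),
        Expr::Lit(_) | Expr::Null => expr.clone(),
    }
}

/// All-null column of the given type and length (the role `new_null_array` plays).
fn null_column(ty: &DataType, rows: usize) -> Column {
    match ty {
        DataType::Int32 => Column::Int32(vec![None; rows]),
        DataType::Int64 => Column::Int64(vec![None; rows]),
        DataType::Struct(fields) => Column::Struct(
            fields.iter().map(|(n, t)| (n.clone(), null_column(t, rows))).collect(),
            rows,
        ),
    }
}

/// Step 2 (data adaptation): reshape a column read from disk to the logical type.
fn adapt(col: &Column, target: &DataType) -> Result<Column, String> {
    match (col, target) {
        (Column::Int32(v), DataType::Int32) => Ok(Column::Int32(v.clone())),
        (Column::Int64(v), DataType::Int64) => Ok(Column::Int64(v.clone())),
        // Cast: widen Int32 values to Int64, keeping nulls.
        (Column::Int32(v), DataType::Int64) => {
            Ok(Column::Int64(v.iter().map(|&x| x.map(i64::from)).collect()))
        }
        (Column::Struct(children, rows), DataType::Struct(fields)) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, ty) in fields {
                let adapted = match children.iter().find(|(n, _)| n == name) {
                    Some((_, child)) => adapt(child, ty)?, // match by name, then recurse
                    None => null_column(ty, *rows),        // missing field: fill with nulls
                };
                out.push((name.clone(), adapted));
            }
            // Children not named in `fields` are never copied: extra fields are ignored.
            Ok(Column::Struct(out, *rows))
        }
        (other, ty) => Err(format!("cannot adapt {:?} to {:?}", other, ty)),
    }
}
```

In this model, `rewrite` only ever touches the expression tree, while `adapt` only ever touches data, which is the split described above. Driving the output from the logical schema (iterating `fields`, not `children`) is what makes the result line up with the table schema regardless of what the file contained.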
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org