Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

via GitHub Mon, 23 Jun 2025 00:52:16 -0700


kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994965537


   Putting more words to how I understand pushdown and data adaptation:
   
   1. Pushdown: “Which rows or pages should I read?”
   - Input: your original predicate (e.g. col("foo.b") > 5) and the physical Parquet schema.
   
   - What the rewriter does:
      - Sees that foo.b doesn’t exist on disk → replaces col("foo.b") > 5 with lit(NULL) > 5.
      - Or, if foo.a is stored as Int32 but the table expects Int64, it wraps col("foo.a") in a cast.
   
   - Result: you get a “safe” predicate that can be evaluated against the file’s row-group statistics or pages without error.
   
   - Outcome: you prune away unneeded row groups, or skip pages, based on that rewritten expression.
   
   At the end of this step, no data has actually been materialized; you’ve only modified the expression you use to decide what to read.
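
To make the rewrite step concrete, here is a minimal std-only sketch. It is a toy model, not DataFusion's API: `Ty`, `Expr`, and `adapt` are hypothetical names invented for illustration, and schemas are modeled as plain maps from column name to type. It shows only the two rewrites described above: a column missing from the file becomes a NULL literal, and a column whose file type differs from the table type is wrapped in a cast.

```rust
use std::collections::HashMap;

/// Toy data types (a stand-in for Arrow's DataType; assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Int64,
}

/// Toy expression tree (a stand-in for a physical expression).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(String),
    Null,
    Int(i64),
    Cast(Box<Expr>, Ty),
    Gt(Box<Expr>, Box<Expr>),
}

/// Rewrite `expr` so that it only refers to columns present in `file`,
/// typed the way `table` expects them.
fn adapt(expr: &Expr, table: &HashMap<String, Ty>, file: &HashMap<String, Ty>) -> Expr {
    match expr {
        Expr::Col(name) => match (file.get(name), table.get(name)) {
            // Column is not on disk: substitute a NULL literal.
            (None, _) => Expr::Null,
            // Column is on disk with a different type: cast to the table type.
            (Some(f), Some(t)) if f != t => Expr::Cast(Box::new(expr.clone()), t.clone()),
            // Column matches (or is unknown to the table): keep it as is.
            _ => expr.clone(),
        },
        Expr::Cast(inner, ty) => Expr::Cast(Box::new(adapt(inner, table, file)), ty.clone()),
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(adapt(l, table, file)),
            Box::new(adapt(r, table, file)),
        ),
        // Literals need no rewriting.
        Expr::Null | Expr::Int(_) => expr.clone(),
    }
}
```

With a table schema of {foo.a: Int64, foo.b: Int64} and a file schema of {foo.a: Int32}, calling `adapt` on the toy equivalent of col("foo.b") > 5 yields NULL > 5, and on col("foo.a") > 5 yields CAST(foo.a AS Int64) > 5. Nothing is read from the file in either case; only the expression changes.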
   
   2. Data adaptation: “How do I shape the in-memory batch to match the logical schema?”
   - Input: a RecordBatch (or StructArray) that you read directly from Parquet.
   
     - This batch is laid out exactly as on disk: it only has the columns that existed in that file’s schema, and nested structs only contain the old fields.
   
   - What the adapter does (map_batch / cast_struct_column):
     - Field matching: for each field in your logical (table) schema, look it up by name in the batch’s arrays.
     - Missing fields → insert a new_null_array(...) of the right datatype and row count.
     - Extra fields (present on disk but dropped in the table) → ignore them.
     - Nested structs → recurse into child struct arrays, doing the same match/fill/ignore/cast logic at each level.
   
   - Result: a brand-new StructArray (and overall RecordBatch) whose columns exactly line up with your table schema, even for deeply nested new fields.
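
The match/fill/ignore/recurse logic above can be sketched with a std-only toy model. This is not the actual map_batch / cast_struct_column code: `Column`, `FieldType`, `null_column`, and `adapt_fields` are hypothetical names made up for this illustration, columns are plain vectors rather than Arrow arrays, and the cast step is left out (a type mismatch simply falls back to a null column here, which is an assumption of the sketch).

```rust
/// Toy column data (a stand-in for Arrow arrays; assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<Option<i64>>),
    Struct(Vec<(String, Column)>),
}

/// Toy logical field types (a stand-in for the table schema).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int,
    Struct(Vec<(String, FieldType)>),
}

/// Build an all-null column of the given type and row count.
fn null_column(ty: &FieldType, rows: usize) -> Column {
    match ty {
        FieldType::Int => Column::Int(vec![None; rows]),
        FieldType::Struct(fields) => Column::Struct(
            fields
                .iter()
                .map(|(name, t)| (name.clone(), null_column(t, rows)))
                .collect(),
        ),
    }
}

/// Produce columns in `target` order: matched by name, null-filled when
/// missing, recursing into structs. Source fields not in `target` are dropped.
fn adapt_fields(
    target: &[(String, FieldType)],
    source: &[(String, Column)],
    rows: usize,
) -> Vec<(String, Column)> {
    target
        .iter()
        .map(|(name, ty)| {
            // Field matching: look the target field up by name in the source.
            let found = source.iter().find(|(n, _)| n == name).map(|(_, c)| c);
            let col = match (found, ty) {
                // Nested struct: apply the same logic to the children.
                (Some(Column::Struct(children)), FieldType::Struct(fields)) => {
                    Column::Struct(adapt_fields(fields, children, rows))
                }
                // Matching leaf column: reuse the data.
                (Some(Column::Int(values)), FieldType::Int) => Column::Int(values.clone()),
                // Missing (or, in this toy model, mismatched): fill with nulls.
                _ => null_column(ty, rows),
            };
            (name.clone(), col)
        })
        .collect()
}
```

Given a source batch with fields a, old, and s {x}, and a target schema of a, s {x, y}, b, the result keeps a and s.x, drops old, and adds all-null columns for s.y and b, each with the batch's row count.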


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

