adriangb opened a new issue, #16800:
URL: https://github.com/apache/datafusion/issues/16800

   As discussed in https://github.com/apache/datafusion/pull/16791, my 
long-term plan (which I would like to discuss with the community) is to 
replace `SchemaAdapter` with `PhysicalExprAdapter`.
   
   There are multiple reasons for this:
   - We can better optimize scenarios like missing columns or casts. For 
example, it's cheaper to cast a literal and evaluate it against the data as 
read from the file than it is to read the data from the file and cast that to 
the type of the literal. It is also cheaper to evaluate the expression `1 > 
col1` as `1 > null` when `col1` is missing than it is to create an array of 
nulls. Since we can also simplify `PhysicalExpr` we can even simplify `1 > 
null` into just `null`.
   - It's easier to manipulate `PhysicalExpr`s than it is to manipulate arrays. 
We already have machinery (`TreeNode` APIs, etc.) to do so.
   - This is necessary to be able to push down projections into file scans, 
which we need for the upcoming 
[Variant work](https://github.com/apache/arrow-rs/issues/6736), and it will 
also allow us to read single fields in a struct without reading the entire 
struct into memory.
   - It paves the way for other advanced optimizations, e.g. we could do 
crazy stuff like reading only the dictionary page of a Parquet column for a 
filter `col = 'a'` and, if `'a'` is not in the dictionary, not even bothering 
to read the keys.
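
   To make the first point concrete, here is a minimal sketch of the idea in 
plain Rust. It is not DataFusion code: the `Value` and `Expr` enums and the 
`adapt` and `simplify` functions are hypothetical stand-ins for `PhysicalExpr`, 
the adapter, and the simplifier. It only shows how `1 > col1` can become 
`1 > null` and then `null` when `col1` is missing from the file, without ever 
building an array of nulls.

```rust
/// Toy scalar value; `Null` stands in for a SQL NULL literal.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

/// Toy expression tree (hypothetical; not DataFusion's `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(Value),
    Gt(Box<Expr>, Box<Expr>),
}

/// Rewrite references to columns that are absent from the file into NULL
/// literals. Columns that exist in the file are left untouched.
fn adapt(expr: Expr, file_columns: &[&str]) -> Expr {
    match expr {
        Expr::Column(name) if !file_columns.iter().any(|c| *c == name.as_str()) => {
            Expr::Literal(Value::Null)
        }
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(adapt(*l, file_columns)),
            Box::new(adapt(*r, file_columns)),
        ),
        other => other,
    }
}

/// Fold a comparison with a NULL literal on either side into NULL.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Gt(l, r) => {
            let l = simplify(*l);
            let r = simplify(*r);
            let has_null = matches!(l, Expr::Literal(Value::Null))
                || matches!(r, Expr::Literal(Value::Null));
            if has_null {
                Expr::Literal(Value::Null)
            } else {
                Expr::Gt(Box::new(l), Box::new(r))
            }
        }
        other => other,
    }
}
```

   Calling `simplify(adapt(expr, &["col2"]))` on the tree for `1 > col1` 
yields a single NULL literal, while the same call with `&["col1"]` returns 
the expression unchanged. The real adapter would also need to handle casts 
and the other expression types.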
   
   We've already implemented a replacement system for predicate pushdown via 
`PhysicalExprAdapter` and have examples showing how to do some of the things a 
custom `SchemaAdapter` can do.
   Once we implement https://github.com/apache/datafusion/issues/14993 we'll be 
able to deprecate `SchemaAdapter` for the most part.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

