gatesn commented on issue #2581:
URL: https://github.com/apache/datafusion/issues/2581#issuecomment-2569432044

   I'm not sure I agree that these are two separate ideas, rather, a 
generalization of the existing notion of projection.
   
   Projection today is all about selecting some subset of columns _as-is_ from 
a table. A projection expression generalizes this to selecting some subset of 
columns _in some form_ from a table.
   
   I also don't think it adds that much complexity. Any sufficiently advanced 
table provider already has the logic to compute the equivalent of a projection 
mask from an expression for filter push-down. The code paths would be very 
similar.
   
   There are also real optimizations available here. For example, suppose I 
write an Arrow int8 column to Parquet. The Arrow schema is serialized into 
Parquet metadata so at read time the column is read back as int8. If a scalar 
expression tries to sum this column with an i32, e.g. `SELECT col + 10i32`, 
then DataFusion inserts an upcast. Today, this results in decoding the Parquet 
column (whose smallest physical integer type is int32) into an Arrow int32 
array, then [casting to an 
int8](https://github.com/apache/arrow-rs/blob/b77d38d022079b106449ead3996f373edc906327/parquet/src/arrow/array_reader/primitive_array.rs#L273),
 then DataFusion casting back to an int32.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to