gatesn commented on issue #2581: URL: https://github.com/apache/datafusion/issues/2581#issuecomment-2569432044
I'm not sure I agree that these are two separate ideas. I see this as a generalization of the existing notion of projection. Projection today is all about selecting some subset of columns _as-is_ from a table. A projection expression generalizes this to selecting some subset of columns _in some form_ from a table.

I also don't think it adds that much complexity. Any sufficiently advanced table provider already has the logic to compute the equivalent of a projection mask from an expression for filter push-down. The code paths would be very similar.

There are also real optimizations available here. For example, suppose I write an Arrow int8 column to Parquet. The Arrow schema is serialized into the Parquet metadata, so at read time the column is read back as int8. If a scalar expression adds an i32 to this column, e.g. `SELECT col + 10i32`, then DataFusion inserts an upcast. Today, this results in decoding the Parquet column (whose smallest physical integer type is int32) into an Arrow int32 array, then [casting to an int8](https://github.com/apache/arrow-rs/blob/b77d38d022079b106449ead3996f373edc906327/parquet/src/arrow/array_reader/primitive_array.rs#L273), then DataFusion casting back to an int32.
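
To make the "projection mask from an expression" point concrete, here is a minimal sketch of how a table provider might derive the set of columns to read from a list of projection expressions. The `Expr` enum and the `projection_mask` function are hypothetical names invented for this illustration. They are not DataFusion's `Expr` or any real API, and a real provider would work against the actual schema and expression types.

```rust
use std::collections::BTreeSet;

/// Hypothetical, minimal expression tree (NOT DataFusion's `Expr`).
#[derive(Debug, Clone)]
pub enum Expr {
    /// Reference to a column by its index in the table schema.
    Column(usize),
    /// An integer literal such as `10`.
    Literal(i64),
    /// A cast of the inner expression to a named target type.
    Cast(Box<Expr>, &'static str),
    /// A binary operation such as `a + b`; the operator is kept as a char.
    Binary(Box<Expr>, char, Box<Expr>),
}

/// Walk the expression tree and record every column index it reads.
fn collect_columns(expr: &Expr, out: &mut BTreeSet<usize>) {
    match expr {
        Expr::Column(index) => {
            out.insert(*index);
        }
        Expr::Literal(_) => {}
        Expr::Cast(inner, _) => collect_columns(inner, out),
        Expr::Binary(left, _, right) => {
            collect_columns(left, out);
            collect_columns(right, out);
        }
    }
}

/// Compute a projection mask: the sorted, deduplicated column indices
/// needed to evaluate all of the given expressions.
pub fn projection_mask(exprs: &[Expr]) -> Vec<usize> {
    let mut columns = BTreeSet::new();
    for expr in exprs {
        collect_columns(expr, &mut columns);
    }
    columns.into_iter().collect()
}
```

Under these assumptions, an expression like `CAST(col2 AS Int32) + 10` yields a mask containing only column 2. The same tree walk works whether the expression arrives as a filter or as a projection, which is the sense in which the code paths would be similar.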