adriangb commented on issue #15780: URL: https://github.com/apache/datafusion/issues/15780#issuecomment-2971805394
@alamb I tried to put together an example of schema evolution where the file has an Int32 column at the file schema level and the table has it as Int64. I can see the extra conversion happening if I put prints in the right place, _but_ I could not get it to show a meaningful difference in performance. I'm guessing that is because most of the time and variability comes from reading the data, parsing Parquet, etc., and once the data is in memory as Arrow, converting from Int32 to Int64 is trivial.

This did make me notice that we pretty much already have a special case of what I am proposing here: https://github.com/apache/datafusion/blob/5a2933e5878777c75c931b42327b1074bcd43d35/datafusion/datasource-parquet/src/opener.rs#L221-L234

It only deals with the View types, which I think are a special case of the general idea. The general idea is to look at the filters and projection and decide what the most efficient way to marry up the data is. Currently it is _always_ converting the data to match the filter / table field types after it has been read from Parquet. What I am thinking is:

1. For any cases where we can read from Parquet directly into the table type, do that (e.g. `StringView` and `Utf8` are stored the same in Parquet, and both `UInt8` and `UInt16` are stored as `INT32` in Parquet). This is basically what is happening with the view types already.
2. If a cast is required, prefer to cast the filter / scalar values or the smaller column (we can use similar logic to what we use to reorder filters).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
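To make point 2 of the comment above concrete, here is a minimal sketch of "cast the scalar instead of the column" for a predicate like `col > literal`, where the file stores `col` as Int32 and the table declares it as Int64. It uses plain Rust slices rather than Arrow arrays, and the function names are hypothetical; none of this is DataFusion's actual code. The out-of-range handling is an assumption about what a correct rewrite would need.

```rust
/// Baseline: widen every Int32 value to Int64, then compare with the Int64 literal.
/// This mirrors casting the whole column to the table type after reading it.
fn filter_by_casting_column(column: &[i32], literal: i64) -> Vec<bool> {
    column.iter().map(|&v| i64::from(v) > literal).collect()
}

/// Alternative: narrow the literal once and compare in the file's Int32 type.
/// (Hypothetical helper, for illustration only.)
fn filter_by_casting_literal(column: &[i32], literal: i64) -> Vec<bool> {
    match i32::try_from(literal) {
        // The literal fits in Int32: one scalar cast, no per-row cast.
        Ok(lit) => column.iter().map(|&v| v > lit).collect(),
        // The literal is above i32::MAX: no Int32 value can exceed it.
        Err(_) if literal > 0 => vec![false; column.len()],
        // The literal is below i32::MIN: every Int32 value exceeds it.
        Err(_) => vec![true; column.len()],
    }
}

fn main() {
    let column = [1, 5, 10, i32::MAX];
    // Check an in-range literal and one literal out of range on each side.
    for literal in [5i64, i64::from(i32::MAX) + 1, i64::from(i32::MIN) - 1] {
        assert_eq!(
            filter_by_casting_column(&column, literal),
            filter_by_casting_literal(&column, literal)
        );
    }
    println!("both strategies agree");
}
```

The scalar cast happens once per predicate rather than once per row, but as noted above the per-row cast is already cheap once the data is in memory, so any gain would have to be measured.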
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org