adriangb commented on issue #15780:
URL: https://github.com/apache/datafusion/issues/15780#issuecomment-2971805394

   @alamb I tried to put together an example of schema evolution where the 
file has an Int32 column at the file schema level and the table has it as 
Int64. I can see the extra conversion happening if I put prints in the right 
place, _but_ I could not get it to show a meaningful difference in 
performance. I'm guessing that is because most of the time and variability 
comes from reading the data, parsing Parquet, etc., and once the data is in 
memory as Arrow, converting from Int32 to Int64 is trivial.
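
   To make the conversion in question concrete, here is a minimal sketch 
that uses plain Rust slices as stand-ins for Arrow arrays (the function names 
are hypothetical and this is not DataFusion code). It evaluates `col > lit` 
either by widening every Int32 value to Int64, or by narrowing the Int64 
literal once and comparing in Int32:

```rust
/// Evaluate `col > lit` by widening every Int32 value to Int64 first.
/// This mirrors casting the whole column to the table type.
fn gt_cast_column(col: &[i32], lit: i64) -> Vec<bool> {
    col.iter().map(|&v| i64::from(v) > lit).collect()
}

/// Evaluate the same predicate by narrowing the literal once instead.
/// This mirrors casting the filter scalar to the file type.
fn gt_cast_literal(col: &[i32], lit: i64) -> Vec<bool> {
    match i32::try_from(lit) {
        // The literal fits in Int32, so compare without touching the column.
        Ok(lit32) => col.iter().map(|&v| v > lit32).collect(),
        // The literal is outside the Int32 range, so the result is constant:
        // every value is greater than a too-small literal, none is greater
        // than a too-large one.
        Err(_) => vec![lit < 0; col.len()],
    }
}
```

   Both functions return the same mask for any input; the second one just 
does a single scalar conversion instead of one per row.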
   
   This did make me notice: we pretty much already have a special case of what 
I am proposing here: 
https://github.com/apache/datafusion/blob/5a2933e5878777c75c931b42327b1074bcd43d35/datafusion/datasource-parquet/src/opener.rs#L221-L234
   
   It only deals with the View types, which I think are a special case of the 
general idea.
   
   The general idea is to look at the filters and projection and decide what 
the most efficient way to reconcile the file and table types is. Currently we 
_always_ convert the data to match the filter / table field types after it 
has been read from Parquet. What I am thinking is:
   1. For any cases where we can read from parquet directly into the table type 
do that (e.g. `StringView` and `Utf8` are stored the same in Parquet, both 
`UInt8` and `UInt16` are stored as `INT32` in Parquet). This is basically what 
is happening with the view types already.
   2. If a cast is required, prefer to cast the filter / scalar values or the 
smaller column (we can use logic similar to what we use to reorder filters).
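
   The two rules above could be sketched roughly as follows. This is only an 
illustration under simplifying assumptions: `Ty`, `Physical`, `Strategy` and 
`choose_strategy` are hypothetical names, not DataFusion or Arrow APIs, and 
the sketch treats "same Parquet physical type" as sufficient for a direct 
read, whereas a real implementation would also have to check that the values 
fit the target type. It only handles the scalar case of rule 2, not the 
"smaller column" case.

```rust
/// Simplified stand-in for Arrow data types (not the real `DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty { Int32, Int64, UInt8, UInt16, Utf8, Utf8View }

/// Simplified stand-in for Parquet physical types.
#[derive(Debug, PartialEq, Eq)]
enum Physical { Int32, Int64, ByteArray }

/// How each logical type is physically stored in Parquet.
fn physical(ty: Ty) -> Physical {
    match ty {
        Ty::Int32 | Ty::UInt8 | Ty::UInt16 => Physical::Int32,
        Ty::Int64 => Physical::Int64,
        Ty::Utf8 | Ty::Utf8View => Physical::ByteArray,
    }
}

/// The possible ways to reconcile a file column with the table column.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    /// Rule 1: decode straight into the table type, no cast afterwards.
    ReadDirect,
    /// Rule 2: rewrite the filter scalar to the file type, leave data as is.
    CastLiteral,
    /// Fallback (current behavior): cast the column after reading.
    CastColumn,
}

/// Pick a strategy for one column and an optional integer filter literal.
fn choose_strategy(file_ty: Ty, table_ty: Ty, literal: Option<i64>) -> Strategy {
    // Rule 1: same physical storage, so read directly into the table type.
    if physical(file_ty) == physical(table_ty) {
        return Strategy::ReadDirect;
    }
    // Rule 2: casting one scalar is cheaper than casting every row, but only
    // valid when the scalar is representable in the file type.
    match (file_ty, literal) {
        (Ty::Int32, Some(v)) if i32::try_from(v).is_ok() => Strategy::CastLiteral,
        _ => Strategy::CastColumn,
    }
}
```

   For example, `choose_strategy(Ty::Utf8, Ty::Utf8View, None)` yields 
`ReadDirect`, while an Int32 file column under an Int64 table column with a 
literal of `5` yields `CastLiteral`.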


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

