alamb commented on issue #2581:
URL: https://github.com/apache/datafusion/issues/2581#issuecomment-2571424336

   > There are also real optimizations available here. For example, suppose I 
write an Arrow int8 column to Parquet. The Arrow schema is serialized into 
Parquet metadata so at read time the column is read back as int8. If a scalar 
expression tries to sum this column with an i32, e.g. `SELECT col + 10i32`, 
then DataFusion inserts an upcast. Today, this results in decoding the Parquet 
column (whose smallest physical integer type is int32) into an Arrow int32 
array, then [casting to an 
int8](https://github.com/apache/arrow-rs/blob/b77d38d022079b106449ead3996f373edc906327/parquet/src/arrow/array_reader/primitive_array.rs#L273),
 then DataFusion casting back to an int32.
   
   I tried to reproduce the issue you described and could not.
   
   Specifically, I think in this case DataFusion actually casts the `10` to 
`Int8` and evaluates it directly against the contents of the column. Here is 
what I tried:
   
   ```sql
   DataFusion CLI v44.0.0
   > copy (select arrow_cast(1, 'Int8') as x) to '/tmp/foo.parquet';
   +-------+
   | count |
   +-------+
   | 1     |
   +-------+
   1 row(s) fetched.
   Elapsed 0.062 seconds.
   
   > describe '/tmp/foo.parquet';
   +-------------+-----------+-------------+
   | column_name | data_type | is_nullable |
   +-------------+-----------+-------------+
   | x           | Int8      | NO          |
   +-------------+-----------+-------------+
   1 row(s) fetched.
   Elapsed 0.012 seconds.
   
   > explain select x = arrow_cast(10, 'Int32') from '/tmp/foo.parquet';
   +---------------+-------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                  |
   +---------------+-------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: /tmp/foo.parquet.x = Int8(10) AS /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32")) |
   |               |   TableScan: /tmp/foo.parquet projection=[x]                                                          |
   | physical_plan | ProjectionExec: expr=[x@0 = 10 as /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32"))]           |
   |               |   ParquetExec: file_groups={1 group: [[tmp/foo.parquet]]}, projection=[x]                             |
   |               |                                                                                                       |
   +---------------+-------------------------------------------------------------------------------------------------------+
   2 row(s) fetched.
   Elapsed 0.017 seconds.
   ```
   
   
   Specifically, this line:
   > | physical_plan | ProjectionExec: expr=[x@0 = 10 as /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32"))] |
    
   The `x@0 = 10` means the `10` was cast to `Int8` to match the column type, 
not the other way around. This is done by the `UnwrapCastInComparison` 
optimizer rule.
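
   To make the idea concrete, here is a rough, simplified sketch of the 
rewrite in Python. It is not DataFusion's implementation: the function name, 
the `INT_RANGES` table, and the return convention are all hypothetical, and it 
only covers an integer literal compared against an integer column.

```python
# Simplified illustration of unwrapping a cast in a comparison.
# Hypothetical names throughout; this is not DataFusion's API.

# Value ranges of the signed integer types used in this sketch.
INT_RANGES = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "Int32": (-2**31, 2**31 - 1),
    "Int64": (-2**63, 2**63 - 1),
}


def unwrap_cast_in_comparison(column_type, literal_value):
    """Try to rewrite `CAST(col AS wider) = literal` as `col = CAST(literal AS column_type)`.

    Returns (column_type, literal_value) when the literal fits in the column's
    type, so the column can be compared without being cast. Returns None when
    the literal does not fit; in this sketch the cast on the column then stays.
    """
    lo, hi = INT_RANGES[column_type]
    if lo <= literal_value <= hi:
        return (column_type, literal_value)
    return None


# The case from the plan above: an Int8 column compared with the literal 10.
print(unwrap_cast_in_comparison("Int8", 10))
# A literal that does not fit in Int8, so this sketch leaves the cast alone.
print(unwrap_cast_in_comparison("Int8", 300))
```

   With a rewrite like this the per-row cast on the column disappears and only 
the constant is converted, once, at planning time.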
   
   There may be more complex examples (e.g. with nested types) where the 
cast can't be unwrapped onto the constant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

