comphead commented on issue #1789:
URL: 
https://github.com/apache/datafusion-comet/issues/1789#issuecomment-2920846368

   So the issue comes down to the reader: Spark can adjust the Parquet schema on
   the fly, whereas DataFusion treats it as static. To reproduce the problem:
   
   Create some data
   ```
     val q = """
             | select map(str0, str1) c0 from
             | (
             |   select named_struct('a', cast(3 as long), 'b', cast(4 as long), 'c', cast(5 as long)) str0,
             |          named_struct('x', cast(6 as long), 'y', 'abc', 'z', cast(8 as long)) str1 union all
             |   select named_struct('a', cast(31 as long), 'b', cast(41 as long), 'c', cast(51 as long)), null
             | )
             |""".stripMargin
   
   spark.sql(q).repartition(1).write.parquet("/tmp/t1")
   ```
   
   Check the Parquet file metadata. It shows that the fields `x, y, z` are all
   required, but the enclosing group `value` is optional, which is quite weird.
   The Spark schema in the properties likewise marks the fields of `value` as
   not nullable, although the map contains a null value.
   ```
   parquet meta /tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet
   
   File path:  /tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet
   Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
   Properties:
                      org.apache.spark.version: 3.5.5
     org.apache.spark.sql.parquet.row.metadata: 
{"type":"struct","fields":[{"name":"c0","type":{"type":"map","keyType":{"type":"struct","fields":[{"name":"a","type":"long","nullable":false,"metadata":{}},{"name":"b","type":"long","nullable":false,"metadata":{}},{"name":"c","type":"long","nullable":false,"metadata":{}}]},"valueType":{"type":"struct","fields":[{"name":"x","type":"long","nullable":false,"metadata":{}},{"name":"y","type":"string","nullable":false,"metadata":{}},{"name":"z","type":"long","nullable":false,"metadata":{}}]},"valueContainsNull":true},"nullable":false,"metadata":{}}]}
   Schema:
   message spark_schema {
     required group c0 (MAP) {
       repeated group key_value {
         required group key {
           required int64 a;
           required int64 b;
           required int64 c;
         }
         optional group value {
           required int64 x;
           required binary y (STRING);
           required int64 z;
         }
       }
     }
   }
   ```
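
For context, a required field nested under an optional group is legal in Parquet, because nulls are encoded through definition levels: the maximum definition level of a column is the number of optional or repeated fields on its path. The sketch below computes that number for two columns of the schema above. `Repetition`, `PathStep` and `DefinitionLevels` are hypothetical names made up for this illustration, not part of parquet-mr, Spark or DataFusion.

```scala
// Hypothetical mini-model of a Parquet column path; not the parquet-mr API.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

final case class PathStep(name: String, repetition: Repetition)

object DefinitionLevels {
  // Max definition level = number of optional or repeated steps on the path.
  def maxDefinitionLevel(path: List[PathStep]): Int =
    path.count(step => step.repetition != Required)
}

object DefinitionLevelsDemo {
  // Path to c0.key_value.value.x, taken from the schema printed above.
  val valueX: List[PathStep] = List(
    PathStep("c0", Required),
    PathStep("key_value", Repeated),
    PathStep("value", Optional),
    PathStep("x", Required)
  )

  // Path to c0.key_value.key.a, taken from the schema printed above.
  val keyA: List[PathStep] = List(
    PathStep("c0", Required),
    PathStep("key_value", Repeated),
    PathStep("key", Required),
    PathStep("a", Required)
  )
}
```

Under this model `valueX` has a maximum definition level of 2 and `keyA` of 1, so a stored level below 2 for `x` is how a missing `value` can be represented even though `x` itself is required.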
   Read the file through Spark and print the schema. Note that `x, y, z` are now
   nullable, although they are required in the Parquet file.
   ```
   scala> spark.read.parquet("/tmp/t1").printSchema
   root
    |-- c0: map (nullable = true)
    |    |-- key: struct
    |    |    |-- a: long (nullable = true)
    |    |    |-- b: long (nullable = true)
    |    |    |-- c: long (nullable = true)
    |    |-- value: struct (valueContainsNull = true)
    |    |    |-- x: long (nullable = true)
    |    |    |-- y: string (nullable = true)
    |    |    |-- z: long (nullable = true)
   ```
   
   I assume Spark is smart enough to infer the nullability of `x, y, z` from the
   fact that `value` is optional in the Parquet file, and tweaks the schema
   accordingly. DataFusion cannot do that, AFAIK.
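
A minimal sketch of the adjustment assumed above: a field is treated as nullable if it is nullable itself or if any enclosing group is. `Node`, `Leaf`, `Group` and `NullabilitySketch` are hypothetical types invented for this illustration, and they are not how Spark or DataFusion represent schemas. Note that in the schema printed by Spark, `a, b, c` and `c0` also come back as nullable even though they are required in the file, so Spark may be applying a broader relaxation than this rule; the sketch only models the narrower inference.

```scala
// Hypothetical schema model for illustration only; not Spark or DataFusion types.
sealed trait Node {
  def name: String
  def nullable: Boolean
}
final case class Leaf(name: String, nullable: Boolean) extends Node
final case class Group(name: String, nullable: Boolean, children: List[Node]) extends Node

object NullabilitySketch {
  // Propagate nullability downwards: a child of a nullable group becomes nullable.
  def relax(node: Node, parentNullable: Boolean = false): Node = node match {
    case Leaf(name, nullable) =>
      Leaf(name, nullable || parentNullable)
    case Group(name, nullable, children) =>
      val effective = nullable || parentNullable
      Group(name, effective, children.map(child => relax(child, effective)))
  }
}
```

Applied to a group like `value` (optional, with required `x, y, z`), `relax` marks all three children as nullable, while a fully required group like `key` is returned unchanged.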


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

