comphead commented on issue #1789: URL: https://github.com/apache/datafusion-comet/issues/1789#issuecomment-2920846368
So the issue comes down to the reader: Spark can adapt to the Parquet schema on the fly, whereas DataFusion's schema handling is still static.

To reproduce the problem, create some data:

```scala
val q = """
  | select map(str0, str1) c0 from
  | (
  | select named_struct('a', cast(3 as long), 'b', cast(4 as long), 'c', cast(5 as long)) str0,
  | named_struct('x', cast(6 as long), 'y', 'abc', 'z', cast(8 as long)) str1 union all
  | select named_struct('a', cast(31 as long), 'b', cast(41 as long), 'c', cast(51 as long)), null
  | )
  |""".stripMargin

spark.sql(q).repartition(1).write.parquet("/tmp/t1")
```

Check the Parquet file metadata. It shows that the fields `x`, `y`, `z` are all `required`, while the enclosing group `value` is `optional`, which is quite odd. The Spark schema stored in the file properties also marks the `value` struct's fields as not nullable, although the data contains a null:

```
parquet meta /tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet

File path:  /tmp/t1/part-00000-340a4bdf-2e2c-42a8-a38a-01b47ab7d3c0-c000.snappy.parquet
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties:
  org.apache.spark.version: 3.5.5
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"c0","type":{"type":"map","keyType":{"type":"struct","fields":[{"name":"a","type":"long","nullable":false,"metadata":{}},{"name":"b","type":"long","nullable":false,"metadata":{}},{"name":"c","type":"long","nullable":false,"metadata":{}}]},"valueType":{"type":"struct","fields":[{"name":"x","type":"long","nullable":false,"metadata":{}},{"name":"y","type":"string","nullable":false,"metadata":{}},{"name":"z","type":"long","nullable":false,"metadata":{}}]},"valueContainsNull":true},"nullable":false,"metadata":{}}]}

Schema:
message spark_schema {
  required group c0 (MAP) {
    repeated group key_value {
      required group key {
        required int64 a;
        required int64 b;
        required int64 c;
      }
      optional group value {
        required int64 x;
        required binary y (STRING);
        required int64 z;
      }
    }
  }
}
```

Reading the file through
Spark and print the schema. Note that `x`, `y`, `z` are nullable now, although they are not in the Parquet file:

```
scala> spark.read.parquet("/tmp/t1").printSchema
root
 |-- c0: map (nullable = true)
 |    |-- key: struct
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: long (nullable = true)
```

I assume Spark is smart enough to infer the nullability of `x`, `y`, `z` from the fact that `value` is optional in the Parquet file and tweaks the schema accordingly. DataFusion cannot do that, AFAIK.
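What Spark does here can be sketched with a tiny stand-in model: a `required` leaf under an `optional` group can still be absent whenever the group itself is null, so the safe reader-side move is to relax every nested field to nullable. The following is a hedged illustration in plain Scala using hypothetical toy types, not Spark's `StructType` or DataFusion's actual schema API:

```scala
// Toy schema model (hypothetical types, not Spark's StructType API).
sealed trait DType
case class Leaf(primitive: String) extends DType
case class Struct(fields: List[Field]) extends DType
case class Field(name: String, dtype: DType, nullable: Boolean)

// A required leaf under an optional ancestor is still effectively nullable,
// so relax every field to nullable, matching what Spark's printed schema shows.
def forceNullable(t: DType): DType = t match {
  case Struct(fs) =>
    Struct(fs.map(f => f.copy(dtype = forceNullable(f.dtype), nullable = true)))
  case leaf => leaf
}

// The `value` group from the file above: x, y, z are required in Parquet terms.
val value = Struct(List(
  Field("x", Leaf("long"),   nullable = false),
  Field("y", Leaf("string"), nullable = false),
  Field("z", Leaf("long"),   nullable = false)))

val relaxed = forceNullable(value).asInstanceOf[Struct]
println(relaxed.fields.map(f => s"${f.name}: ${f.nullable}").mkString(", "))
// x: true, y: true, z: true
```

A fix on the DataFusion side would presumably need a similar relaxation step when mapping the Parquet file schema to its static schema.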