Kontinuation commented on issue #1844:
URL: https://github.com/apache/datafusion-comet/issues/1844#issuecomment-2940507433

   > Possibly related to [#1823](https://github.com/apache/datafusion-comet/issues/1823) (i.e., not necessarily Delta specific)?
   
   Probably; the stack trace and error message look very similar.
   
   In our case, the exception was thrown when reading Delta logs, which are a bunch of JSON files. Delta uses a custom schema covering all types of Delta transactions, and the resulting DataFrame contains rows with null struct fields. Shuffle-writing these rows throws the exception.
   
   We have constructed a minimal repro that does not require Delta:
   
   ```scala
   import java.nio.file.{Files, Paths}
   import org.apache.spark.sql.types._

   val testData = "{}\n"
   val path = Paths.get(dir.toString, "test.json")
   Files.write(path, testData.getBytes)

   // Define the nested struct schema
   val readSchema = StructType(
     Array(
       StructField(
         "metaData",
         StructType(
           Array(StructField(
             "format",
             StructType(Array(StructField("provider", StringType, nullable = true))),
             nullable = true))),
         nullable = true)))

   // Read JSON with the custom schema and repartition.
   // The repartitioned data contains null structs.
   val df = spark.read.format("json").schema(readSchema).load(path.toString).repartition(2)
   df.show() // <-- throws exception
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

