Hi Team,

I have a use case of writing data into an Iceberg table using Spark. The table has 3 partition columns (file_date, city, creation_date) and is additionally bucketed by user_id into 4 buckets.
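For context, here is a minimal PySpark sketch of the DDL that would produce this layout. The catalog, table, and column names/types are my assumptions (simplified from the real schema), and it presumes an Iceberg catalog named my_catalog is already configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DDL matching the partition spec quoted below
    # (catalog/table/column names and types are assumptions):
    spark.sql("""
        CREATE TABLE my_catalog.db.events (
            creation_date DATE,
            user_id       BIGINT,
            file_date     DATE,
            city          STRING
        )
        USING iceberg
        PARTITIONED BY (file_date, city, creation_date, bucket(4, user_id))
    """)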
"partition-specs" : [ { "spec-id" : 0, "fields" : [ { "name" : "file_date", "transform" : "identity", "source-id" : 4, "field-id" : 1000 }, { "name" : "city", "transform" : "identity", "source-id" : 8, "field-id" : 1001 }, { "name" : "creation_date", "transform" : "identity", "source-id" : 1, "field-id" : 1002 }, { "name" : "user_id_bucket", "transform" : "bucket[4]", "source-id" : 2, "field-id" : 1003 } ] } ], Please refer to the attachment. It seems like when spark.sql.adaptive.advisoryPartitionSizeInBytes (1GB) and Shuffle Read Size (24.3 GB) are constants and, file size increases with the spark.sql.shuffle.partitions value. Please note that i have set spark.sql.adaptive.advisoryPartitionSizeInBytes = 1000000000 Only for the write operation. all the other spark.sql.adaptive configs are set to default. How can we explain the behavior of average file size written by Spark in the iceberg table? Spark version - Spark 3.5.0 Iceberg version - Iceberg 1.7.1-amzn-0 Regards, Pathum Wijethunge.