Hi Team,

I have a use case of writing data into an Iceberg table using Spark.
The table has 3 identity partition columns (file_date, city, creation_date)
and is additionally bucketed on user_id into 4 buckets.



  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "file_date",
      "transform" : "identity",
      "source-id" : 4,
      "field-id" : 1000
    }, {
      "name" : "city",
      "transform" : "identity",
      "source-id" : 8,
      "field-id" : 1001
    }, {
      "name" : "creation_date",
      "transform" : "identity",
      "source-id" : 1,
      "field-id" : 1002
    }, {
      "name" : "user_id_bucket",
      "transform" : "bucket[4]",
      "source-id" : 2,
      "field-id" : 1003
    } ]
  } ],
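
For reference, a table with this partition spec could be created roughly as
follows (a minimal sketch; the catalog, database, table name, and column
types are assumptions, not taken from the actual table):

  // Hypothetical DDL reproducing the spec above: three identity
  // partitions plus a 4-way bucket on user_id.
  spark.sql("""
    CREATE TABLE my_catalog.db.my_table (
      creation_date date,
      user_id       bigint,
      file_date     date,
      city          string
      -- other columns elided
    )
    USING iceberg
    PARTITIONED BY (file_date, city, creation_date, bucket(4, user_id))
  """)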

Please refer to the attachment.

It seems that while spark.sql.adaptive.advisoryPartitionSizeInBytes (1 GB)
and the Shuffle Read Size (24.3 GB) stay constant, the average file size
increases with the spark.sql.shuffle.partitions value. Since AQE coalesces
shuffle partitions toward the advisory size, I would have expected roughly
24.3 GB / 1 GB ≈ 25 output tasks, and therefore a roughly constant file
size, regardless of spark.sql.shuffle.partitions.

Please note that I have set
spark.sql.adaptive.advisoryPartitionSizeInBytes = 1000000000
only for the write operation.

All the other spark.sql.adaptive configs are left at their defaults.
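
For completeness, the write is done roughly like this (a minimal Scala
sketch; the DataFrame variable and target table name are assumptions):

  // Raise the advisory size to ~1 GB only around the Iceberg write.
  spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1000000000")

  // Append into the partitioned Iceberg table via the DataFrameWriterV2 API.
  df.writeTo("my_catalog.db.my_table").append()

  // Revert to the session default so other queries are unaffected.
  spark.conf.unset("spark.sql.adaptive.advisoryPartitionSizeInBytes")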



How can we explain this behavior of the average file size written by Spark
to the Iceberg table?

Spark version - Spark 3.5.0
Iceberg version - Iceberg 1.7.1-amzn-0


Regards,

Pathum Wijethunge.
