Hi Team,

I have a use case of writing data into an Iceberg table using Spark.
The table has 3 identity partition columns (file_date, city, creation_date)
and is additionally bucketed on user_id into 4 buckets.



  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "file_date",
      "transform" : "identity",
      "source-id" : 4,
      "field-id" : 1000
    }, {
      "name" : "city",
      "transform" : "identity",
      "source-id" : 8,
      "field-id" : 1001
    }, {
      "name" : "creation_date",
      "transform" : "identity",
      "source-id" : 1,
      "field-id" : 1002
    }, {
      "name" : "user_id_bucket",
      "transform" : "bucket[4]",
      "source-id" : 2,
      "field-id" : 1003
    } ]
  } ],
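
For reference, a table with this partition spec could be created roughly as
follows (a minimal sketch; the catalog, database, table name, and column
types are assumptions, not taken from the actual table):

  // Hypothetical DDL reproducing the spec above: three identity
  // partitions plus a 4-way bucket on user_id.
  spark.sql("""
    CREATE TABLE my_catalog.db.my_table (
      creation_date date,
      user_id       bigint,
      file_date     date,
      city          string
      -- other columns elided
    )
    USING iceberg
    PARTITIONED BY (file_date, city, creation_date, bucket(4, user_id))
  """)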

Please refer to the attachment.

It seems that while spark.sql.adaptive.advisoryPartitionSizeInBytes (1 GB)
and the Shuffle Read Size (24.3 GB) stay constant, the average file size
increases with the spark.sql.shuffle.partitions value. Since AQE coalesces
shuffle partitions toward the advisory size, I would have expected roughly
24.3 GB / 1 GB ≈ 25 output tasks, and therefore a roughly constant file
size, regardless of spark.sql.shuffle.partitions.

Please note that I have set
spark.sql.adaptive.advisoryPartitionSizeInBytes = 1000000000
only for the write operation.

All the other spark.sql.adaptive configs are left at their defaults.
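
For completeness, the write is done roughly like this (a minimal Scala
sketch; the DataFrame variable and target table name are assumptions):

  // Raise the advisory size to ~1 GB only around the Iceberg write.
  spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "1000000000")

  // Append into the partitioned Iceberg table via the DataFrameWriterV2 API.
  df.writeTo("my_catalog.db.my_table").append()

  // Revert to the session default so other queries are unaffected.
  spark.conf.unset("spark.sql.adaptive.advisoryPartitionSizeInBytes")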



How can we explain this behavior of the average file size written by Spark
to the Iceberg table?

Spark version - Spark 3.5.0
Iceberg version - Iceberg 1.7.1-amzn-0


Regards,

Pathum Wijethunge.
