Re: repartition before writing to table with bucketed partitioning

2024-12-01 Thread Soumasish
I might have misunderstood the issue. Spark will indeed repartition the data while writing; what it won't do is write precisely 10 files inside each date partition folder, sorted by col x. Typically this kind of fine-grained write configuration is useful if there's a downstream consumer that will use the output …
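
For later readers: with Iceberg, the closest lever for this kind of layout control is the table's write distribution and sort order, not an exact per-partition file count. A minimal sketch, assuming an active SparkSession named spark and the Iceberg Spark SQL extensions enabled; the table name follows the thread:

  // Sketch: requires spark.sql.extensions =
  //   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  // Cluster rows by the table's partition spec at write time, and sort
  // rows within each output file by x (this does not pin a file count).
  spark.sql(
    "ALTER TABLE some_table WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY x")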

Re: repartition before writing to table with bucketed partitioning

2024-12-01 Thread Henryk Česnolovič
Ok, never mind. It seems we don't need to repartition; Spark handles it itself. df.writeTo("some_table").partitionedBy(col("date"), col("x"), bucket(10, col("y"))).using("iceberg").createOrReplace() or, later, df.writeTo("some_table").append(). Spark understands …
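
Piecing the snippet together, a minimal runnable sketch of the write path being described, assuming a Spark session with an Iceberg catalog configured so that some_table resolves to a v2 table; the sample data and schema are invented for illustration:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{bucket, col}

  object BucketedWrite extends App {
    val spark = SparkSession.builder()
      .appName("BucketedWrite")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Invented sample data; the real schema isn't shown in the thread.
    val df = Seq(("2024-11-29", "a", 1L), ("2024-11-29", "b", 2L))
      .toDF("date", "x", "y")

    // Create (or replace) the table with a bucketed partition spec.
    df.writeTo("some_table")
      .partitionedBy(col("date"), col("x"), bucket(10, col("y")))
      .using("iceberg")
      .createOrReplace()

    // Later appends target the same spec; rows land in the right
    // (date, x, bucket(y)) partitions without a manual repartition.
    df.writeTo("some_table").append()
  }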

Re: repartition before writing to table with bucketed partitioning

2024-11-30 Thread Soumasish
Henryk, I could reproduce your issue and achieve the desired result using SQL DDL. Here's the workaround.

  package replicator

  import org.apache.spark.sql.SparkSession

  object Bucketing extends App {
    val spark = SparkSession.builder()
      .appName("ReproduceError")
      .master("local[*]")
      …
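
The snippet above is cut off in the archive. Purely as a hedged sketch of where a SQL DDL workaround typically goes from there; the schema and statement are my reconstruction, not necessarily Soumasish's actual code:

  val spark = SparkSession.builder()
    .appName("ReproduceError")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical continuation: declare the bucketed partition spec in
  // Iceberg DDL instead of partitionedBy(...), then append via the v2 API.
  spark.sql("""
    CREATE TABLE IF NOT EXISTS some_table (
      date DATE,
      x STRING,
      y BIGINT)
    USING iceberg
    PARTITIONED BY (date, x, bucket(10, y))
  """)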

repartition before writing to table with bucketed partitioning

2024-11-29 Thread Henryk Česnolovič
Hello. Maybe somebody has faced the same issue. I'm trying to write data to a table using the DataFrame API v2. The table is partitioned by buckets via df.writeTo("some_table").partitionedBy(col("date"), col("x"), bucket(10, col("y"))).using("iceberg").createOrReplace(). Can I somehow prepare df in …
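
For readers of the archive: one way to "prepare" the DataFrame is to repartition on the same transform the table uses. A sketch, given the df and spark from the question, and assuming Iceberg's FunctionCatalog is reachable under a catalog I'll call my_catalog (the name varies per setup):

  import org.apache.spark.sql.functions.{col, expr}

  // Sketch: Iceberg exposes its bucket transform as a SQL function in
  // its FunctionCatalog; "my_catalog" is an assumed catalog name.
  val prepared = df.repartition(
    col("date"),
    col("x"),
    expr("my_catalog.system.bucket(10, y)"))

  prepared.writeTo("some_table").append()

Note that setting the Iceberg table property write.distribution-mode=hash asks the writer to cluster rows like this automatically, which matches what the replies above observe.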