You refer to df.write.partitionBy, which creates one directory per value of "col" and, in the worst case, writes one file per DataFrame partition into each of those directories. So the number of output files is controlled by the cardinality of "col", which depends on your data and is hence out of your control, and by the number of partitions of your DataFrame.
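
For illustration, the baseline write that behaves this way might look like the following sketch (ds, "col" and the output path are placeholders, not taken from your job):

    // worst case: every DataFrame partition writes one file into
    // every "col" directory it holds data for, so output files can
    // be up to (number of partitions) x (distinct "col" values)
    ds.write
      .partitionBy("col")
      .parquet("data.parquet")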

The only way to change the number of DataFrame partitions without repartitioning / shuffling all the data is to use coalesce (as you already mentioned in an earlier post).
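
A minimal sketch of that coalesce approach (the target of 100 partitions is an arbitrary example value):

    // coalesce merges existing partitions without a shuffle; it can
    // only reduce the partition count, and since partitions are not
    // aligned with "col", a skewed "col" value can still be spread
    // across several output files
    ds.coalesce(100)
      .write
      .partitionBy("col")
      .parquet("data.parquet")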

Repartitioning the DataFrame by the same column that you partitionBy will produce a single file per col1-partition:

    ds.repartition(100, $"col1")
      .write
      .partitionBy("col1")
      .parquet("data.parquet")

col1 values with a lot of data will have a large file, and col1 values with little data will have a small file.

If evenly sized files are of great value to you, a repartition / shuffle, or even a range partition, might pay off:

    ds.repartitionByRange(100, $"col1", $"col2")
      .write
      .partitionBy("col1")
      .parquet("data.parquet")

This will give you roughly equal-size files (given that (col1, col2) is evenly distributed), with many files for large col1-partitions and few files for small col1-partitions.

You can even emulate some kind of bucketing with:

    ds.withColumn("month", month($"timestamp"))
      .withColumn("year", year($"timestamp"))
      .repartitionByRange(100, $"year", $"month", $"id", $"time")
      .write
      .partitionBy("year", "month")
      .parquet("data.parquet")

Files will have similar sizes, while large months get more files than small months.

https://github.com/G-Research/spark-extension/blob/master/PARTITIONING.md

Enrico


On 04.06.22 at 18:44, Nikhil Goyal wrote:
Hi all,

Is there a way to use dataframe.partitionBy("col") and control the number of output files without doing a full repartition? The thing is some partitions have more data while some have less. Doing a .repartition is a costly operation. We want to control the size of the output files. Is it even possible?

Thanks
