You refer to df.write.partitionBy, which creates a directory for each value of "col"
and, in the worst case, writes one file per DataFrame partition into each of them.
So the number of output files is controlled by the cardinality of "col",
which is determined by your data and hence out of your control, and by the
number of partitions of your DataFrame.
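To illustrate the worst case, here is a minimal sketch (assuming a DataFrame ds
with a column col1; the names are just placeholders):

  // no repartitioning: every DataFrame partition may contain every col1 value,
  // so each partition can write one file into each col1 directory
  ds.write
    .partitionBy("col1")
    .parquet("data.parquet")
  // worst-case file count: ds.rdd.getNumPartitions * (number of distinct col1 values)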
The only way to change the number of DataFrame partitions without
repartitioning / shuffling all the data is to use coalesce (as you already
mentioned in an earlier post).
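As a rough sketch (again with the placeholder names from above), coalesce only
merges existing partitions, so it caps the per-directory file count without a shuffle:

  // merge the existing partitions down to 10 without a shuffle;
  // at most 10 files are then written per col1 directory
  ds.coalesce(10)
    .write
    .partitionBy("col1")
    .parquet("data.parquet")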
Repartitioning the DataFrame by the same column that you partitionBy on will
output a single file per col1 partition:
|ds.repartition(100, $"col1") .write .partitionBy("col1")
.parquet("data.parquet")|
Large col1 values with a lot of data will get a large file, and col1 values
with little data will get a small file.
If even-sized files are of great value to you, a repartition / shuffle or
even a range partition might pay off:
  ds.repartitionByRange(100, $"col1", $"col2")
    .write
    .partitionBy("col1")
    .parquet("data.parquet")
This will give you equally sized files (given that (col1, col2) has an even
distribution), with many files for large col1 partitions and few files
for small col1 partitions.
You can even emulate some kind of bucketing with:
|ds|||.withColumn("month", month($"timestamp")) |.withColumn("year",
year($"timestamp")) .repartitionByRange(100, $"year", $"month", $"id",
$"time") .write .partitionBy("year", "month") .parquet("data.parquet")|
Files will have a similar size, while large months will get more files than
small months.
https://github.com/G-Research/spark-extension/blob/master/PARTITIONING.md
Enrico
On 04.06.22 at 18:44, Nikhil Goyal wrote:
Hi all,
Is there a way to use dataframe.partitionBy("col") and control the
number of output files without doing a full repartition? The thing is
some partitions have more data while some have less. Doing a
.repartition is a costly operation. We want to control the size of the
output files. Is it even possible?
Thanks