You refer to df.write.partitionBy, which creates a directory for each value of "col"
and, in the worst case, writes one file per DataFrame partition into each of them.
So the number of output files is controlled by the cardinality of "col",
which is determined by your data and hence out of your control, and by the
number of partitions of your DataFrame.
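To illustrate the worst case, here is a minimal sketch (assuming a DataFrame ds
with a column col1; the names are just placeholders):

  // no repartitioning: every DataFrame partition may contain every col1 value,
  // so each partition can write one file into each col1 directory
  ds.write
    .partitionBy("col1")
    .parquet("data.parquet")
  // worst-case file count: ds.rdd.getNumPartitions * (number of distinct col1 values)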
The only way to change the number of DataFrame partitions without
repartitioning / shuffling all the data is to use coalesce (as you already
mentioned in an earlier post).
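As a rough sketch (again with the placeholder names from above), coalesce only
merges existing partitions, so it caps the per-directory file count without a shuffle:

  // merge the existing partitions down to 10 without a shuffle;
  // at most 10 files are then written per col1 directory
  ds.coalesce(10)
    .write
    .partitionBy("col1")
    .parquet("data.parquet")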
Repartitioning the DataFrame by the same column that you partitionBy on will
output a single file per col1 partition:
|ds.repartition(100, $"col1") .write .partitionBy("col1")
.parquet("data.parquet")|
Large col1 values with a lot of data will get a large file, and col1 values
with little data will get a small file.
If even-sized files are of great value to you, a repartition / shuffle or
even a range partition might pay off:
  ds.repartitionByRange(100, $"col1", $"col2")
    .write
    .partitionBy("col1")
    .parquet("data.parquet")
This will give you equally sized files (given that (col1, col2) has an even
distribution), with many files for large col1 partitions and few files
for small col1 partitions.
You can even emulate some kind of bucketing with:
|ds|||.withColumn("month", month($"timestamp")) |.withColumn("year",
year($"timestamp")) .repartitionByRange(100, $"year", $"month", $"id",
$"time") .write .partitionBy("year", "month") .parquet("data.parquet")|
Files will have a similar size, while large months will get more files than
small months.
https://github.com/G-Research/spark-extension/blob/master/PARTITIONING.md
Enrico
On 04.06.22 at 18:44, Nikhil Goyal wrote:
Hi all,
Is there a way to use dataframe.partitionBy("col") and control the
number of output files without doing a full repartition? The thing is
some partitions have more data while some have less. Doing a
.repartition is a costly operation. We want to control the size of the
output files. Is it even possible?
Thanks