You are referring to df.write.partitionBy, which creates one directory for
each value of "col" and, in the worst case, writes one file per DataFrame
partition into each of those directories. So the number of output files is
determined by the cardinality of "col", which depends on your data and is
therefore outside your control, and by the number of partitions of your
DataFrame.
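
To make the counting concrete, here is a small plain-Python model of that
behaviour. It is only a sketch and not Spark code: the function name
count_output_files and the sample data are made up for illustration, and
it assumes each DataFrame partition writes exactly one file for every
distinct value of "col" it contains.

```python
from collections import defaultdict


def count_output_files(partitions, key):
    """Return {value: number of files} under the assumed write model.

    `partitions` is a list of DataFrame partitions, each a list of row dicts.
    Assumption: every partition writes one file per distinct `key` value
    it holds, into the directory for that value.
    """
    files_per_value = defaultdict(int)
    for rows in partitions:
        # Distinct values of the partition column inside this partition.
        for value in {row[key] for row in rows}:
            files_per_value[value] += 1
    return dict(files_per_value)


# Hypothetical data: 3 DataFrame partitions, "col" has 2 distinct values.
partitions = [
    [{"col": "a"}, {"col": "b"}],
    [{"col": "a"}],
    [{"col": "a"}, {"col": "b"}],
]

files = count_output_files(partitions, "col")
total_files = sum(files.values())
# Upper bound: cardinality of "col" times the number of partitions.
worst_case = len(files) * len(partitions)
```

In this made-up example the directory for "a" gets 3 files and the one
for "b" gets 2, so 5 files in total against a worst case of 2 * 3 = 6.
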
Hi all,
Is there a way to use dataframe.partitionBy("col") and control the number
of output files without doing a full repartition? The thing is some
partitions have more data while some have less. Doing a .repartition is a
costly operation. We want to control the size of the output files. Is it
e