Nikhil,
What are you trying to achieve with this in the first place? What are
your goals? What is the problem with your approach?
Are you concerned about the 1000 files in each written col2-partition?
write.partitionBy is something different than df.repartition or
df.coalesce: it controls the directory layout of the written output,
not the in-memory partitions of the DataFrame.
The DataFrame partitions are sorted *before* they are written with partitionBy.
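To make the order of operations concrete, here is a toy model in plain Python. It is only a sketch and not Spark's implementation: the helper names (split_into_partitions, sort_within_partitions, write_partition_by) are made up for illustration, the round-robin split merely stands in for "the DataFrame has N partitions", and it assumes that rows keep their relative order when a partition is split by col2 at write time.

```python
from collections import defaultdict

def split_into_partitions(rows, n):
    # Stand-in for "the DataFrame has n partitions" (round-robin, for simplicity).
    parts = [rows[i::n] for i in range(n)]
    return [p for p in parts if p]

def sort_within_partitions(parts, key):
    # Each in-memory partition is sorted on its own; there is no global sort.
    return [sorted(p, key=key) for p in parts]

def write_partition_by(parts, key):
    # Each in-memory partition writes one file per distinct key value it holds.
    # Assumption of this model: rows keep their relative order inside each file.
    layout = defaultdict(list)
    for part in parts:
        files = defaultdict(list)
        for row in part:
            files[key(row)].append(row)
        for value, file_rows in files.items():
            layout[value].append(file_rows)
    return dict(layout)

# Rows are (col1, col2) tuples, deliberately unsorted on col1.
rows = [(c1, c2) for c1 in range(6, 0, -1) for c2 in ("a", "b")]

parts = split_into_partitions(rows, 3)                        # like coalesce(3)
parts = sort_within_partitions(parts, key=lambda r: r[0])     # sort by col1
layout = write_partition_by(parts, key=lambda r: r[1])        # partitionBy col2

for value, files in sorted(layout.items()):
    print(f"col2={value}: {len(files)} files")
```

In this model every col2 directory ends up with one file per in-memory partition (3 here, up to 1000 in your case), and the sort happens before the rows are split into the col2 directories.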
Enrico
On 03.06.22 at 16:13, Nikhil Goyal wrote:
Hi folks,
We are trying to do
`df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`
I do see that coalesce 1000 is applied for every sub partition. But I
wanted to know if sortWithinPartitions(col1) works after applying
partitionBy or before? Basically, would Spark first partition by col2
and then sort by col1, or sort first and then partition?
Thanks
Nikhil