Nikhil,

What are you trying to achieve with this in the first place? What are your goals? What is the problem with your approach?

Are you concerned about the 1000 files in each written col2-partition?

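write.partitionBy is something different from df.repartition or df.coalesce: it only controls the directory layout of the written output (one directory per col2 value), whereas repartition and coalesce control the number of in-memory partitions of the DataFrame.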

In your pipeline, the df partitions are sorted by col1 *before* they are written; partitionBy only comes into play afterwards, when each partition is written out.
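To illustrate where the up to 1000 files per col2 directory come from, here is a small pure-Python toy model (not Spark itself). It assumes that each in-memory partition writes one file into every col2 directory for which it holds at least one row, and that no further file splitting is configured. The function name `files_per_directory` and the sample rows are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def files_per_directory(partitions, partition_key="col2"):
    """Toy model: count the files each output directory would receive.

    Assumption: every in-memory partition writes exactly one file per
    distinct value of partition_key that occurs in that partition.
    """
    counts = defaultdict(int)
    for part in partitions:
        # Distinct directory values present in this one in-memory partition.
        distinct_values = {row[partition_key] for row in part}
        for value in distinct_values:
            counts[value] += 1
    return dict(counts)


# Hypothetical data: 3 in-memory partitions, a stand-in for coalesce(3).
partitions = [
    [{"col1": 5, "col2": "a"}, {"col1": 1, "col2": "b"}],
    [{"col1": 3, "col2": "a"}, {"col1": 2, "col2": "a"}],
    [{"col1": 4, "col2": "b"}, {"col1": 0, "col2": "a"}],
]

print(sorted(files_per_directory(partitions).items()))
# prints [('a', 3), ('b', 2)]
```

With coalesce(1000), the same reasoning gives at most 1000 files in each col2 directory, and fewer for a col2 value that does not occur in every in-memory partition.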

Enrico


On 03.06.22 at 16:13, Nikhil Goyal wrote:
Hi folks,

We are trying to do
`df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`

I do see that coalesce 1000 is applied for every sub-partition. But I wanted to know whether sortWithinPartitions(col1) works after applying partitionBy or before. Basically, would Spark first partitionBy col2 and then sort by col1, or sort first and then partition?

Thanks
Nikhil
