Re: PartitionBy and SortWithinPartitions

Enrico Minack Fri, 03 Jun 2022 09:38:42 -0700

Nikhil,

What are you trying to achieve with this in the first place? What are your goals? What is the problem with your approach?


Are you concerned about the 1000 files in each written col2-partition?

write.partitionBy is something different from df.repartition or df.coalesce.


The df partitions are sorted *before* partitionBy-writing them.
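
To make the order of operations concrete, here is a minimal plain-Python sketch of the pipeline. It is a toy model, not Spark code: the helper names (coalesce, sort_within_partitions, write_partition_by) and the sample rows are made up for illustration, and it assumes the writer keeps the row order it receives within each task, which is worth checking on real output files.

```python
from collections import defaultdict


def coalesce(rows, num_partitions):
    """Toy stand-in for df.coalesce: cut rows into contiguous in-memory partitions."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def sort_within_partitions(partitions, key):
    """Toy stand-in for df.sortWithinPartitions: sort each partition on its own."""
    return [sorted(part, key=lambda row: row[key]) for part in partitions]


def write_partition_by(partitions, key):
    """Toy stand-in for write.partitionBy: every in-memory partition emits
    one file per distinct key value it contains."""
    directories = defaultdict(list)  # directory name -> list of files
    for part in partitions:
        per_value = defaultdict(list)
        for row in part:  # rows arrive already sorted by col1
            per_value[row[key]].append(row)
        for value, file_rows in per_value.items():
            directories[f"{key}={value}"].append(file_rows)
    return dict(directories)


# Hypothetical sample data: (col1, col2) pairs.
pairs = [(5, "a"), (3, "b"), (9, "a"), (1, "b"), (4, "a"), (8, "b"), (2, "a"), (7, "b")]
rows = [{"col1": c1, "col2": c2} for c1, c2 in pairs]

# Same order as in the question: coalesce, then sort, then partitioned write.
partitions = sort_within_partitions(coalesce(rows, 2), "col1")
output = write_partition_by(partitions, "col2")

for directory, files in sorted(output.items()):
    print(directory, [[row["col1"] for row in f] for f in files])
```

With 2 in-memory partitions, each col2 directory ends up with 2 files in this model, one per in-memory partition that holds rows for that value. That is the same reason coalesce(1000) can give up to 1000 files per col2 directory. The sort happens per in-memory partition, so files are ordered by col1 individually, not across the whole directory.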

Enrico


On 03.06.22 at 16:13, Nikhil Goyal wrote:

Hi folks,

We are trying to do
`df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)`
I do see that coalesce 1000 is applied for every sub partition. But I wanted to know if sortWithinPartitions(col1) works after applying partitionBy or before? Basically, would Spark first partitionBy col2 and then sort by col1, or sort first and then partition?
Thanks
Nikhil
