Hi Enrico,

Thanks for replying. I want to partition by a column and then sort within those partitions by another column. DataFrameWriter has sortBy and bucketBy, but they require creating a new table (they can only be used with `saveAsTable`, not plain `save`). I could write another job on top that does the sorting, but that complicates the code. So is there a clever way to sort records after they have been partitioned?
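One idea I had (just a sketch, untested; col1 and col2 as in the quoted snippet below, output path is a placeholder) is to repartition by the write-partition column and make it the leading sort key, so rows end up ordered by col1 within every col2 group before the partitioned write:
`
# Untested sketch: shuffle so all rows sharing a col2 value land in the same
# task, then sort each task by col2 first and col1 second. Within every col2
# group the rows are then ordered by col1 when partitionBy writes them out.
(df.repartition("col2")
   .sortWithinPartitions("col2", "col1")
   .write.mode("overwrite")
   .partitionBy("col2")
   .parquet("/path/to/output"))  # placeholder path
`
Would something like this preserve the ordering in the written files, or does the write re-sort?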
Thanks,
Nikhil

On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack <i...@enrico.minack.dev> wrote:

> Nikhil,
>
> What are you trying to achieve with this in the first place? What are your
> goals? What is the problem with your approach?
>
> Are you concerned about the 1000 files in each written col2-partition?
>
> The write.partitionBy is something different from df.repartition or
> df.coalesce.
>
> The df partitions are sorted *before* partitionBy-writing them.
>
> Enrico
>
>
> On 03.06.22 at 16:13, Nikhil Goyal wrote:
>
> Hi folks,
>
> We are trying to do
> `
> df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
> `
>
> I do see that coalesce(1000) is applied for every sub-partition. But I
> wanted to know whether sortWithinPartitions("col1") runs after applying
> partitionBy or before. Basically, would Spark first partition by col2 and
> then sort by col1, or sort first and then partition?
>
> Thanks,
> Nikhil
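PS: restating the original snippet with my reading of Enrico's answer as comments (my understanding, not verified against the Spark internals; path is a placeholder):
`
# Order of operations in the original snippet, as I understand it:
(df.coalesce(1000)                 # 1000 in-memory Spark partitions, no shuffle
   .sortWithinPartitions("col1")   # each of those 1000 partitions sorted by col1
   .write.mode("overwrite")
   .partitionBy("col2")            # col2=.../ directory layout decided at write
   .parquet("/path/to/output"))    # time, i.e. after the sort, not before it
`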