Hi Enrico,

Thanks for replying. I want to partition by a column and then sort within those partitions by another column. DataFrameWriter has sortBy and bucketBy, but they require creating a new table (they can only be used with `saveAsTable`, not plain `save`). I could write another job on top that does the sorting, but that complicates the code. So is there a clever way to sort records after they have been partitioned?
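One idea I had (just a sketch, untested; col1 and col2 as in the quoted snippet below, output path is a placeholder) is to repartition by the write-partition column and make it the leading sort key, so rows end up ordered by col1 within every col2 group before the partitioned write:
`
# Untested sketch: shuffle so all rows sharing a col2 value land in the same
# task, then sort each task by col2 first and col1 second. Within every col2
# group the rows are then ordered by col1 when partitionBy writes them out.
(df.repartition("col2")
   .sortWithinPartitions("col2", "col1")
   .write.mode("overwrite")
   .partitionBy("col2")
   .parquet("/path/to/output"))  # placeholder path
`
Would something like this preserve the ordering in the written files, or does the write re-sort?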
Thanks,
Nikhil

On Fri, Jun 3, 2022 at 9:38 AM Enrico Minack <i...@enrico.minack.dev> wrote:

> Nikhil,
>
> What are you trying to achieve with this in the first place? What are your
> goals? What is the problem with your approach?
>
> Are you concerned about the 1000 files in each written col2-partition?
>
> The write.partitionBy is something different from df.repartition or
> df.coalesce.
>
> The df partitions are sorted *before* partitionBy-writing them.
>
> Enrico
>
>
> On 03.06.22 at 16:13, Nikhil Goyal wrote:
>
> Hi folks,
>
> We are trying to do
> `
> df.coalesce(1000).sortWithinPartitions("col1").write.mode('overwrite').partitionBy("col2").parquet(...)
> `
>
> I do see that coalesce(1000) is applied for every sub-partition. But I
> wanted to know whether sortWithinPartitions("col1") runs after applying
> partitionBy or before. Basically, would Spark first partition by col2 and
> then sort by col1, or sort first and then partition?
>
> Thanks,
> Nikhil
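PS: restating the original snippet with my reading of Enrico's answer as comments (my understanding, not verified against the Spark internals; path is a placeholder):
`
# Order of operations in the original snippet, as I understand it:
(df.coalesce(1000)                 # 1000 in-memory Spark partitions, no shuffle
   .sortWithinPartitions("col1")   # each of those 1000 partitions sorted by col1
   .write.mode("overwrite")
   .partitionBy("col2")            # col2=.../ directory layout decided at write
   .parquet("/path/to/output"))    # time, i.e. after the sort, not before it
`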