No, because you shuffle again each time (you always partition by a different window). Instead: could you choose a single window (w3, with more columns = finer granularity) and then filter out records to achieve the same result?
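Either way, the point is to pay the expensive shuffle only once, at the finest granularity. A rough sketch of one way to read that (my own interpretation, not necessarily what was meant by filtering): pre-aggregate once at the (col1, col2, col3) grain, roll the partial sums up since sum is associative, and attach the results back. The sums1/sums2/sums3 names are made up, and the broadcast joins assume the aggregated frames really are small, as stated in the question.

from pyspark.sql import functions as F

# Pre-aggregate once at the finest grain; the result is small as long as the
# number of distinct (col1, col2, col3) combinations is modest.
sums3 = df.groupBy('col1', 'col2', 'col3').agg(F.sum('val').alias('sum3'))

# Sum is associative, so the coarser totals can be rolled up from the partial
# sums instead of windowing the raw 2T of data again.
sums2 = sums3.groupBy('col1', 'col2').agg(F.sum('sum3').alias('sum2'))
sums1 = sums2.groupBy('col1').agg(F.sum('sum2').alias('sum1'))

# Attach the totals back to the full data; broadcast joins keep this cheap
# when the aggregated frames are small.
df = (df
      .join(F.broadcast(sums3), ['col1', 'col2', 'col3'])
      .join(F.broadcast(sums2), ['col1', 'col2'])
      .join(F.broadcast(sums1), ['col1']))

Since the pre-aggregated frames have one row per key combination, the joins back onto the raw data can usually be broadcast and avoid another full shuffle.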
Or instead: df.groupBy(a, b, c).agg(sort_array(collect_list(struct(foo, bar, baz)))) and then a UDF which performs your desired aggregation (a rough sketch is at the end of this message, below the quoted thread).

Best,
Georg

On Mon., Oct. 21, 2019 at 13:59, Rishi Shah <rishishah.s...@gmail.com> wrote:

> Hi All,
>
> Any suggestions?
>
> Thanks,
> -Rishi
>
> On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have a use case where I need to perform nested window functions on a
>> data frame to get a final set of columns. Example:
>>
>> w1 = Window.partitionBy('col1')
>> df = df.withColumn('sum1', F.sum('val').over(w1))
>>
>> w2 = Window.partitionBy('col1', 'col2')
>> df = df.withColumn('sum2', F.sum('val').over(w2))
>>
>> w3 = Window.partitionBy('col1', 'col2', 'col3')
>> df = df.withColumn('sum3', F.sum('val').over(w3))
>>
>> These 3 partitions are not huge at all, however the data size is 2T of
>> snappy-compressed parquet. This throws a lot of out-of-memory errors.
>>
>> I would like some advice on whether nested window functions are a good
>> idea in PySpark. I wanted to avoid using multiple filters + joins to get
>> to the final state, as joins can create a crazy shuffle.
>>
>> Any suggestions would be appreciated!
>>
>> --
>> Regards,
>>
>> Rishi Shah
>
>
> --
> Regards,
>
> Rishi Shah
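For reference, a minimal sketch of the collect_list + UDF pattern suggested above. The grouping keys a, b, c and the value columns foo, bar, baz are placeholders carried over from the pseudocode, and the UDF body just sums one field as a stand-in for whatever aggregation is actually needed:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(DoubleType())
def my_agg(rows):
    # 'rows' is the sorted list of structs for one (a, b, c) group;
    # replace this body with the real aggregation logic.
    return float(sum(r['baz'] for r in rows))

result = (df
          .groupBy('a', 'b', 'c')
          .agg(F.sort_array(
                   F.collect_list(F.struct('foo', 'bar', 'baz'))
               ).alias('rows'))
          .withColumn('agg_val', my_agg('rows')))

Note that collect_list materializes each whole group on a single executor, so this only helps if the individual groups are reasonably small.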