RE: forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Could you explain why this would work?
Assaf.
From: Haviv, Daniel [mailto:dha...@amazon.com]
Sent: Sunday, January 29, 2017 7:09 PM
To: Mendelson, Assaf
Cc: user@spark.apache.org
Subject: Re: forcing dataframe groupby partitioning
If there's no built-in local groupBy, you could do something

Re: forcing dataframe groupby partitioning

2017-01-29 Thread Haviv, Daniel
If there's no built-in local groupBy, you could do something like this:
df.groupby(C1,C2).agg(...).flatmap(x=>x.groupBy(C1)).agg
Thank you.
Daniel
On 29 Jan 2017, at 18:33, Mendelson, Assaf <mailto:assaf.mendel...@rsa.com> wrote:
Hi, Consider the following example: df.groupby(C1,C2).agg(s

forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Hi,
Consider the following example:
df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg)
By default, Spark would shuffle by the combination of C1 and C2 and then shuffle again by C1 only. This behavior makes sense when one uses C2 to salt C1 for skew
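To make the idea behind the question concrete, here is a minimal plain-Python sketch (not Spark code) of why partitioning by C1 alone could be enough for both aggregations: once every row with a given C1 sits in one partition, all of its (C1, C2) groups are in that partition too, so both group-bys can run locally. The function names are hypothetical, and a sum followed by a max are assumptions standing in for "some agg" and "some more agg". Whether and how Spark can be made to plan the query this way is not shown here.

```python
import zlib
from collections import defaultdict


def partition_by_c1(rows, num_partitions):
    """Simulate a single shuffle: route each (c1, c2, value) row by a hash of c1 only."""
    partitions = [[] for _ in range(num_partitions)]
    for c1, c2, value in rows:
        index = zlib.crc32(c1.encode("utf-8")) % num_partitions
        partitions[index].append((c1, c2, value))
    return partitions


def aggregate_partition(partition):
    """Run both group-bys inside one partition, with no further data movement."""
    # First aggregation (assumed): sum of value per (c1, c2).
    sums = defaultdict(int)
    for c1, c2, value in partition:
        sums[(c1, c2)] += value
    # Second aggregation (assumed): max of those sums per c1.
    result = {}
    for (c1, _c2), total in sums.items():
        result[c1] = max(result.get(c1, total), total)
    return result


def two_level_aggregate(rows, num_partitions=4):
    """Partition once by c1, aggregate each partition locally, then combine."""
    merged = {}
    for partition in partition_by_c1(rows, num_partitions):
        # Each c1 lives in exactly one partition, so keys never collide here.
        merged.update(aggregate_partition(partition))
    return merged


rows = [
    ("a", "x", 1), ("a", "x", 2), ("a", "y", 5),
    ("b", "x", 4), ("b", "z", 4), ("b", "z", 3),
]
result = two_level_aggregate(rows)
print(result)
```

The result does not depend on the number of partitions, which is the property the question relies on: the second group-by needs no extra exchange when the data is already clustered by C1.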