Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle.

> On Dec 4, 2015, at 6:04 PM, Stephen Boesch wrote:
>
> @Yu Fengdong: Your approach - specifically the groupBy results in a shuffle
> does it not?
>
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu:
> There are many ways, one simple i

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not?

2015-12-04 2:02 GMT-08:00 Fengdong Yu:
> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
> sqlContext.read.parquet("……../month=*").select($"month").groupB

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways. One simple approach, for example if you want to know how many rows there are for each month:

    sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count

The output looks like:

    month   count
    201411  100
    201412  200

Hope this helps.

> On Dec 4, 2015, at 5:53 PM, Yiannis