Yes, it results in a shuffle.
> On Dec 4, 2015, at 6:04 PM, Stephen Boesch <[email protected]> wrote:
>
> @Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not?
>
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu <[email protected]>:
>
> There are many ways; one simple one is as follows.
>
> Say you want to know how many rows there are for each month:
>
> sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
>
> The output looks like:
>
> month    count
> 201411   100
> 201412   200
>
> Hope this helps.
>
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <[email protected]> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS, partitioned by month in Parquet format.
> > The directory layout looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -....
> >
> > I want to compute some aggregates for every timestamp.
> > How can I achieve that while taking advantage of the existing partitioning?
> > One naive approach I am considering is issuing multiple SQL queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .....
> >
> > computing the aggregates on the results of each query, and combining them at the end.
> >
> > I think there should be a better way, right?
> >
> > Thanks
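For reference, the single-query approach discussed in the thread can be sketched as below. This is a hedged sketch, not code from the thread: the base path /data/events and the `sqlContext` name are assumptions, and it targets the Spark 1.x SQLContext API that the quoted snippet uses.

```scala
// Assumed: a Spark 1.x SQLContext named `sqlContext`, and a hypothetical
// base path /data/events containing month=YYYYMM subdirectories.

// Partition discovery turns the directory names into a virtual "month" column.
val df = sqlContext.read.parquet("/data/events")

// One query over all partitions. The groupBy does incur a shuffle (as the
// reply above confirms), but since "month" is a partition column, Parquet
// column pruning means very little data is actually read per file.
df.groupBy("month").count().show()

// By contrast, filtering on the partition column prunes directories at
// planning time, so only the month=201411 files are ever read - this is
// what makes the single filtered query cheaper than a full scan.
df.filter("month = 201411").count()
```

Either way, a single DataFrame query over the partitioned table avoids the hand-rolled loop of per-month SQL queries from the original question, while partition pruning keeps the I/O comparable.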
