There are many ways. One simple example: say you want to know how many rows there are for each month:
sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count()

The output looks like:

  month   count
  201411  100
  201412  200

Hope this helps.

> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
>
> Hi there,
>
> I have my data stored in HDFS partitioned by month in Parquet format.
> The directory looks like this:
>
> -month=201411
> -month=201412
> -month=201501
> -....
>
> I want to compute some aggregates for every timestamp.
> How is it possible to achieve that by taking advantage of the existing
> partitioning?
> One naive way I am thinking is issuing multiple sql queries:
>
> SELECT * FROM TABLE WHERE month=201411
> SELECT * FROM TABLE WHERE month=201412
> SELECT * FROM TABLE WHERE month=201501
> .....
>
> computing the aggregates on the results of each query and combining them in
> the end.
>
> I think there should be a better way right?
>
> Thanks
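To address the timestamp aggregation in the quoted question more directly: partition discovery exposes month as a regular column, so a single groupBy over all months replaces the per-month queries, and a filter on month is pruned down to just the matching directories. A rough sketch, assuming a hypothetical base path /data/events and a timestamp column named ts (adjust both to your schema):

  import sqlContext.implicits._

  // point at the base directory; Spark discovers the month=* subdirectories
  // and adds "month" as a column via partition discovery
  val df = sqlContext.read.parquet("/data/events")

  // one job computes the count for every (month, timestamp) pair;
  // swap count() for agg(...) if you need other aggregates
  df.groupBy($"month", $"ts").count().show()

  // a filter on the partition column is pushed down, so only the
  // month=201411 directory is actually scanned
  df.filter($"month" === 201411).groupBy($"ts").count().show()

This way Spark does the split-by-partition work for you, and you avoid combining the results of multiple queries by hand.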