There are many ways to do this; one simple approach is the following.

For example, say you want to know how many rows there are for each month:

sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count


The output looks like:

month    count
201411    100
201412    200
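
If you need aggregates beyond a simple count, you can compute them all in one pass by grouping on the partition column. A minimal sketch, assuming a hypothetical table root path and a hypothetical numeric column named "value" (neither appears above):

import org.apache.spark.sql.functions._
import sqlContext.implicits._  // already in scope in spark-shell

// Point at the table root: partition discovery turns the month=...
// directory names into a "month" column automatically.
// "/path/to/table" is a hypothetical location.
val df = sqlContext.read.parquet("/path/to/table")

// One job computes every per-month aggregate at once, instead of
// issuing a separate query per partition.
// "value" is a hypothetical column, used only for illustration.
df.groupBy($"month")
  .agg(count("*").as("rows"),
       sum($"value").as("total"),
       avg($"value").as("mean"))
  .show()

A filter on the partition column (e.g. .where($"month" === 201411)) is pushed down as partition pruning, so Spark only reads the matching directories; there is no need to issue one query per month.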


Hope this helps.



> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
> 
> Hi there,
> 
> I have my data stored in HDFS partitioned by month in Parquet format.
> The directory looks like this:
> 
> -month=201411
> -month=201412
> -month=201501
> -....
> 
> I want to compute some aggregates for every timestamp.
> How is it possible to achieve that by taking advantage of the existing 
> partitioning?
> One naive way I am thinking of is issuing multiple SQL queries:
> 
> SELECT * FROM TABLE WHERE month=201411
> SELECT * FROM TABLE WHERE month=201412
> SELECT * FROM TABLE WHERE month=201501
> .....
> 
> computing the aggregates on the results of each query and combining them in 
> the end.
> 
> I think there should be a better way, right?
> 
> Thanks

