I tried caching the parent data set, but it slows down the execution time. The 
last column in the input data set is a double array, and the requirement is to 
add up the last-column double arrays after doing the group by. I have 
implemented an aggregation function which adds that last column, hence the 
query is:

SELECT count(*), col1, col2, col3, aggregationFunction(doublecol)
FROM table
GROUP BY col1, col2, col3
HAVING count(*) > 1
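
For reference, the aggregation function looks roughly like this (a simplified 
sketch, not the actual implementation; the class name ArraySumUDAF and the 
array-length parameter are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Element-wise sum of the double-array column, one summed array per group
class ArraySumUDAF(arrayLength: Int) extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("doublecol", ArrayType(DoubleType)) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)

  override def dataType: DataType = ArrayType(DoubleType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(arrayLength)(0.0)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val acc = buffer.getSeq[Double](0)
      val arr = input.getSeq[Double](0)
      buffer(0) = acc.zip(arr).map { case (a, b) => a + b }
    }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getSeq[Double](0)
    val b = buffer2.getSeq[Double](0)
    buffer1(0) = a.zip(b).map { case (x, y) => x + y }
  }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

// Registered for SQL use, e.g.:
// spark.udf.register("aggregationFunction", new ArraySumUDAF(arrayLength = N))  // N = actual array length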

The group-by columns in the above query will change from query to query, and I 
have to run 100 such queries on the same data set.
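
To illustrate, this is roughly how the driver loops over the different 
column combinations (a simplified sketch; the paths, the column lists, and the 
array length are just examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-queries").getOrCreate()
spark.udf.register("aggregationFunction", new ArraySumUDAF(arrayLength = 10))  // 10 is a placeholder

val df = spark.read.parquet("/data/input")   // placeholder path
df.createOrReplaceTempView("table")

// 100 different group-by column combinations; these three are just examples
val groupByCombos: Seq[Seq[String]] = Seq(
  Seq("col1", "col2", "col3"),
  Seq("col1", "col4", "col7"),
  Seq("col2", "col5", "col9")
)

groupByCombos.foreach { cols =>
  val colList = cols.mkString(", ")
  spark.sql(
    s"SELECT count(*), $colList, aggregationFunction(doublecol) " +
    s"FROM table GROUP BY $colList HAVING count(*) > 1")
    .write.mode("overwrite").parquet(s"/data/output/${cols.mkString("_")}")   // placeholder path
}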

Best Regards,
Anil Langote
+1-425-633-9747

> On Dec 21, 2016, at 11:41 AM, Anil Langote <anillangote0...@gmail.com> wrote:
> 
> Hi All,
> 
> I have a requirement where I have to run 100 group-by queries with different 
> columns. I have generated a Parquet data set which has 30 columns; 200 files 
> are generated, and I see every Parquet file has a different size. My question 
> is: what is the best approach for running group-by queries on Parquet files? 
> Are more files recommended, or should I create fewer files to get better 
> performance?
> 
> Right now, with 65 executors of 2 cores each on a 4-node cluster (320 cores 
> available), Spark takes 1.4 minutes on average to finish one query; we want to 
> tune that down to around 30 or 40 seconds per query. The HDFS block size is 
> 128 MB, Spark launches 2400 tasks, and the input dataset has 2252 partitions.
> 
> I have implemented threading in the Spark driver to launch all these queries 
> at the same time with the fair scheduler enabled; however, I see that most of 
> the time the jobs run sequentially.
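> 
> The threaded submission looks roughly like this (a simplified sketch; the 
> pool name, thread-pool size, and output paths are placeholders, "spark" is 
> the SparkSession, and spark.scheduler.mode=FAIR is assumed to be set on the 
> session):
> 
> import java.util.concurrent.Executors
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, ExecutionContext, Future}
> 
> implicit val ec: ExecutionContext =
>   ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))
> 
> // the 100 group-by queries, built elsewhere as SQL strings
> val queries: Seq[String] = Seq(
>   "SELECT count(*), col1, col2, col3, aggregationFunction(doublecol) " +
>     "FROM table GROUP BY col1, col2, col3 HAVING count(*) > 1"
>   // ...
> )
> 
> val futures = queries.zipWithIndex.map { case (q, i) =>
>   Future {
>     // setLocalProperty is thread-local, so the pool must be set inside each thread
>     spark.sparkContext.setLocalProperty("spark.scheduler.pool", "queryPool")
>     spark.sql(q).write.mode("overwrite").parquet(s"/output/query_$i")   // placeholder path
>   }
> }
> 
> futures.foreach(f => Await.ready(f, Duration.Inf))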
> 
> Any input in this regard is appreciated.
> 
> Best Regards,
> Anil Langote
> +1-425-633-9747
