Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal, spark.sql.shuffle.partitions is the property that controls the number of tasks after a shuffle (to generate t2, there is a shuffle for the aggregations specified by groupBy and agg). You can set it with sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or sqlContext.sql("SET spark.sql.shuffle.partitions=newNumber").
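
A minimal sketch of the two ways to set the property, assuming a HiveContext built on an existing SparkContext `sc`; the table name SalesJan2009 comes from the original question, and the column names in the groupBy/agg call are illustrative assumptions, not from the thread:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.count

    val hiveContext = new HiveContext(sc)

    // Option 1: set the property programmatically.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")

    // Option 2: set it through a SQL statement.
    hiveContext.sql("SET spark.sql.shuffle.partitions=10")

    // The stage created by the groupBy/agg shuffle will now run 10 tasks.
    val t1 = hiveContext.sql("FROM SalesJan2009 select *")
    val t2 = t1.groupBy("Country").agg(count("Price"))   // hypothetical columns
    t2.show()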

dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Neal Yin
I am having trouble controlling the number of Spark tasks for a stage. This is on the latest spark 1.3.x source code build.

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sc.getConf.get("spark.default.parallelism")   // set to 10
    val t1 = hiveContext.sql("FROM SalesJan2009 select * ")
    val
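
The message is cut off in the archive, but a minimal sketch of how to inspect the post-shuffle task count for such a setup follows; the groupBy/agg call and column names for t2 are assumptions filled in from the reply above, not from the original message:

    import org.apache.spark.sql.functions.count

    val t1 = hiveContext.sql("FROM SalesJan2009 select *")
    val t2 = t1.groupBy("Country").agg(count("Price"))   // hypothetical aggregation; triggers a shuffle

    // Number of partitions (and therefore tasks) in the shuffled stage.
    // By default this follows spark.sql.shuffle.partitions (200), not
    // spark.default.parallelism, which is why the stage does not show 10 tasks.
    println(t2.rdd.partitions.length)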