That's the default number of shuffle partitions in Spark SQL (200). You can tune it with the spark.sql.shuffle.partitions property.
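As a minimal sketch (the app name and the value 8 below are only placeholders, pick a value that suits your data volume and cluster), you could set it on the SparkConf before creating the context, or change it at runtime on the SQLContext:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Illustrative only: app name and the value 8 are placeholders.
    val conf = new SparkConf()
      .setMaster("spark://nira-wso2:7077")
      .setAppName("shufflePartitionsDemo")
      .set("spark.sql.shuffle.partitions", "8") // default is 200

    val sqlCtx = new SQLContext(new SparkContext(conf))

    // Or change it at runtime, before running the group-by query:
    sqlCtx.setConf("spark.sql.shuffle.partitions", "8")

With a small result set like yours, a low value avoids scheduling tasks over mostly empty partitions.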
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra

On Tue, Aug 16, 2016 at 11:31 AM, Niranda Perera <niranda.per...@gmail.com> wrote:
> Hi,
>
> I ran the following simple code in a pseudo-distributed Spark cluster.
>
> object ScalaDummy {
>   def main(args: Array[String]) {
>     val appName = "testApp " + Calendar.getInstance().getTime
>     val conf: SparkConf = new SparkConf().setMaster("spark://nira-wso2:7077").setAppName(appName)
>
>     val propsFile: String = Thread.currentThread.getContextClassLoader.getResource("spark-defaults.conf").getPath
>     val properties: Map[String, String] = Utils.getPropertiesFromFile(propsFile)
>     conf.setAll(properties)
>
>     val sqlCtx: SQLContext = new SQLContext(new JavaSparkContext(conf))
>
>     sqlCtx.createDataFrame(Seq(
>       (83, 0, 38),
>       (26, 0, 79),
>       (43, 81, 24))).toDF("a", "b", "c").registerTempTable("table1")
>
>     val dataFrame: DataFrame = sqlCtx.sql("select count(*) from table1 group by a, b")
>     val partitionCount = dataFrame.rdd.partitions.length
>     dataFrame.show
>
>     println("Press any key to exit!")
>     Console.readLine()
>   }
> }
>
> I observed that the resultant DataFrame encapsulated a MapPartitionsRDD with 200 partitions. Please see the attached screenshots.
>
> I'm wondering where these 200 partitions came from. Is this the default number of partitions for a MapPartitionsRDD, or is there any way to control it?
> The result only contains 3 rows, so IMO having 200 partitions here is not efficient, since roughly 197 tasks are run on empty partitions, from what I could see.
>
> Am I doing something wrong here, or is this the expected behavior after a 'group by' query?
>
> Look forward to hearing from you!
>
> Best
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/