That's the default number of shuffle partitions in Spark SQL (200). You can tune it with the spark.sql.shuffle.partitions property.
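As a minimal sketch (the app name and the value 8 below are only placeholders, pick a value that suits your data volume and cluster), you could set it on the SparkConf before creating the context, or change it at runtime on the SQLContext:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Illustrative only: app name and the value 8 are placeholders.
    val conf = new SparkConf()
      .setMaster("spark://nira-wso2:7077")
      .setAppName("shufflePartitionsDemo")
      .set("spark.sql.shuffle.partitions", "8") // default is 200

    val sqlCtx = new SQLContext(new SparkContext(conf))

    // Or change it at runtime, before running the group-by query:
    sqlCtx.setConf("spark.sql.shuffle.partitions", "8")

With a small result set like yours, a low value avoids scheduling tasks over mostly empty partitions.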
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra

On Tue, Aug 16, 2016 at 11:31 AM, Niranda Perera <niranda.per...@gmail.com> wrote:
> Hi,
>
> I ran the following simple code in a pseudo-distributed Spark cluster.
>
> object ScalaDummy {
>   def main(args: Array[String]) {
>     val appName = "testApp " + Calendar.getInstance().getTime
>     val conf: SparkConf = new SparkConf().setMaster("spark://nira-wso2:7077").setAppName(appName)
>
>     val propsFile: String = Thread.currentThread.getContextClassLoader.getResource("spark-defaults.conf").getPath
>     val properties: Map[String, String] = Utils.getPropertiesFromFile(propsFile)
>     conf.setAll(properties)
>
>     val sqlCtx: SQLContext = new SQLContext(new JavaSparkContext(conf))
>
>     sqlCtx.createDataFrame(Seq(
>       (83, 0, 38),
>       (26, 0, 79),
>       (43, 81, 24))).toDF("a", "b", "c").registerTempTable("table1")
>
>     val dataFrame: DataFrame = sqlCtx.sql("select count(*) from table1 group by a, b")
>     val partitionCount = dataFrame.rdd.partitions.length
>     dataFrame.show
>
>     println("Press any key to exit!")
>     Console.readLine()
>   }
> }
>
> I observed that the resultant DataFrame encapsulated a MapPartitionsRDD with 200 partitions. Please see the attached screenshots.
>
> I'm wondering where these 200 partitions came from. Is this the default number of partitions for a MapPartitionsRDD, or is there any way to control it?
> The result only contains 3 rows, so IMO having 200 partitions here is not efficient, since roughly 197 tasks are run on empty partitions, from what I could see.
>
> Am I doing something wrong here, or is this the expected behavior after a 'group by' query?
>
> Look forward to hearing from you!
>
> Best
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/