In general, RDDs get partitioned automatically without programmer intervention, so you usually don't need to think about partitioning unless you need to adjust the number or size of partitions, or the partitioning scheme, to suit your application. Partitions are redistributed among nodes whenever a shuffle occurs. Note that explicitly changing the number of partitions does not always cause a shuffle: repartition() always triggers a full shuffle, while coalesce() can shrink the partition count without one.
In general, smaller, more numerous partitions let work be distributed across more workers, while larger, fewer partitions let work be done in bigger chunks, which can finish faster thanks to reduced overhead, provided all workers are kept busy. The number of partitions also determines how many files are generated by actions that save RDDs to files. Finally, the maximum size of any single partition is ultimately limited by the memory available to a single executor.

Rich

On Sep 22, 2015 6:42 PM, "XIANDI" <zxd_ci...@hotmail.com> wrote:
> I'm always confused by the partitions. We may have many RDDs in the code.
> Do we need to partition on all of them? Do the rdds get rearranged among
> all the nodes whenever we do a partition? What is a wise way of doing
> partitions?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-on-RDDs-tp24775.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.