In general, RDDs get partitioned automatically without programmer
intervention. You usually don't need to worry about partitioning unless
you need to adjust the size/number of partitions or the partitioning
scheme to fit the needs of your application. Partitions get redistributed
among nodes whenever a shuffle occurs. Repartitioning often causes a
shuffle, but not in every case: for example, coalescing an RDD down to
fewer partitions can merge partitions locally without shuffling.
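
For example (a minimal sketch against the plain RDD API; assumes sc is an
existing SparkContext):

    // Start with 8 partitions.
    val rdd = sc.parallelize(1 to 1000000, 8)
    println(rdd.partitions.length)       // 8

    // repartition() always performs a full shuffle.
    val more = rdd.repartition(16)

    // coalesce() down to fewer partitions merges them locally with no
    // shuffle; coalesce(n, shuffle = true) behaves like repartition(n).
    val fewer = rdd.coalesce(2)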

In general, smaller/more numerous partitions let work be spread across
more workers, while larger/fewer partitions let work be done in bigger
chunks with less scheduling overhead, which may finish the job faster as
long as all workers are kept busy. Also, the number of partitions
determines how many files get generated by actions that save RDDs to
files.
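
For instance (a sketch; the output paths are hypothetical):

    // saveAsTextFile() writes one part-NNNNN file per partition,
    // so this produces 8 files for the 8-partition RDD above.
    rdd.saveAsTextFile("hdfs:///tmp/out")

    // Coalesce first to control the number of output files.
    rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/out-single")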

The maximum size of any one partition is ultimately limited by the
memory available to a single executor.
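
One rough way to spot oversized or skewed partitions is to count the
records in each one (a sketch; record count is only a proxy for byte
size, assuming records are similarly sized):

    // Emit (partition index, record count) pairs, one per partition.
    val counts = rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
    counts.foreach { case (idx, n) => println(s"partition $idx: $n records") }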

Rich
On Sep 22, 2015 6:42 PM, "XIANDI" <zxd_ci...@hotmail.com> wrote:

> I'm always confused by the partitions. We may have many RDDs in the code.
> Do we need to partition on all of them? Do the RDDs get rearranged among
> all the nodes whenever we do a partition? What is a wise way of doing
> partitions?
