Good question.  From what I have read, Spark is not a magician and can't
know how many tasks will work best for your input, so it can get it wrong.
Spark sets the default parallelism to twice the number of cores in the
cluster.
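For reference, here is a minimal sketch (assuming a regular Spark app; the
400 is just a placeholder value) of how to inspect that default and override
it if needed:

  import org.apache.spark.sql.SparkSession

  // Override the default parallelism at startup; 400 is just a placeholder.
  val spark = SparkSession.builder()
    .appName("parallelism-check")
    .config("spark.default.parallelism", "400")
    .getOrCreate()

  // Print what Spark actually picked as the default parallelism.
  println(spark.sparkContext.defaultParallelism)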
In my jobs, it seemed that using the parallelism inherited from the input
partitions sometimes worked well, and that was about 100x the default
parallelism.
When every job started to use the default parallelism (apparently when
switching from EMR 5.16 to 5.20), I first tried adding some repartitions,
but in some cases it made no difference: the repartition stage took as long
as the job I wanted to speed up (or failed outright).
Doing the repartition inside the operation on pair RDDs worked much better
(https://stackoverflow.com/questions/43027306/is-there-an-effective-partitioning-method-when-using-reducebykey-in-spark).
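To illustrate the difference (the names, input path and partition count
below are just placeholders), this is roughly what I mean by doing the
partitioning inside the pair-RDD operation instead of as a separate step:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("parallelism-sketch").getOrCreate()
  val sc = spark.sparkContext

  // Hypothetical pair RDD; in my case it comes from the job input.
  val pairsRdd = sc.textFile("s3://some-bucket/input/")
    .map(line => (line.split(",")(0), 1L))

  // What I tried first: a standalone repartition, which adds its own
  // shuffle stage before the real work.
  val viaRepartition = pairsRdd.repartition(2000).reduceByKey(_ + _)

  // What worked better: let reduceByKey shuffle directly into the desired
  // number of partitions (a custom Partitioner can also be passed).
  val viaNumPartitions = pairsRdd.reduceByKey(_ + _, 2000)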

It would be nice to have a more comprehensive look at which RDDs need more
or less parallelism.

Regards,
Pedro.

On Sat, Feb 23, 2019 at 9:27 PM, Yeikel (em...@yeikel.com)
wrote:

> I am following up on this question because I have a similar issue.
>
> When is it that we need to control the parallelism manually? Skewed
> partitions?
>
