Good question. From what I have read, Spark is not a magician and cannot know how many tasks will work best for your input, so it can get it wrong. Spark sets the default parallelism to twice the number of cores in the cluster. In my jobs, the parallelism inherited from the input partitions sometimes worked well, and it was about 100x the default parallelism. When every job started to use the default parallelism (apparently when switching from EMR 5.16 to 5.20), I first tried adding some repartitions, but in some cases that did not help: the repartition step took as long as the job I wanted to speed up (or failed outright). Doing the repartition inside the operation on the pair RDD itself worked much better (https://stackoverflow.com/questions/43027306/is-there-an-effective-partitioning-method-when-using-reducebykey-in-spark).
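To illustrate what I mean by "doing the repartition inside the operation", here is a minimal sketch. The input path, key/value types, and the partition count of 1000 are made up for the example; the point is only that reduceByKey takes the number of partitions directly, so the shuffle produces the partitioning you want instead of needing a separate repartition() step:

    import org.apache.spark.sql.SparkSession

    object ReduceByKeyPartitioning {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("reduceByKey-partitioning").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical pair RDD; in my jobs this came from the input files.
        val pairs = sc.textFile("s3://some-bucket/input/*")
          .map(line => (line.split(",")(0), 1L))

        // Instead of a separate repartition() step (which took as long as the
        // job itself, or failed), pass the target number of partitions to the
        // shuffle operation. 1000 here is only an example value.
        val counts = pairs.reduceByKey(_ + _, 1000)

        counts.saveAsTextFile("s3://some-bucket/output/")
        spark.stop()
      }
    }

The same idea works for other pair-RDD shuffle operations (aggregateByKey, groupByKey, join) that accept a numPartitions argument or a Partitioner, as discussed in the Stack Overflow answer linked above.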
It would be nice to have a more comprehensive look at which RDDs need more or less parallelism.

Regards,
Pedro

On Sat, Feb 23, 2019 at 21:27, Yeikel (em...@yeikel.com) wrote:

> I am following up on this question because I have a similar issue.
>
> When is it that we need to control the parallelism manually? Skewed
> partitions?