128 is the default parallelism defined for the cluster. The question now is why the keyBy operation is using the default parallelism instead of the number of partitions of the RDD created by the previous step (5580). Any clues?
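For reference, a minimal sketch (Scala, plain RDD API) of how one could check where the 128 comes from by printing partition counts around keyBy. The input path, key function, and the 5580 value are placeholders based on the numbers in this thread, not the actual job:

  import org.apache.spark.sql.SparkSession

  object PartitionCheck {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("partition-check")
        // Pin default parallelism explicitly instead of relying on the cluster default (128).
        .config("spark.default.parallelism", "5580")
        .getOrCreate()
      val sc = spark.sparkContext

      // Placeholder input; the real job's source is not shown in the thread.
      val previous = sc.textFile("s3://placeholder-bucket/input")
      println(s"partitions after previous step: ${previous.getNumPartitions}")

      // keyBy is a narrow, map-like transformation, so it should keep the
      // upstream partition count rather than the default parallelism.
      val keyed = previous.keyBy(line => line.hashCode)
      println(s"partitions after keyBy: ${keyed.getNumPartitions}")

      spark.stop()
    }
  }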
On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com) wrote:
> Hi,
> I am running a job in Spark (using AWS EMR) and some stages are taking a
> lot longer with Spark 2.4 than with Spark 2.3.1:
>
> Spark 2.4:
> [image: image.png]
>
> Spark 2.3.1:
> [image: image.png]
>
> With Spark 2.4, the keyBy operation takes more than 10X what it took with
> Spark 2.3.1.
> It seems to be related to the number of tasks / partitions.
>
> Questions:
> - Isn't the number of tasks of a job supposed to be related to the number
> of partitions of the RDD left by the previous job? Did that change in
> version 2.4?
> - Which tools/configuration can I try to reduce this severe performance
> degradation?
>
> Thanks.
> Pedro.
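As a possible workaround sketch (not the actual job), one could stop relying on spark.default.parallelism and pass an explicit partition count to the shuffle that follows keyBy. The reduceByKey call, the tab-separated key function, and the 5580 value below are assumptions for illustration only:

  import org.apache.spark.SparkContext

  object ExplicitPartitions {
    // Hypothetical helper: key the lines and reduce with an explicit partition
    // count so the task count no longer depends on spark.default.parallelism.
    def keyAndReduce(sc: SparkContext, path: String): Unit = {
      val data = sc.textFile(path)
      val keyed = data.keyBy(line => line.split('\t').head)

      // 5580 is taken from the partition count mentioned above; adjust to the real job.
      val reduced = keyed.reduceByKey((a, b) => a, 5580)
      println(s"reduce stage partitions: ${reduced.getNumPartitions}")
    }
  }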