To me, it seems like the data being processed on the two systems is not
identical. I can't think of any other reason why the single-task stage would
get a different number of input records in the two cases. 700 GB of input to
a single task is not good, and it looks like the bottleneck.
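
A quick way to rule that out is to compare the input on both sides before
the final aggregate, e.g. (a rough spark-shell sketch; the path and the
"key" column are placeholders, and the path scheme will differ between
s3:// and gs://):

  val df = spark.read.parquet("s3://bucket/input/")  // gs://... on Dataproc

  println(s"records          = ${df.count()}")
  println(s"distinct keys    = ${df.select("key").distinct().count()}")
  println(s"input partitions = ${df.rdd.getNumPartitions}")

If the record or key counts differ, the data really isn't identical; if they
match, the difference has to be in the plan or the configuration.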

On Wed, 25 May 2022, 06:32 Ori Popowski, <ori....@gmail.com> wrote:

> Hi,
>
> Both jobs use spark.dynamicAllocation.enabled, so there's no need to
> change the number of executors. There are 702 executors in the Dataproc
> cluster, so this is not the problem.
> As for the number of partitions - I didn't change it, and it's still 400.
> While writing this I'm realising that I have more partitions than
> executors, but the same situation applies to EMR.
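>
> Something like this in spark-shell should confirm the effective settings
> on both clusters (just a sketch; spark and sc are the shell's built-in
> session and context):
>
>   println(spark.conf.get("spark.dynamicAllocation.enabled"))  // expect true
>   println(spark.conf.get("spark.sql.shuffle.partitions"))     // expect 400
>   println(sc.defaultParallelism)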
>
> I am observing 1 task in the final stage on EMR as well. The difference is
> that on EMR that task receives about 50K of data, while on Dataproc it
> receives 700 GB. I don't understand why this is happening. It could mean
> that the execution graph is different, but the job is exactly the same.
> Could it be because the minor version of Spark is different?
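>
> One way to check whether the graph really differs would be to compare the
> physical plans produced by the two versions (sketch only; "result" stands
> for the job's final DataFrame):
>
>   println(spark.version)  // 2.4.4 on EMR, 2.4.8 on Dataproc
>   result.explain(true)    // extended plan: compare the exchanges side by side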
>
> On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee <ranadi...@gmail.com>
> wrote:
>
>> Hi Ori,
>>
>> A single task in the final step can result from various scenarios, such
>> as an aggregate operation that produces only one value (e.g. a count) or a
>> key-based aggregate with only one key. There could be other scenarios as
>> well. However, that would be the case on both EMR and Dataproc if the same
>> code is run on the same data in both cases.
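>>
>> For illustration, both of these patterns funnel everything into one
>> reducer (df and the column names are placeholders):
>>
>>   import org.apache.spark.sql.functions._
>>
>>   // A global aggregate produces a single output row, so the final stage
>>   // runs as a single task.
>>   val total = df.agg(count("*"))
>>
>>   // A key-based aggregate over data that happens to contain only one
>>   // distinct key shuffles every row to the same reducer, so one task
>>   // ends up doing all the work.
>>   val perKey = df.groupBy("key").agg(sum("x"))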
>>
>> On a separate note, since you have now changed the size and number of
>> nodes, you may need to re-optimize the number and size of executors for
>> the job, and perhaps the number of partitions as well, to make optimal use
>> of the cluster resources.
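>>
>> As a rough illustration only (the numbers are not a recommendation, just
>> sized against 64-core / 512 GiB workers with headroom for YARN and the
>> node daemons; the same properties can equally go on spark-submit):
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   // With dynamic allocation on, the executor count still scales on its
>>   // own; only the per-executor shape and the partition count need tuning.
>>   val spark = SparkSession.builder()
>>     .config("spark.executor.cores", "5")
>>     .config("spark.executor.memory", "34g")
>>     .config("spark.executor.memoryOverhead", "4g")
>>     .config("spark.sql.shuffle.partitions", "400")
>>     .getOrCreate()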
>>
>> Regards,
>> Ranadip
>>
>> On Tue, 24 May 2022, 10:45 Ori Popowski, <ori....@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark 2.4.8.
>>> I am creating a cluster with the exact same configuration, except that
>>> the original cluster uses 78 workers with 96 CPUs and 768 GiB of memory
>>> each, while the new cluster uses 117 machines with 64 CPUs and 512 GiB
>>> each, to keep the total resources the same (78 x 96 = 117 x 64 = 7,488
>>> cores, and 78 x 768 = 117 x 512 = 59,904 GiB).
>>>
>>> The job runs with the same configuration (number of partitions,
>>> parallelism, etc.) and reads the same data. However, something strange
>>> happens and the job takes 20 hours. What I observed is that there is a
>>> stage where the driver instantiates a single task, and this task never
>>> starts because shuffling all the data to it takes forever.
>>>
>>> I also compared the runtime configuration and found some minor
>>> differences (due to Dataproc being different from EMR) but I haven't found
>>> any substantial difference.
>>>
>>> In other stages the cluster utilizes all 400 partitions, and it's not
>>> clear to me why Spark decides to schedule only a single task here.
>>>
>>> Can anyone provide an insight as to why such a thing would happen?
>>>
>>> Thanks
>>>
>>>