Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-31 Thread Gourav Sengupta
Hi, just to elaborate on what Ranadip has correctly pointed out here: gzip files are read by only one executor, whereas a bzip2 file can be read by multiple executors, so reading is parallelised and faster. Try to use bzip2 for Kafka Connect. Regards, Gourav Sengupta On Mon,
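To make the splittability point concrete, here is a toy Python sketch (my own illustration, not from the thread). Real Hadoop readers split a single bzip2 stream at its internal block markers; this sketch approximates that with independently compressed bzip2 streams, which is enough to show why a gzip stream cannot be carved up the same way:

```python
import bz2
import gzip

data1, data2 = b"a" * 1000, b"b" * 1000

# bzip2: build a file from independent streams. A reader that starts at a
# stream boundary can decode its slice without seeing the rest of the file.
part1, part2 = bz2.compress(data1), bz2.compress(data2)
bz_file = part1 + part2
assert bz2.decompress(bz_file) == data1 + data2       # whole file still valid
assert bz2.decompress(bz_file[len(part1):]) == data2  # second slice alone

# gzip: one DEFLATE stream. A slice taken from the middle has no header and
# no synchronisation point, so it cannot be decoded independently.
gz_file = gzip.compress(data1 + data2)
try:
    gzip.decompress(gz_file[len(gz_file) // 2:])
    independently_readable = True
except Exception:
    independently_readable = False
assert not independently_readable
```

This is why a splittable codec lets Spark assign one read task per split, while a gzip file pins the entire read to a single task.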

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ranadip Chatterjee
Gzip files are not splittable. Hence, using very large (i.e. non-partitioned) gzip files leads to contention when reading the files, as readers cannot scale beyond the number of gzip files to read. Better to use a splittable compression format instead to allow frameworks to scale up. Or manually manage
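A back-of-the-envelope model of that scaling limit (my own sketch; the 128 MB split size is a common HDFS/Spark default, not a value from the thread):

```python
import math

def max_read_tasks(file_sizes_mb, split_size_mb=128, splittable=True):
    """Rough upper bound on parallel read tasks for a set of input files."""
    if splittable:
        # One task per split: a large file fans out across many readers.
        return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)
    # Non-splittable (e.g. gzip): at most one reader per file.
    return len(file_sizes_mb)

# A single 700 GB file, as discussed in this thread:
assert max_read_tasks([700 * 1024], splittable=False) == 1
assert max_read_tasks([700 * 1024], splittable=True) == 5600
```

Under this model, 702 executors sit idle behind one gzip reader, while a splittable format (or many smaller gzip files) would let the read stage use them all.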

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ori Popowski
Thanks. Eventually the problem was solved. I am still not 100% sure what caused it, but when I said the input was identical I simplified a bit, because it was not (sorry for misleading; I thought this information would just be noise). Explanation: the input to the EMR job was gzips created by Fireho

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-27 Thread Aniket Mokashi
+cloud-dataproc-discuss On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee wrote: > To me, it seems like the data being processed on the 2 systems is not > identical. Can't think of any other reason why the single task stage will > get a different number of input records in the 2 cases. 700gb o

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-25 Thread Ranadip Chatterjee
To me, it seems like the data being processed on the 2 systems is not identical. I can't think of any other reason why the single-task stage would get a different number of input records in the 2 cases. 700gb of input to a single task is not good, and seems to be the bottleneck. On Wed, 25 May 2022,

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ori Popowski
Hi, Both jobs use spark.dynamicAllocation.enabled, so there's no need to change the number of executors. There are 702 executors in the Dataproc cluster, so this is not the problem. About the number of partitions: I didn't change this, and it's still 400. While writing this now, I am realising that I ha
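For reference, the settings mentioned above map to standard Spark configuration keys. A hypothetical spark-submit invocation (the script name is a placeholder; whether "400 partitions" means spark.sql.shuffle.partitions for DataFrame jobs or spark.default.parallelism for RDD jobs is not stated in the thread, so both are shown as an assumption):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.default.parallelism=400 \
  your_job.py   # placeholder script name
```

Note that classic dynamic allocation on YARN also requires the external shuffle service, hence the second flag.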

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ranadip Chatterjee
Hi Ori, A single task for the final step can result from various scenarios, like an aggregate operation that results in only 1 value (e.g. count), or a key-based aggregate with only 1 key, for example. There could be other scenarios as well. However, that would be the case in both EMR and Dataproc if
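A minimal sketch of why a key-based aggregate with a single key collapses into one task (a simplified stand-in for Spark's hash partitioner, not its actual implementation):

```python
def partition_for(key, num_partitions):
    # Simplified hash partitioner: records with equal keys always map to
    # the same shuffle partition.
    return hash(key) % num_partitions

num_partitions = 400
records = [("the-only-key", value) for value in range(100_000)]

# Every record hashes to one partition, so a single task does all the work,
# no matter how many partitions or executors are configured.
used_partitions = {partition_for(key, num_partitions) for key, _ in records}
assert len(used_partitions) == 1
```

With skew like this, adding executors does not help; the usual fixes are salting the key or restructuring the aggregation so work spreads across partitions.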