Hi, thank you Pedro. I tested the maximizeResourceAllocation option. When it's enabled, Spark seems to utilize the cores fully. However, the performance is not much different from the default setting.
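(For reference, a minimal PySpark sketch, assuming a session running on the EMR cluster, to print the executor settings that maximizeResourceAllocation actually produced. The config keys are standard Spark properties; the "not set" fallback string is just illustrative.)

    from pyspark.sql import SparkSession

    # Attach to (or create) the session and print the effective executor settings.
    spark = SparkSession.builder.appName("conf-check").getOrCreate()
    for key in ("spark.executor.instances",
                "spark.executor.cores",
                "spark.executor.memory",
                "spark.dynamicAllocation.enabled"):
        # Returns the effective value, or the fallback when the key is unset.
        print(key, "=", spark.conf.get(key, "not set"))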
I'm considering using s3-dist-cp for uploading files. And I think table (DataFrame) caching is also effective.

Regards,
Hiroyuki

On Sat, Feb 2, 2019 at 1:12, Pedro Tuero <tuerope...@gmail.com> wrote:

> Hi Hiroyuki, thanks for the answer.
>
> I found a solution for the cores-per-executor configuration:
> I set this configuration to true:
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
> It was probably true by default in version 5.16, but I couldn't find when that changed.
> The same link says that dynamic allocation is true by default. I thought that would do the trick, but reading it again I think it relates to the number of executors rather than the number of cores.
>
> But the jobs are still taking longer than before.
> Looking at the application history, I see these differences:
> For the same job, the same instance types, and the default (AWS-managed) configuration for executors, cores, and memory:
> Instances:
> 6 r5.xlarge: 4 vCPUs, 32 GB of memory each (so there are 24 cores: 6 instances * 4 cores).
>
> With 5.16:
> - 24 executors (4 on each instance, including the one that also hosted the driver).
> - 4 cores each.
> - 2.7 * 2 (storage + on-heap storage) memory each.
> - 1 executor per core, but at the same time 4 cores per executor (?).
> - Total executor memory per instance: 21.6 (2.7 * 2 * 4).
> - Total elapsed time: 6 minutes.
>
> With 5.20:
> - 5 executors (1 on each instance, 0 on the instance with the driver).
> - 4 cores each.
> - 11.9 * 2 (storage + on-heap storage) memory each.
> - Total executor memory per instance: 23.8 (11.9 * 2 * 1).
> - Total elapsed time: 8 minutes.
>
> I don't understand the 5.16 configuration, but it works better.
> It seems that in 5.20 a full instance is wasted on the driver alone, when it could also host an executor.
>
> Regards,
> Pedro.
>
> On Thu, Jan 31, 2019 at 20:16, Hiroyuki Nagata <idiotpan...@gmail.com> wrote:
>
>> Hi, Pedro
>>
>> I have also started using AWS EMR, with Spark 2.4.0, and I'm looking for performance-tuning methods.
>>
>> Do you configure dynamic allocation?
>>
>> FYI:
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> I haven't tested it yet. I guess spark-submit needs to specify the number of executors.
>>
>> Regards,
>> Hiroyuki
>>
>> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero <tuerope...@gmail.com> wrote:
>>
>>> Hi guys,
>>> I usually run Spark jobs on AWS EMR.
>>> I recently switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0).
>>> I've noticed that a lot of steps are taking longer than before.
>>> I think it is related to the automatic configuration of cores per executor.
>>> In version 5.16, some executors took more cores if the instance allowed it.
>>> Say an instance had 8 cores and 40 GB of RAM, and the memory configured per executor was 10 GB; then AWS EMR automatically assigned 2 cores per executor.
>>> Now with label 5.20, unless I configure the number of cores manually, only one core is assigned per executor.
>>>
>>> I don't know if it is related to Spark 2.4.0 or if it is something managed by AWS...
>>> Does anyone know if there is a way to automatically use more cores when it is physically possible?
>>>
>>> Thanks,
>>> Peter.
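(As a footnote on the cores-per-executor question above: a hedged PySpark sketch of pinning the executor sizing explicitly instead of relying on the EMR defaults. The values 4 cores / 10g / 6 executors are illustrative only and would need to match the instance type; on EMR these properties are usually passed via spark-submit --conf or the cluster configuration JSON rather than set in code.)

    from pyspark.sql import SparkSession

    # Explicit executor sizing (illustrative values). These are static configs,
    # so they only take effect when this call actually creates the session.
    spark = (SparkSession.builder
             .appName("manual-executor-sizing")
             .config("spark.executor.cores", "4")       # cores per executor
             .config("spark.executor.memory", "10g")    # heap per executor
             .config("spark.executor.instances", "6")   # initial executor count
             .getOrCreate())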