Hi,
Thank you, Pedro.

I tested the maximizeResourceAllocation option. When it's enabled, Spark seems
to utilize the cores fully. However, the performance is not much different from
the default setting.

I'm considering using s3-dist-cp for uploading files. I also think
table (DataFrame) caching would be effective.
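
For the caching part, here is a minimal sketch of what I have in mind (the
S3 path and column name are just placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("CachingSketch").getOrCreate()

  // Read a table that several downstream steps reuse (placeholder path).
  val events = spark.read.parquet("s3://my-bucket/events/")

  // Cache it (in memory, spilling to disk if needed) so repeated actions
  // don't re-read the files from S3 every time.
  events.cache()
  events.count() // materializes the cache

  // Later jobs reuse the cached data instead of hitting S3 again.
  events.groupBy("event_type").count().show()

  // Release the memory when it's no longer needed.
  events.unpersist()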

Regards,
Hiroyuki

On Sat, Feb 2, 2019 at 1:12, Pedro Tuero <tuerope...@gmail.com> wrote:

> Hi Hiroyuki, thanks for the answer.
>
> I found a solution for the cores per executor configuration:
> I set this configuration to true:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
> Probably it was true by default in version 5.16, but I couldn't find when it
> changed.
> In the same link, it says that dynamic allocation is true by default. I
> thought it would do the trick, but reading it again I think it relates to
> the number of executors rather than the number of cores.
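> One quick way to check what the automatic configuration actually resolves to
> (a sketch, e.g. from a spark-shell on the cluster):
>
>   // Print the executor-related settings Spark ended up with for this app.
>   val conf = spark.sparkContext.getConf
>   Seq("spark.executor.instances",
>       "spark.executor.cores",
>       "spark.executor.memory",
>       "spark.dynamicAllocation.enabled")
>     .foreach(k => println(s"$k = ${conf.getOption(k).getOrElse("<not set>")}"))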
>
> But the jobs are still taking longer than before.
> Looking at the application history, I see these differences:
> For the same job, the same instance types, and the default (AWS-managed)
> configuration for executors, cores, and memory:
> Instances:
> 6 x r5.xlarge: 4 vCPUs, 32 GB of memory each (so there are 24 cores: 6
> instances * 4 cores).
>
> With 5.16:
> - 24 executors (4 per instance, including the one that also hosts the
> driver).
> - 4 cores each.
> - 2.7 GB * 2 (storage + on-heap storage) memory each.
> - 1 executor per core, but at the same time 4 cores per executor (?).
> - Total executor memory per instance: 21.6 GB (2.7 * 2 * 4).
> - Total elapsed time: 6 minutes.
> With 5.20:
> - 5 executors (1 per instance, 0 in the instance with the driver).
> - 4 cores each.
> - 11.9 GB * 2 (storage + on-heap storage) memory each.
> - Total executor memory per instance: 23.8 GB (11.9 * 2 * 1).
> - Total elapsed time: 8 minutes.
>
>
> I don't understand the 5.16 configuration, but it works better.
> It seems that in 5.20, a full instance is wasted on the driver alone,
> when it could also host an executor.
>
>
> Regards,
> Pedro.
>
>
>
> On Thu, Jan 31, 2019 at 20:16, Hiroyuki Nagata <idiotpan...@gmail.com>
> wrote:
>
>> Hi, Pedro
>>
>>
>> I've also started using AWS EMR, with Spark 2.4.0. I'm looking for
>> performance tuning methods.
>>
>> Do you configure dynamic allocation?
>>
>> FYI:
>>
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> I haven't tested it yet. I guess spark-submit needs to specify the number
>> of executors.
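>>
>> If it helps, this is roughly how I would try enabling it from the
>> application side (just a sketch; the same properties can also go on
>> spark-submit, and dynamic allocation requires the external shuffle
>> service):
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   // Let Spark grow and shrink the number of executors with the workload.
>>   val spark = SparkSession.builder()
>>     .appName("DynamicAllocationSketch")
>>     .config("spark.dynamicAllocation.enabled", "true")
>>     .config("spark.shuffle.service.enabled", "true")
>>     .config("spark.dynamicAllocation.minExecutors", "1")
>>     .config("spark.dynamicAllocation.maxExecutors", "20")
>>     .getOrCreate()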
>>
>> Regards,
>> Hiroyuki
>>
>> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero <tuerope...@gmail.com> wrote:
>>
>>> Hi guys,
>>> I usually run Spark jobs on AWS EMR.
>>> Recently I switched from EMR release label 5.16 to 5.20 (which uses Spark
>>> 2.4.0).
>>> I've noticed that a lot of steps are taking longer than before.
>>> I think it is related to the automatic configuration of cores per
>>> executor.
>>> In version 5.16, some executors took more cores if the instance allowed
>>> it.
>>> Say an instance had 8 cores and 40 GB of RAM, and the RAM configured per
>>> executor was 10 GB; then AWS EMR automatically assigned 2 cores per
>>> executor.
>>> Now in label 5.20, unless I configure the number of cores manually, only
>>> one core is assigned per executor.
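>>> (By "configure the number of cores manually" I mean something like the
>>> sketch below; the numbers are only an example, and the same properties
>>> can be passed to spark-submit instead.)
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   // Explicitly request 4 cores and 10 GB per executor instead of the
>>>   // 1-core default I'm seeing on emr-5.20.
>>>   val spark = SparkSession.builder()
>>>     .appName("ManualExecutorSizing")
>>>     .config("spark.executor.cores", "4")
>>>     .config("spark.executor.memory", "10g")
>>>     .getOrCreate()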
>>>
>>> I don't know if it is related to Spark 2.4.0 or if it is something
>>> managed by AWS...
>>> Does anyone know if there is a way to automatically use more cores when
>>> it is physically possible?
>>>
>>> Thanks,
>>> Peter.
>>>
>>
