Thanks, Jonathan. Actually, I'd like to use maximizeResourceAllocation.
Ideally, I'd like to add a new instance group with a single small box labelled AM. I'm not sure "aws emr create-cluster" supports setting custom labels; the only settings available are: InstanceCount=1,BidPrice=0.5,Name=sparkAM,InstanceGroupType=TASK,InstanceType=m3.xlarge. How can I specify the YARN label AM for that box?

On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly <jonathaka...@gmail.com> wrote:
> Interesting, I was not aware of spark.yarn.am.nodeLabelExpression.
>
> We do use YARN labels on EMR; each node is automatically labeled with its type (MASTER, CORE, or TASK). And we do set yarn.app.mapreduce.am.labels=CORE in yarn-site.xml, but we do not set spark.yarn.am.nodeLabelExpression.
>
> Does Spark somehow not actually honor this? It seems weird that Spark would have its own similar-sounding property (spark.yarn.am.nodeLabelExpression). If spark.yarn.am.nodeLabelExpression is used and yarn.app.mapreduce.am.labels ignored, I could be wrong about Spark AMs only running on CORE instances in EMR.
>
> I'm guessing though that spark.yarn.am.nodeLabelExpression would simply override yarn.app.mapreduce.am.labels, so yarn.app.mapreduce.am.labels would be treated as a default when it is set and spark.yarn.am.nodeLabelExpression is not. Is that correct?
>
> In short, Alex, you should not need to set any of the label-related properties yourself if you do what I suggested regarding using small CORE instances and large TASK instances. But if you want to do something different, it would also be possible to add a TASK instance group with small nodes and configured with some new label. Then you could set spark.yarn.am.nodeLabelExpression to that label.
>
> Thanks, Marcelo, for pointing out spark.yarn.am.nodeLabelExpression!
>
> ~ Jonathan
>
> On Tue, Feb 9, 2016 at 9:54 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> You should be able to use spark.yarn.am.nodeLabelExpression if your version of YARN supports node labels (and you've added a label to the node where you want the AM to run).
>>
>> On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>> > Am container starts first and yarn selects random computer to run it.
>> >
>> > Is it possible to configure yarn so that it selects small computer for am container.
>> >
>> > On Feb 9, 2016 12:40 AM, "Sean Owen" <so...@cloudera.com> wrote:
>> >>
>> >> If it's too small to run an executor, I'd think it would be chosen for the AM as the only way to satisfy the request.
>> >>
>> >> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>> >> > If I add additional small box to the cluster can I configure yarn to select small box to run am container?
>> >> >
>> >> > On Mon, Feb 8, 2016 at 10:53 PM, Sean Owen <so...@cloudera.com> wrote:
>> >> >>
>> >> >> Typically YARN is there because you're mediating resource requests from things besides Spark, so yeah using every bit of the cluster is a little bit of a corner case. There's not a good answer if all your nodes are the same size.
>> >> >>
>> >> >> I think you can let YARN over-commit RAM though, and allocate more memory than it actually has. It may be beneficial to let them all think they have an extra GB, and let one node running the AM technically be overcommitted, a state which won't hurt at all unless you're really really tight on memory, in which case something might get killed.
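
A minimal, untested sketch of the over-commit idea Sean describes above (the r3.2xlarge instance type, the 54272 MB value, and the configurations.json file name are illustrative assumptions, not values from this thread):

# Give each NodeManager slightly more memory than the executors actually need,
# so the one node that also hosts the ~896MB AM can be overcommitted by that amount.
# 54272 MB is hypothetical: roughly 53GB for executors plus ~1GB of headroom.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.memory-mb": "54272"
    }
  }
]
EOF

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type r3.2xlarge \
  --instance-count 5 \
  --configurations file://./configurations.json

The trade-off, as Sean notes, is that the node running the AM genuinely is overcommitted, so the headroom should stay small.
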
>> >> >>
>> >> >> On Tue, Feb 9, 2016 at 6:49 AM, Jonathan Kelly <jonathaka...@gmail.com> wrote:
>> >> >> > Alex,
>> >> >> >
>> >> >> > That's a very good question that I've been trying to answer myself recently too. Since you've mentioned before that you're using EMR, I assume you're asking this because you've noticed this behavior on emr-4.3.0.
>> >> >> >
>> >> >> > In this release, we made some changes to the maximizeResourceAllocation (which you may or may not be using, but either way this issue is present), including the accidental inclusion of somewhat of a bug that makes it not reserve any space for the AM, which ultimately results in one of the nodes being utilized only by the AM and not an executor.
>> >> >> >
>> >> >> > However, as you point out, the only viable fix seems to be to reserve enough memory for the AM on *every single node*, which in some cases might actually be worse than wasting a lot of memory on a single node.
>> >> >> >
>> >> >> > So yeah, I also don't like either option. Is this just the price you pay for running on YARN?
>> >> >> >
>> >> >> > ~ Jonathan
>> >> >> >
>> >> >> > On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov <apivova...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> Lets say that yarn has 53GB memory available on each slave
>> >> >> >>
>> >> >> >> spark.am container needs 896MB. (512 + 384)
>> >> >> >>
>> >> >> >> I see two options to configure spark:
>> >> >> >>
>> >> >> >> 1. configure spark executors to use 52GB and leave 1 GB on each box. So, some box will also run am container. So, 1GB memory will not be used on all slaves but one.
>> >> >> >>
>> >> >> >> 2. configure spark to use all 53GB and add additional 53GB box which will run only am container. So, 52GB on this additional box will do nothing
>> >> >> >>
>> >> >> >> I do not like both options. Is there a better way to configure yarn/spark?
>> >> >> >>
>> >> >> >> Alex
>>
>> --
>> Marcelo
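
For completeness, an untested sketch of the label-based placement Marcelo and Jonathan describe, assuming node labels are already enabled on the cluster (Jonathan says EMR labels nodes MASTER/CORE/TASK). The "AM" label name, the ip-10-0-0-42.ec2.internal host name, and my_job.py are made-up examples, not values from this thread:

# Register a new label and pin the small node added for the AM to it. As far as
# I know a node carries only one label at a time, so this replaces whatever
# label EMR already put on that node.
yarn rmadmin -addToClusterNodeLabels "AM"
yarn rmadmin -replaceLabelsOnNode "ip-10-0-0-42.ec2.internal=AM"

# Ask Spark to schedule only the AM on that label; executors are unaffected
# unless spark.yarn.executor.nodeLabelExpression is also set.
spark-submit \
  --master yarn \
  --conf spark.yarn.am.nodeLabelExpression=AM \
  my_job.py
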