The instantaneous fair share is what queue B should get, according to the code
(and my experience). Assuming your queues all have equal weight, it would be
10TB / 2.
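To make the distinction concrete, here is a quick back-of-the-envelope sketch
of the two numbers (plain Java, just the arithmetic rather than the actual
FairScheduler code; the class name is mine, and it assumes equal weights, 230
configured queues and only 2 of them with apps running or pending):

    // Rough sketch only, not FairScheduler itself. Steady-state fair share
    // divides the cluster across all configured queues; instantaneous fair
    // share divides it across the currently active queues only.
    public class FairShareSketch {
        public static void main(String[] args) {
            final double clusterMemoryTb = 10.0; // total cluster memory
            final int configuredQueues = 230;    // all leaf queues under root
            final int activeQueues = 2;          // queues with running/pending apps

            double steadyStateTb = clusterMemoryTb / configuredQueues; // ~0.043 TB (~45 GB)
            double instantaneousTb = clusterMemoryTb / activeQueues;   // 5 TB

            System.out.printf("steady-state: %.3f TB, instantaneous: %.1f TB%n",
                    steadyStateTb, instantaneousTb);
        }
    }

Since resToPreempt works off the instantaneous number, I'd expect preemption to
push queue B up toward roughly 5TB / 1500 cores, not the ~45GB / 13 cores you
are seeing.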
I can't help much more unless I can see your config files and, ideally, the
YARN Scheduler UI, to get an idea of what your queues and actual resource
usage look like. Logs from each of your Spark applications would also be
useful. Basically, the more info the better.

On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:

> Hi Hamel,
>
> Thanks for looking into the issue. What I am not understanding is: after
> preemption, what share does the second queue get if the first queue holds
> the entire cluster resource without releasing it? Is it the instantaneous
> fair share or the (steady-state) fair share?
>
> We have queues A and B (230 queues in total), and the total cluster
> resource is 10TB, 3000 cores. If a job is submitted into queue A, it gets
> 10TB, 3000 cores and does not release any of it. Now if a second job is
> submitted into queue B, preemption will definitely happen, but what share
> will queue B get after preemption? *Is it <10TB, 3000> / 2 or
> <10TB, 3000> / 230?*
>
> We find that after preemption queue B gets only <10TB, 3000> / 230,
> because the first job is still holding the resources. If the first job
> releases its resources, the second queue gets <10TB, 3000> / 2 based on
> higher priority and reservation.
>
> The question is: how much does preemption try to take back from queue A
> when it holds the entire cluster resource without releasing it? We are not
> able to share the actual configuration, but the answer to this question
> will help us.
>
> Thanks,
> Prabhu Joseph
>
> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com>
> wrote:
>
>> If all queues are identical, this behavior should not be happening.
>> Preemption as designed in the Fair Scheduler (IIRC) takes place based on
>> the instantaneous fair share, not the steady-state fair share. The Fair
>> Scheduler docs
>> <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
>> aren't super helpful on this, but the Monitoring section does say that
>> preemption won't take place if you are below your instantaneous fair
>> share (which might imply that it does occur once you are over your
>> instantaneous fair share and someone else has requested resources). The
>> code for FairScheduler.resToPreempt
>> <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29>
>> also seems to use getFairShare() rather than getSteadyFairShare() for
>> preemption, which would imply that it uses the instantaneous fair share
>> rather than the steady-state one.
>>
>> Could you share your YARN site/fair-scheduler and Spark configurations?
>> Could you also share the YARN Scheduler UI (specifically the top of the
>> RM page, which shows how many resources are in use)?
>>
>> Since it's not likely due to the steady-state fair share, some other
>> possible reasons why this might be happening (not remotely conclusive,
>> but with no other information this is what comes to mind; see the
>> snippets just after the list):
>> - You're not reaching
>> yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps
>> because of a core/memory ratio inconsistency with the cluster.
>> - Your second job doesn't have a sufficient level of parallelism to
>> request more executors than it is receiving (perhaps there are fewer than
>> 13 tasks at any point in time) and you don't have
>> spark.dynamicAllocation.minExecutors set.
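>> For reference, these are the settings I mean; the values below are only
>> placeholders for illustration, so check what your cluster actually has
>> rather than copying them. In yarn-site.xml:
>>
>>   <property>
>>     <name>yarn.scheduler.fair.preemption</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
>>     <value>0.8</value>
>>   </property>
>>
>> and in spark-defaults.conf (or via --conf) for the second job:
>>
>>   spark.dynamicAllocation.enabled       true
>>   spark.dynamicAllocation.minExecutors  16
>>
>> If overall utilization never crosses the threshold (e.g. memory fills up
>> long before cores, or vice versa), preemption won't kick in at all, and if
>> the second job only ever asks for a handful of executors it won't look
>> starved enough for the scheduler to preempt on its behalf.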
>>
>> -Hamel
>>
>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> We have a YARN cluster with 352 nodes (10TB, 3000 cores) running the
>>> Fair Scheduler, with 230 queues under the root queue.
>>>
>>> Each queue is configured with maxResources equal to the total cluster
>>> resource. When a Spark job is submitted into queue A, it is given 10TB,
>>> 3000 cores according to the instantaneous fair share, and it holds the
>>> entire resource without releasing any of it. Some time later, when
>>> another job is submitted into queue B, that job only gets the fair share
>>> of 45GB and 13 cores, i.e. (10TB, 3000 cores) / 230, via preemption. If
>>> more jobs are submitted into queue B, all of the jobs in B have to share
>>> those 45GB and 13 cores, while the job in queue A keeps holding the
>>> entire cluster resource and affecting the other jobs.
>>>
>>> This kind of issue happens often: the Spark job submitted first ends up
>>> holding the entire cluster resource. What is the best way to fix this?
>>> Can we make preemption happen up to the instantaneous fair share instead
>>> of the steady-state fair share, and would that help?
>>>
>>> Note:
>>>
>>> 1. We do not want to give a higher weight to any particular queue,
>>> because all 230 queues are equally critical.
>>> 2. Restructuring the queues into a nested hierarchy does not solve the
>>> issue.
>>> 3. Adding a smaller maxResources to each queue would stop the first job
>>> from taking the entire cluster, but configuring an optimal maxResources
>>> for 230 queues is difficult, and the first job could then no longer use
>>> the whole cluster when it is otherwise idle.
>>> 4. We do not want to handle this in the Spark ApplicationMaster, because
>>> then we would have to do the same for every other YARN application type
>>> with similar behavior. We want YARN to control this behavior by
>>> reclaiming (killing) the resources held by the first job for too long.
>>>
>>> Thanks,
>>> Prabhu Joseph
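For what it's worth, once you can share the configs, the allocation file
(fair-scheduler.xml) is the first thing I'd look at. As a rough illustration
only, with placeholder queue names, sizes and timeouts rather than
recommendations, the shape I'd expect for your setup is something like:

    <?xml version="1.0"?>
    <allocations>
      <!-- Fair-share preemption only starts after a queue has been below its
           fair-share threshold for this many seconds; IIRC, if the timeout is
           never set it effectively defaults to "never preempt". -->
      <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
      <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>

      <queue name="queueA">
        <weight>1.0</weight>
        <!-- A tighter maxResources would stop the first job from grabbing the
             whole cluster, at the cost you describe in note 3. -->
        <maxResources>10485760 mb, 3000 vcores</maxResources>
      </queue>
      <queue name="queueB">
        <weight>1.0</weight>
        <maxResources>10485760 mb, 3000 vcores</maxResources>
      </queue>
      <!-- ... remaining queues ... -->
    </allocations>

If the fair-share preemption timeout (global or per-queue) is missing or very
large in your file, that alone could explain why queue B never gets pushed up
toward its instantaneous fair share.

-Hamel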