If all queues are identical, this behavior should not be happening.
Preemption as designed in the fair scheduler (IIRC) is based on the
instantaneous fair share, not the steady-state fair share. The fair
scheduler docs
<https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
aren't super helpful on this, but the Monitoring section does say that
resources won't be preempted from a queue that is below its instantaneous
fair share (which might imply that preemption would occur if a queue were
over its instantaneous fair share and someone else had requested
resources). The code for FairScheduler.resToPreempt
<http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29>
also seems to use getFairShare() rather than getSteadyFairShare() when
deciding what to preempt, which would imply it is working off the
instantaneous fair share rather than the steady-state one.
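
For reference, the fair-share branch of that method looks roughly like this
(this is a paraphrase from memory of the 2.7.1 internals, not a verbatim
excerpt, so treat the variable names and details as approximate):

  // Preempt on behalf of a starved queue only after it has been below its
  // *instantaneous* fair share for fairShareTimeout ms.
  Resource resDueToFairShare = Resources.none();
  if (curTime - sched.getLastTimeAtFairShareThreshold() > fairShareTimeout) {
    // The target is getFairShare() (instantaneous), not getSteadyFairShare().
    Resource target = Resources.min(calc, clusterResource,
        sched.getFairShare(), sched.getDemand());
    resDueToFairShare = Resources.max(calc, clusterResource,
        Resources.none(), Resources.subtract(target, sched.getResourceUsage()));
  }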

Could you share your YARN site/fair-scheduler and Spark configurations?
Could you also share the YARN Scheduler UI (specifically the top of the
RM page, which shows how many resources are in use)?

Since it's not likely to be due to the steady-state fair share, here are
some other possible reasons this might be happening (not remotely
conclusive, but with no information this is what comes to mind):
- You're not reaching
yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps due
to a core/memory ratio inconsistency with the cluster.
- Your second job doesn't have a sufficient level of parallelism to request
more executors than it is receiving (perhaps there are fewer than 13
tasks at any point in time) and you don't have
spark.dynamicAllocation.minExecutors set? See the example settings below.
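
For example (the values here are illustrative, not recommendations, and the
Spark lines assume you're using dynamic allocation with the external shuffle
service), something along these lines in yarn-site.xml and spark-defaults.conf
would let you rule those two out:

  <!-- yarn-site.xml: preemption only kicks in once overall cluster -->
  <!-- utilization crosses this threshold (default 0.8) -->
  <property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
    <value>0.8</value>
  </property>

  # spark-defaults.conf: make sure the second job actually requests executors
  # even when it has few tasks (13 here is just an illustrative floor)
  spark.dynamicAllocation.enabled       true
  spark.shuffle.service.enabled         true
  spark.dynamicAllocation.minExecutors  13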

-Hamel

On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:

> Hi All,
>
>  A YARN cluster with 352 nodes (10TB, 3000 cores) runs the Fair Scheduler,
> with 230 queues under the root queue.
>
>     Each queue is configured with maxResources equal to the total cluster
> resources. When a Spark job is submitted into queue A, it is given 10TB and
> 3000 cores according to its instantaneous fair share, and it holds the
> entire cluster without releasing anything. After some time, when another
> job is submitted into another queue B, that job gets a fair share of 45GB
> and 13 cores, i.e. (10TB, 3000 cores)/230, via preemption. Now if more jobs
> are submitted into queue B, all the jobs in B have to share the 45GB and 13
> cores, whereas the job in queue A holds the entire cluster's resources,
> affecting the other jobs.
>      This kind of issue often happens when a Spark job is submitted first
> and holds the entire cluster's resources. What is the best way to fix this
> issue? Can we make preemption happen based on the instantaneous fair share
> instead of the fair share, and would that help?
>
> Note:
>
> 1. We do not want to give extra weight to any particular queue, because all
> 230 queues are equally critical.
> 2. Restructuring the queues into nested queues does not solve the issue.
> 3. Adding maxResources to each queue would keep the first job from taking
> the entire cluster, but configuring an optimal maxResources for 230 queues
> is difficult, and the first job then can't use the whole cluster even when
> the cluster is idle.
> 4. We do not want to handle this in the Spark ApplicationMaster, since we
> would then need to do the same for every other YARN application type with
> similar behavior. We want YARN to control this by killing (preempting)
> containers that the first job has held for a long period.
>
>
> Thanks,
> Prabhu Joseph
>
>
