If all queues are identical, this behavior should not be happening. Preemption in the fair scheduler (IIRC) is driven by the instantaneous fair share, not the steady-state fair share. The fair scheduler docs <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html> aren't very explicit about this, but the Monitoring section does say that preemption won't take place against a queue that is below its instantaneous fair share (which implies it could occur if you were over your instantaneous fair share and someone else had requested resources). The code for FairScheduler.resToPreempt <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29> also appears to use getFairShare() rather than getSteadyFairShare() when computing the preemption target, which again points to instantaneous rather than steady-state fair share.
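To make that concrete, here is a rough, self-contained sketch of the decision I believe resToPreempt is making. None of these names are the real Hadoop API, and the numbers in main() are only illustrative (loosely based on the 3000-core / 230-queue example from your mail):

    // Hypothetical sketch only -- not the actual FairScheduler code or API.
    // It mirrors the shape of resToPreempt as I read it: the amount to preempt
    // on behalf of a starved queue is capped by its *instantaneous* fair share,
    // and the steady-state share is never consulted.
    public class PreemptionSketch {

      static long resToPreempt(long usedCores, long demandedCores,
                               long instantaneousFairShare, long steadyFairShare,
                               long starvedForMs, long fairSharePreemptionTimeoutMs) {
        if (starvedForMs < fairSharePreemptionTimeoutMs) {
          return 0; // queue hasn't been starved long enough, no preemption yet
        }
        // Target is the instantaneous fair share, capped by actual demand.
        // steadyFairShare is accepted but deliberately unused, to underline the point.
        long target = Math.min(instantaneousFairShare, demandedCores);
        // Preempt only the gap between the target and what is already used.
        return Math.max(0, target - usedCores);
      }

      public static void main(String[] args) {
        // With only 2 of the 230 queues active on a 3000-core cluster, queue B's
        // instantaneous fair share is ~1500 cores, while its steady share is ~13.
        // If B demands 500 cores, holds 13, and has been starved past the timeout,
        // this returns 487 cores -- far more than the 13 you are seeing.
        System.out.println(resToPreempt(13, 500, 1500, 13, 60_000, 30_000));
      }
    }

If preemption really works off the instantaneous share, queue B should be able to claw back far more than 45GB / 13 cores once it has unsatisfied demand, which is why your actual configuration would help narrow this down.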
Could you share your YARN site/fair-scheduler and Spark configurations? Could you also share the YARN Scheduler UI (specifically the top of the RM page, which shows how many resources are in use)? Since this is not likely due to steady-state fair share, here are some other possible reasons it might be happening (not remotely conclusive, but with no further information this is what comes to mind; the relevant knobs are sketched, purely as an illustration, after the quoted message below):

- You're not reaching yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps due to a core/memory ratio that is inconsistent with the cluster.
- Your second job doesn't have a sufficient level of parallelism to request more executors than it is receiving (perhaps there are fewer than 13 tasks at any point in time) and you don't have spark.dynamicAllocation.minExecutors set.

- Hamel

On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Hi All,
>
> A YARN cluster with 352 nodes (10TB, 3000 cores) has the Fair Scheduler
> with a root queue containing 230 queues.
>
> Each queue is configured with maxResources equal to the total cluster
> resources. When a Spark job is submitted into a queue A, it is given
> 10TB, 3000 cores according to the instantaneous fair share, and it holds
> the entire resource without releasing it. Some time later, when another
> job is submitted into a queue B, it gets the fair share of 45GB and 13
> cores, i.e. (10TB, 3000 cores)/230, using preemption. Now if more jobs
> are submitted into queue B, all the jobs in B have to share the 45GB and
> 13 cores, whereas the job in queue A holds the entire cluster resource,
> affecting the other jobs.
> This kind of issue often happens when the Spark job submitted first
> holds the entire cluster resource. What is the best way to fix this
> issue? Can we make preemption happen based on the instantaneous fair
> share instead of the fair share? Will that help?
>
> Note:
>
> 1. We do not want to give a weight to a particular queue, because all 230
> queues are critical.
> 2. Changing the queues into nested queues does not solve the issue.
> 3. Adding maxResources to a queue won't allow the first job to take the
> entire cluster resource, but configuring an optimal maxResources for 230
> queues is difficult, and the first job then can't use the entire cluster
> resource when the cluster is idle.
> 4. We do not want to handle it in the Spark ApplicationMaster; then we
> would need to check for other YARN application types with similar
> behavior. We want YARN to control this by reclaiming the resources held
> by the first job for a long period.
>
>
> Thanks,
> Prabhu Joseph
>
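P.S. Purely for reference, the knobs mentioned above live in yarn-site.xml, the fair-scheduler allocation file, and the Spark conf. The values below are placeholders/defaults for illustration, not a recommendation:

    <!-- yarn-site.xml -->
    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>
    <property>
      <!-- Preemption only kicks in once overall utilization exceeds this fraction. -->
      <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
      <value>0.8</value>
    </property>

    <!-- fair-scheduler.xml (allocation file): how long a queue may sit below its
         fair share before preemption is triggered on its behalf. -->
    <allocations>
      <defaultFairSharePreemptionTimeout>30</defaultFairSharePreemptionTimeout>
      <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>
    </allocations>

    # spark-defaults.conf
    spark.dynamicAllocation.enabled        true
    spark.shuffle.service.enabled          true
    spark.dynamicAllocation.minExecutors   2

Again, these only show where to look; your actual values, together with the scheduler UI, are what would tell us which of the possibilities above is really happening.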