The instantaneous fair share is what queue B should get, according to the code
(and my experience). Assuming your queues all have equal weight, it would be
10TB / 2.
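To make the distinction concrete, here is a quick back-of-the-envelope sketch
of the two numbers (plain Java, just the arithmetic rather than the actual
FairScheduler code; the class name is mine, and it assumes equal weights, 230
configured queues and only 2 of them with apps running or pending):

    // Rough sketch only, not FairScheduler itself. Steady-state fair share
    // divides the cluster across all configured queues; instantaneous fair
    // share divides it across the currently active queues only.
    public class FairShareSketch {
        public static void main(String[] args) {
            final double clusterMemoryTb = 10.0; // total cluster memory
            final int configuredQueues = 230;    // all leaf queues under root
            final int activeQueues = 2;          // queues with running/pending apps

            double steadyStateTb = clusterMemoryTb / configuredQueues; // ~0.043 TB (~45 GB)
            double instantaneousTb = clusterMemoryTb / activeQueues;   // 5 TB

            System.out.printf("steady-state: %.3f TB, instantaneous: %.1f TB%n",
                    steadyStateTb, instantaneousTb);
        }
    }

Since resToPreempt works off the instantaneous number, I'd expect preemption to
push queue B up toward roughly 5TB / 1500 cores, not the ~45GB / 13 cores you
are seeing.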
I can't help much more unless I can see your config files and, ideally, the
YARN Scheduler UI, to get an idea of what your queues and actual resource
usage look like. Logs from each of your Spark applications would also be
useful. Basically, the more info the better.

On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:

> Hi Hamel,
>
> Thanks for looking into the issue. What I am not understanding is: after
> preemption, what share does the second queue get if the first queue holds
> the entire cluster resource without releasing it? Is it the instantaneous
> fair share or the (steady-state) fair share?
>
> We have queues A and B (230 queues in total), and the total cluster
> resource is 10TB, 3000 cores. If a job is submitted into queue A, it gets
> 10TB, 3000 cores and does not release any of it. Now if a second job is
> submitted into queue B, preemption will definitely happen, but what share
> will queue B get after preemption? *Is it <10TB, 3000> / 2 or
> <10TB, 3000> / 230?*
>
> We find that after preemption queue B gets only <10TB, 3000> / 230,
> because the first job is still holding the resources. If the first job
> releases its resources, the second queue gets <10TB, 3000> / 2 based on
> higher priority and reservation.
>
> The question is: how much does preemption try to take back from queue A
> when it holds the entire cluster resource without releasing it? We are not
> able to share the actual configuration, but the answer to this question
> will help us.
>
> Thanks,
> Prabhu Joseph
>
> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com>
> wrote:
>
>> If all queues are identical, this behavior should not be happening.
>> Preemption as designed in the Fair Scheduler (IIRC) takes place based on
>> the instantaneous fair share, not the steady-state fair share. The Fair
>> Scheduler docs
>> <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
>> aren't super helpful on this, but the Monitoring section does say that
>> preemption won't take place if you are below your instantaneous fair
>> share (which might imply that it does occur once you are over your
>> instantaneous fair share and someone else has requested resources). The
>> code for FairScheduler.resToPreempt
>> <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29>
>> also seems to use getFairShare() rather than getSteadyFairShare() for
>> preemption, which would imply that it uses the instantaneous fair share
>> rather than the steady-state one.
>>
>> Could you share your YARN site/fair-scheduler and Spark configurations?
>> Could you also share the YARN Scheduler UI (specifically the top of the
>> RM page, which shows how many resources are in use)?
>>
>> Since it's not likely due to the steady-state fair share, some other
>> possible reasons why this might be happening (not remotely conclusive,
>> but with no other information this is what comes to mind; see the
>> snippets just after the list):
>> - You're not reaching
>> yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps
>> because of a core/memory ratio inconsistency with the cluster.
>> - Your second job doesn't have a sufficient level of parallelism to
>> request more executors than it is receiving (perhaps there are fewer than
>> 13 tasks at any point in time) and you don't have
>> spark.dynamicAllocation.minExecutors set.
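>> For reference, these are the settings I mean; the values below are only
>> placeholders for illustration, so check what your cluster actually has
>> rather than copying them. In yarn-site.xml:
>>
>>   <property>
>>     <name>yarn.scheduler.fair.preemption</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
>>     <value>0.8</value>
>>   </property>
>>
>> and in spark-defaults.conf (or via --conf) for the second job:
>>
>>   spark.dynamicAllocation.enabled       true
>>   spark.dynamicAllocation.minExecutors  16
>>
>> If overall utilization never crosses the threshold (e.g. memory fills up
>> long before cores, or vice versa), preemption won't kick in at all, and if
>> the second job only ever asks for a handful of executors it won't look
>> starved enough for the scheduler to preempt on its behalf.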
>>
>> -Hamel
>>
>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> We have a YARN cluster with 352 nodes (10TB, 3000 cores) running the
>>> Fair Scheduler, with 230 queues under the root queue.
>>>
>>> Each queue is configured with maxResources equal to the total cluster
>>> resource. When a Spark job is submitted into queue A, it is given 10TB,
>>> 3000 cores according to the instantaneous fair share, and it holds the
>>> entire resource without releasing any of it. Some time later, when
>>> another job is submitted into queue B, that job only gets the fair share
>>> of 45GB and 13 cores, i.e. (10TB, 3000 cores) / 230, via preemption. If
>>> more jobs are submitted into queue B, all of the jobs in B have to share
>>> those 45GB and 13 cores, while the job in queue A keeps holding the
>>> entire cluster resource and affecting the other jobs.
>>>
>>> This kind of issue happens often: the Spark job submitted first ends up
>>> holding the entire cluster resource. What is the best way to fix this?
>>> Can we make preemption happen up to the instantaneous fair share instead
>>> of the steady-state fair share, and would that help?
>>>
>>> Note:
>>>
>>> 1. We do not want to give a higher weight to any particular queue,
>>> because all 230 queues are equally critical.
>>> 2. Restructuring the queues into a nested hierarchy does not solve the
>>> issue.
>>> 3. Adding a smaller maxResources to each queue would stop the first job
>>> from taking the entire cluster, but configuring an optimal maxResources
>>> for 230 queues is difficult, and the first job could then no longer use
>>> the whole cluster when it is otherwise idle.
>>> 4. We do not want to handle this in the Spark ApplicationMaster, because
>>> then we would have to do the same for every other YARN application type
>>> with similar behavior. We want YARN to control this behavior by
>>> reclaiming (killing) the resources held by the first job for too long.
>>>
>>> Thanks,
>>> Prabhu Joseph
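For what it's worth, once you can share the configs, the allocation file
(fair-scheduler.xml) is the first thing I'd look at. As a rough illustration
only, with placeholder queue names, sizes and timeouts rather than
recommendations, the shape I'd expect for your setup is something like:

    <?xml version="1.0"?>
    <allocations>
      <!-- Fair-share preemption only starts after a queue has been below its
           fair-share threshold for this many seconds; IIRC, if the timeout is
           never set it effectively defaults to "never preempt". -->
      <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
      <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>

      <queue name="queueA">
        <weight>1.0</weight>
        <!-- A tighter maxResources would stop the first job from grabbing the
             whole cluster, at the cost you describe in note 3. -->
        <maxResources>10485760 mb, 3000 vcores</maxResources>
      </queue>
      <queue name="queueB">
        <weight>1.0</weight>
        <maxResources>10485760 mb, 3000 vcores</maxResources>
      </queue>
      <!-- ... remaining queues ... -->
    </allocations>

If the fair-share preemption timeout (global or per-queue) is missing or very
large in your file, that alone could explain why queue B never gets pushed up
toward its instantaneous fair share.

-Hamel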