You are right, Hamel. It should get 10 TB / 2. And in hadoop-2.7.0, it works fine. But in hadoop-2.5.1, it gets only 10 TB / 230, with the same configuration used in both versions. So I think a JIRA could have fixed the issue after hadoop-2.5.1.
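For reference, here is the arithmetic behind the two candidate shares, written out as a small sketch. It assumes 230 equally weighted queues with only 2 of them running applications; the class name and the rounded figures are illustrative, not taken from the actual cluster.

    // Illustrative arithmetic only: the two divisors being debated in this thread.
    public class FairShareArithmetic {
      public static void main(String[] args) {
        int clusterMemoryGb = 10 * 1024; // roughly 10 TB expressed in GB
        int clusterCores    = 3000;

        // Steady-state fair share: cluster divided across all 230 configured queues.
        System.out.printf("steady-state fair share : ~%d GB, ~%d cores%n",
            clusterMemoryGb / 230, clusterCores / 230); // ~44 GB, ~13 cores

        // Instantaneous fair share: cluster divided across the 2 queues with demand.
        System.out.printf("instantaneous fair share: %d GB, %d cores%n",
            clusterMemoryGb / 2, clusterCores / 2);     // 5120 GB, 1500 cores
      }
    }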
On Thu, Feb 25, 2016 at 1:28 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote:

> The instantaneous fair share is what Queue B should get according to the code (and my experience). Assuming your queues are all equal, it would be 10TB/2.
>
> I can't help much more unless I can see your config files and ideally also the YARN Scheduler UI to get an idea of what your queues/actual resource usage is like. Logs from each of your Spark applications would also be useful. Basically, the more info the better.
>
> On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
>> Hi Hamel,
>>
>> Thanks for looking into the issue. What I am not understanding is: after preemption, what is the share that the second queue gets if the first queue holds the entire cluster resource without releasing it? Is it the instantaneous fair share or the fair share?
>>
>> There are queues A and B (230 queues in total), and the total cluster resource is 10 TB, 3000 cores. If a job is submitted into queue A, it will get 10 TB, 3000 cores and will not release any resource. Now if a second job is submitted into queue B, preemption will definitely happen, but what is the share queue B will get after preemption? *Is it <10 TB, 3000> / 2 or <10 TB, 3000> / 230?*
>>
>> We find that after preemption, queue B gets only <10 TB, 3000> / 230, because the first job is holding the resources. If the first job releases the resources, the second queue will get <10 TB, 3000> / 2 based on higher priority and reservation.
>>
>> The question is how much preemption tries to preempt from queue A if it holds the entire resource without releasing it. I am not able to share the actual configuration, but the answer to this question will help us.
>>
>> Thanks,
>> Prabhu Joseph
>>
>> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com> wrote:
>>
>>> If all queues are identical, this behavior should not be happening. Preemption as designed in the fair scheduler (IIRC) takes place based on the instantaneous fair share, not the steady-state fair share. The fair scheduler docs <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html> aren't super helpful on this, but the Monitoring section does say that preemption won't take place if you're below your instantaneous fair share (which might imply that it would occur if you were over your instantaneous fair share and someone had requested resources). The code for FairScheduler.resToPreempt <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29> also seems to use getFairShare() rather than getSteadyFairShare() for preemption, which would imply that it uses the instantaneous fair share rather than the steady-state one.
>>>
>>> Could you share your YARN site/fair-scheduler and Spark configurations? Could you also share the YARN Scheduler UI (specifically the top of the RM page, which shows how many resources are in use)?
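For concreteness, here is a minimal, single-resource sketch of the preemption-target idea described above. It is only a paraphrase, not the actual FairScheduler.resToPreempt code: the real method works on Resource objects and also honours the min-share preemption timeout, and every name below apart from the two getters already mentioned is invented for illustration.

    // A much-simplified, single-resource paraphrase of the preemption-target
    // calculation. NOT the real Hadoop code; field and method names here are
    // invented, and only the contrast between the instantaneous and the
    // steady-state fair share is meant to be faithful.
    class PreemptionTargetSketch {

      long instantaneousFairShareMb; // cluster divided among queues with active demand
      long steadyStateFairShareMb;   // cluster divided among all configured queues

      /** Roughly: memory (MB) a starved queue may reclaim via preemption. */
      long resToPreempt(long usageMb, long demandMb, boolean fairShareTimeoutExpired) {
        if (!fairShareTimeoutExpired) {
          return 0L; // nothing is preempted before the fair-share preemption timeout
        }
        // The 2.7.1 code linked above appears to consult the instantaneous fair
        // share here; with 2 active queues on a 10 TB cluster that target is
        // about 5 TB, not 10 TB / 230.
        long targetMb = Math.min(instantaneousFairShareMb, demandMb);
        return Math.max(0L, targetMb - usageMb);
      }
    }

The only point of the sketch is which fair share the target is derived from: if a scheduler derived it from the steady-state share instead, the starved queue's target would be roughly 1/230 of the cluster.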
>>> Since it's not likely due to the steady-state fair share, some other possible reasons why this might be taking place (this is not remotely conclusive, but with no information this is what comes to mind):
>>>
>>> - You're not reaching yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps due to a core/memory ratio inconsistency with the cluster.
>>> - Your second job doesn't have a sufficient level of parallelism to request more executors than what it is receiving (perhaps there are fewer than 13 tasks at any point in time) and you don't have spark.dynamicAllocation.minExecutors set.
>>>
>>> -Hamel
>>>
>>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We have a YARN cluster with 352 nodes (10 TB, 3000 cores) running the Fair Scheduler, with 230 queues under the root queue.
>>>>
>>>> Each queue is configured with maxResources equal to the total cluster resource. When a Spark job is submitted into queue A, it is given 10 TB, 3000 cores according to the instantaneous fair share, and it holds the entire resource without releasing it. After some time, when another job is submitted into queue B, it gets only the fair share of 45 GB and 13 cores, i.e. (10 TB, 3000 cores)/230, through preemption. Now if more jobs are submitted into queue B, all the jobs in B have to share the 45 GB and 13 cores, whereas the job in queue A holds the entire cluster resource, affecting the other jobs. This kind of issue often happens when a Spark job is submitted first and holds the entire cluster resource. What is the best way to fix this issue? Can we make preemption happen based on the instantaneous fair share instead of the fair share, and will that help?
>>>>
>>>> Note:
>>>>
>>>> 1. We do not want to give a higher weight to any particular queue, because all 230 queues are critical.
>>>> 2. Changing the queues into nested queues does not solve the issue.
>>>> 3. Adding maxResources to each queue won't allow the first job to take the entire cluster resource, but configuring the optimal maxResources for 230 queues is difficult, and the first job then can't use the entire cluster resource when the cluster is idle.
>>>> 4. We do not want to handle it in the Spark ApplicationMaster, because then we would need to do the same for every other new YARN application type with similar behavior. We want YARN to control this behavior by preempting the resources held by the first job for a longer period.
>>>>
>>>> Thanks,
>>>> Prabhu Joseph
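For readers who want to check the knobs mentioned in this thread: the preemption-related settings live in yarn-site.xml (per-queue limits such as maxResources live in fair-scheduler.xml). The snippet below is only an illustrative sketch with placeholder values, not the configuration of the cluster discussed above.

    <!-- yarn-site.xml (illustrative values only) -->
    <property>
      <!-- Fair Scheduler preemption is off unless this is enabled -->
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>
    <property>
      <!-- Preemption only kicks in once overall cluster utilization
           crosses this fraction -->
      <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
      <value>0.8</value>
    </property>

On the Spark side, spark.dynamicAllocation.minExecutors is the setting that guarantees a floor of executors when a job's parallelism is momentarily low, which is the second possibility raised above.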