YARN-2026 has fixed the issue.
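
For anyone who wants to verify which share a queue actually ends up with after preemption, the ResourceManager scheduler REST endpoint reports the fair shares per queue. Below is a rough sketch of reading it; the RM address is a placeholder, and the fairResources / steadyFairResources field names are assumptions taken from recent Fair Scheduler releases, so check them against your Hadoop version:

    import json
    import urllib.request

    # Placeholder ResourceManager address -- replace with your RM host:port.
    RM = "http://resourcemanager:8088"

    # Fetch the scheduler view from the RM REST API.
    with urllib.request.urlopen(RM + "/ws/v1/cluster/scheduler") as resp:
        scheduler = json.load(resp)

    def queues(node):
        """Recursively yield every dict in the response that looks like a queue."""
        if isinstance(node, dict):
            if "queueName" in node:
                yield node
            for value in node.values():
                yield from queues(value)
        elif isinstance(node, list):
            for item in node:
                yield from queues(item)

    for queue in queues(scheduler):
        # fairResources / steadyFairResources are assumed field names; older
        # releases may not report the steady-state share at all.
        print(queue.get("queueName"),
              "instantaneous:", queue.get("fairResources"),
              "steady:", queue.get("steadyFairResources"))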

On Thu, Feb 25, 2016 at 4:17 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> You are right, Hamel. It should get 10 TB / 2, and in hadoop-2.7.0 it works fine. But in hadoop-2.5.1 it gets only 10 TB / 230, with the same configuration used in both versions. So I think a JIRA fixed the issue somewhere after hadoop-2.5.1.
>
> On Thu, Feb 25, 2016 at 1:28 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote:
>
>> The instantaneous fair share is what Queue B should get according to the code (and my experience). Assuming your queues are all equal, it would be 10TB/2.
>>
>> I can't help much more unless I can see your config files and ideally also the YARN Scheduler UI to get an idea of what your queues/actual resource usage look like. Logs from each of your Spark applications would also be useful. Basically, the more info the better.
>>
>> On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi Hamel,
>>>
>>> Thanks for looking into the issue. What I am not understanding is: after preemption, what share does the second queue get if the first queue holds the entire cluster resource without releasing it -- the instantaneous fair share or the steady-state fair share?
>>>
>>> There are two queues A and B (230 queues in total), and the total cluster resource is 10TB, 3000 cores. If a job is submitted into queue A, it gets 10TB, 3000 cores and does not release any of it. If a second job is then submitted into queue B, preemption will definitely happen, but what share does queue B get after preemption? *Is it <10 TB, 3000> / 2 or <10TB, 3000> / 230?*
>>>
>>> We find that after preemption queue B gets only <10TB, 3000> / 230, because the first job is holding the resource. If the first job releases the resource, the second queue gets <10TB, 3000> / 2 based on higher priority and reservation.
>>>
>>> The question is how much preemption tries to take from queue A when it holds the entire resource without releasing it. We are not able to share the actual configuration, but the answer to this question will help us.
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com> wrote:
>>>
>>>> If all queues are identical, this behavior should not be happening. Preemption as designed in the fair scheduler (IIRC) takes place based on the instantaneous fair share, not the steady-state fair share. The fair scheduler docs <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html> aren't super helpful on this, but the Monitoring section does say that preemption won't take place if you're below your instantaneous fair share (which might imply that it would occur if you were over your instantaneous fair share and someone had requested resources). The code for FairScheduler.resToPreempt <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29> also seems to use getFairShare rather than getSteadyFairShare() for preemption, which would imply that it uses the instantaneous fair share rather than the steady-state one.
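
To make the two candidate shares concrete with the numbers from this thread (assuming all 230 queues have equal weight and only two of them are active):

    instantaneous fair share (2 active queues):  10 TB / 2   = 5 TB,   3000 / 2   = 1500 cores
    steady-state fair share (all 230 queues):    10 TB / 230 ≈ 45 GB,  3000 / 230 ≈ 13 cores

The second line matches the 45 GB / 13 cores that queue B was observed to receive on hadoop-2.5.1, while hadoop-2.7.0 gives the first.
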
>>>> Could you share your YARN site/fair-scheduler and Spark configurations? Could you also share the YARN Scheduler UI (specifically the top of the RM page, which shows how many resources are in use)?
>>>>
>>>> Since it's not likely due to steady-state fair share, some other possible reasons why this might be taking place (this is not remotely conclusive, but with no information this is what comes to mind):
>>>> - You're not reaching yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps due to a core/memory ratio inconsistency with the cluster.
>>>> - Your second job doesn't have a sufficient level of parallelism to request more executors than what it is receiving (perhaps there are fewer than 13 tasks at any point in time) and you don't have spark.dynamicAllocation.minExecutors set?
>>>>
>>>> -Hamel
>>>>
>>>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> A YARN cluster with 352 nodes (10TB, 3000 cores) runs the Fair Scheduler with 230 queues under the root queue.
>>>>>
>>>>> Each queue is configured with maxResources equal to the total cluster resource. When a Spark job is submitted into queue A, it is given 10TB, 3000 cores according to the instantaneous fair share, and it holds the entire resource without releasing it. After some time, when another job is submitted into queue B, it gets only the fair share of 45GB and 13 cores, i.e. (10TB, 3000 cores) / 230, via preemption. If more jobs are then submitted into queue B, all the jobs in B have to share the 45GB and 13 cores, while the job in queue A holds the entire cluster resource and affects the other jobs. This kind of issue happens whenever the Spark job submitted first grabs and holds the entire cluster resource. What is the best way to fix this? Can we make preemption target the instantaneous fair share instead of the steady-state fair share, and would that help?
>>>>>
>>>>> Note:
>>>>>
>>>>> 1. We do not want to give extra weight to any particular queue, because all 230 queues are equally critical.
>>>>> 2. Restructuring the queues into nested queues does not solve the issue.
>>>>> 3. Setting a smaller maxResources on each queue would stop the first job from taking the entire cluster, but configuring an optimal maxResources for 230 queues is difficult, and the first job would then be unable to use the whole cluster even when it is idle.
>>>>> 4. We do not want to handle this in the Spark ApplicationMaster, because we would then have to do the same for every other YARN application type with similar behavior. We want YARN to control this by preempting resources that the first job has held for a long period.
>>>>>
>>>>> Thanks,
>>>>> Prabhu Joseph
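
For reference, a minimal sketch of the preemption-related settings touched on in this thread. The values are illustrative placeholders rather than the actual configuration of the cluster discussed above, and the per-queue elements vary between Fair Scheduler releases, so check them against your version:

    yarn-site.xml:

        <property>
          <name>yarn.scheduler.fair.preemption</name>
          <value>true</value>
        </property>
        <property>
          <!-- Preemption only kicks in once overall cluster utilization passes this fraction. -->
          <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
          <value>0.8</value>
        </property>

    fair-scheduler.xml (one of the 230 queues):

        <allocations>
          <queue name="queueB">
            <weight>1.0</weight>
            <!-- Illustrative cap; the thread notes that sizing this for 230 queues is hard. -->
            <maxResources>10240000 mb,3000 vcores</maxResources>
          </queue>
          <!-- Seconds a queue may sit below its fair share before it preempts other queues. -->
          <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
        </allocations>

On the Spark side, spark.dynamicAllocation.minExecutors (mentioned above) can be set so that the second job keeps a baseline number of executors requested even when its parallelism is temporarily low.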