YARN-2026 has fixed the issue.
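
For anyone who wants to verify which share a queue actually ends up with after preemption, the ResourceManager scheduler REST endpoint reports the fair shares per queue. Below is a rough sketch of reading it; the RM address is a placeholder, and the fairResources / steadyFairResources field names are assumptions taken from recent Fair Scheduler releases, so check them against your Hadoop version:

    import json
    import urllib.request

    # Placeholder ResourceManager address -- replace with your RM host:port.
    RM = "http://resourcemanager:8088"

    # Fetch the scheduler view from the RM REST API.
    with urllib.request.urlopen(RM + "/ws/v1/cluster/scheduler") as resp:
        scheduler = json.load(resp)

    def queues(node):
        """Recursively yield every dict in the response that looks like a queue."""
        if isinstance(node, dict):
            if "queueName" in node:
                yield node
            for value in node.values():
                yield from queues(value)
        elif isinstance(node, list):
            for item in node:
                yield from queues(item)

    for queue in queues(scheduler):
        # fairResources / steadyFairResources are assumed field names; older
        # releases may not report the steady-state share at all.
        print(queue.get("queueName"),
              "instantaneous:", queue.get("fairResources"),
              "steady:", queue.get("steadyFairResources"))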

On Thu, Feb 25, 2016 at 4:17 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> You are right, Hamel. It should get 10 TB / 2, and in hadoop-2.7.0 it works fine. But in hadoop-2.5.1 it gets only 10 TB / 230, with the same configuration used in both versions. So I think a JIRA fixed the issue somewhere after hadoop-2.5.1.
>
> On Thu, Feb 25, 2016 at 1:28 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote:
>
>> The instantaneous fair share is what Queue B should get according to the code (and my experience). Assuming your queues are all equal, it would be 10TB/2.
>>
>> I can't help much more unless I can see your config files and ideally also the YARN Scheduler UI to get an idea of what your queues/actual resource usage look like. Logs from each of your Spark applications would also be useful. Basically, the more info the better.
>>
>> On Wed, Feb 24, 2016 at 2:52 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi Hamel,
>>>
>>> Thanks for looking into the issue. What I am not understanding is: after preemption, what share does the second queue get if the first queue holds the entire cluster resource without releasing it -- the instantaneous fair share or the steady-state fair share?
>>>
>>> There are two queues A and B (230 queues in total), and the total cluster resource is 10TB, 3000 cores. If a job is submitted into queue A, it gets 10TB, 3000 cores and does not release any of it. If a second job is then submitted into queue B, preemption will definitely happen, but what share does queue B get after preemption? *Is it <10 TB, 3000> / 2 or <10TB, 3000> / 230?*
>>>
>>> We find that after preemption queue B gets only <10TB, 3000> / 230, because the first job is holding the resource. If the first job releases the resource, the second queue gets <10TB, 3000> / 2 based on higher priority and reservation.
>>>
>>> The question is how much preemption tries to take from queue A when it holds the entire resource without releasing it. We are not able to share the actual configuration, but the answer to this question will help us.
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>> On Wed, Feb 24, 2016 at 10:03 PM, Hamel Kothari <hamelkoth...@gmail.com> wrote:
>>>
>>>> If all queues are identical, this behavior should not be happening. Preemption as designed in the fair scheduler (IIRC) takes place based on the instantaneous fair share, not the steady-state fair share. The fair scheduler docs <https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html> aren't super helpful on this, but the Monitoring section does say that preemption won't take place if you're below your instantaneous fair share (which might imply that it would occur if you were over your instantaneous fair share and someone had requested resources). The code for FairScheduler.resToPreempt <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.7.1/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#FairScheduler.resToPreempt%28org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue%2Clong%29> also seems to use getFairShare rather than getSteadyFairShare() for preemption, which would imply that it uses the instantaneous fair share rather than the steady-state one.
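
To make the two candidate shares concrete with the numbers from this thread (assuming all 230 queues have equal weight and only two of them are active):

    instantaneous fair share (2 active queues):  10 TB / 2   = 5 TB,   3000 / 2   = 1500 cores
    steady-state fair share (all 230 queues):    10 TB / 230 ≈ 45 GB,  3000 / 230 ≈ 13 cores

The second line matches the 45 GB / 13 cores that queue B was observed to receive on hadoop-2.5.1, while hadoop-2.7.0 gives the first.
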
>>>> Could you share your YARN site/fair-scheduler and Spark configurations? Could you also share the YARN Scheduler UI (specifically the top of the RM page, which shows how many resources are in use)?
>>>>
>>>> Since it's not likely due to steady-state fair share, some other possible reasons why this might be taking place (this is not remotely conclusive, but with no information this is what comes to mind):
>>>> - You're not reaching yarn.scheduler.fair.preemption.cluster-utilization-threshold, perhaps due to a core/memory ratio inconsistency with the cluster.
>>>> - Your second job doesn't have a sufficient level of parallelism to request more executors than what it is receiving (perhaps there are fewer than 13 tasks at any point in time) and you don't have spark.dynamicAllocation.minExecutors set?
>>>>
>>>> -Hamel
>>>>
>>>> On Tue, Feb 23, 2016 at 8:20 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> A YARN cluster with 352 nodes (10TB, 3000 cores) runs the Fair Scheduler with 230 queues under the root queue.
>>>>>
>>>>> Each queue is configured with maxResources equal to the total cluster resource. When a Spark job is submitted into queue A, it is given 10TB, 3000 cores according to the instantaneous fair share, and it holds the entire resource without releasing it. After some time, when another job is submitted into queue B, it gets only the fair share of 45GB and 13 cores, i.e. (10TB, 3000 cores) / 230, via preemption. If more jobs are then submitted into queue B, all the jobs in B have to share the 45GB and 13 cores, while the job in queue A holds the entire cluster resource and affects the other jobs. This kind of issue happens whenever the Spark job submitted first grabs and holds the entire cluster resource. What is the best way to fix this? Can we make preemption target the instantaneous fair share instead of the steady-state fair share, and would that help?
>>>>>
>>>>> Note:
>>>>>
>>>>> 1. We do not want to give extra weight to any particular queue, because all 230 queues are equally critical.
>>>>> 2. Restructuring the queues into nested queues does not solve the issue.
>>>>> 3. Setting a smaller maxResources on each queue would stop the first job from taking the entire cluster, but configuring an optimal maxResources for 230 queues is difficult, and the first job would then be unable to use the whole cluster even when it is idle.
>>>>> 4. We do not want to handle this in the Spark ApplicationMaster, because we would then have to do the same for every other YARN application type with similar behavior. We want YARN to control this by preempting resources that the first job has held for a long period.
>>>>>
>>>>> Thanks,
>>>>> Prabhu Joseph
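
For reference, a minimal sketch of the preemption-related settings touched on in this thread. The values are illustrative placeholders rather than the actual configuration of the cluster discussed above, and the per-queue elements vary between Fair Scheduler releases, so check them against your version:

    yarn-site.xml:

        <property>
          <name>yarn.scheduler.fair.preemption</name>
          <value>true</value>
        </property>
        <property>
          <!-- Preemption only kicks in once overall cluster utilization passes this fraction. -->
          <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
          <value>0.8</value>
        </property>

    fair-scheduler.xml (one of the 230 queues):

        <allocations>
          <queue name="queueB">
            <weight>1.0</weight>
            <!-- Illustrative cap; the thread notes that sizing this for 230 queues is hard. -->
            <maxResources>10240000 mb,3000 vcores</maxResources>
          </queue>
          <!-- Seconds a queue may sit below its fair share before it preempts other queues. -->
          <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
        </allocations>

On the Spark side, spark.dynamicAllocation.minExecutors (mentioned above) can be set so that the second job keeps a baseline number of executors requested even when its parallelism is temporarily low.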