[
https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan resolved YARN-11702.
-------------------------------
Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
Resolution: Fixed
> Fix Yarn over allocating containers
> -----------------------------------
>
> Key: YARN-11702
> URL: https://issues.apache.org/jira/browse/YARN-11702
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, fairscheduler, scheduler, yarn
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> *Replication Steps:*
> Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
>
> {code:java}
> spark.executor.memory 1024M
> spark.driver.memory 2048M
> spark.executor.cores 1
> spark.executor.instances 20
> spark.dynamicAllocation.enabled false{code}
>
> Based on the setup, there should be 20 spark executors, but from the
> ResourceManager (RM) UI, i could see that 32 executors were allocated and 12
> of them were released in seconds. On analyzing the Spark ApplicationMaster
> (AM) logs, The following logs were observed.
>
> {code:java}
> 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s)
> for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with
> custom resources: <memory:1408, max memory:2147483647, vCores:1, max
> vCores:2147483647>
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN,
> launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN,
> launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN,
> launching executors on 4 of them.
> 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN,
> launching executors on 0 of them.
> {code}
> It was clear for the logs that extra allocated 12 containers are being
> ignored from Spark side. Inorder to debug this further, additional log lines
> were added to
> [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
> class in increment and decrement of container request to expose additional
> information about the request.
>
> {code:java}
> 2024-06-24 14:10:14,075 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0
> Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0,
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,077 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,077 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,111 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,112 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,112 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,113 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,327 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,328 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,362 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,363 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,364 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4
> Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0,
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,454 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0
> Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0,
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,578 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,579 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,614 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,615 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,829 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,829 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,864 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:14,864 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0
> Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0,
> containerToUpdate=null} for: appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,080 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,081 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,115 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:15,115 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Allocate Updates
> PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0,
> allocationRequestId=0, containerToUpdate=null} for:
> appattempt_1719234929152_0004_000001
> 2024-06-24 14:10:17,931 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (IPC Server handler 3 on default port 8030): checking for deactivate of
> application :application_1719234929152_0004
> 2024-06-24 14:10:20,743 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
> (SchedulerEventDispatcher:Event Processor): Application
> application_1719234929152_0004 requests cleared {code}
> *RCA*
> It looks like the issue is happening due to asynchronous nature of AMRMClient
> response communication. The newly allocated containers from Yarn is send as
> part of the next request from the client causing additional additional.
> Please check the following example. As per the above mentioned logs
> # At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn,
> since this is the first request, yarn had not allocated any containers and
> send 0 containers as response.
> # At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn
> which suggests that before this time there was an empty request ask to Yarn
> which had returned 8 containers as allocated in reponse. For this request was
> well the Yarn returned 8 container as allocated in response.
> # At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn
> for which it returns 12 containers allocated (based on previous request) as
> response.
> # Since all the container request is exhausted, AM sends empty request to
> yarn which returns 4 containers allocated (based on previous request)
> # So in total 32 containers were allocated for 20 container request. The
> problem here is AM (client) is being notified about the allocated container
> request in the next heartbeat/request causing inconsistency.
>
> *Solutions*
> *1. Modify the AM request in the Yarn/RM side*
> Normalize/update the AM container request to consider the allocated
> containers. This can be done in the respective scheduler side i.e
> newContainerAsk = AMContainerAsk - AllocatedContainers.
> *2. Make AM aware of allocated container*
> Before client making allocate request, Check AM should check with RM about
> container allocated. This approach requires additional AM-RM client
> communication which can be expense when there are large number of allocate
> requests.
>
> *PS: I could see similar behavior with Apache Tez and could happen with the
> latest hadoop-3.4.0 version as well.*
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]