Syed Shameerur Rahman created YARN-11702:
--------------------------------------------
Summary: Fix Yarn over allocating containers
Key: YARN-11702
URL: https://issues.apache.org/jira/browse/YARN-11702
Project: Hadoop YARN
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
*Replication Steps:*
Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
{code:java}
spark.executor.memory 1024M
spark.driver.memory 2048M
spark.executor.cores 1
spark.executor.instances 20
spark.dynamicAllocation.enabled false{code}
Based on the setup, there should be 20 spark executors, but from the
ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of
them were released in seconds. On analyzing the Spark ApplicationMaster (AM)
logs, The following logs were observed.
{code:java}
4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for
ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom
resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647>
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN,
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN,
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN,
launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN,
launching executors on 0 of them.
{code}
It was clear for the logs that extra allocated 12 containers are being ignored
from Spark side. Inorder to debug this further, additional log lines were added
to
[AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
class in increment and decrement of container request to expose additional
information about the request.
{code:java}
2024-06-24 14:10:14,075 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC
Server handler 42 on default port 8030): Updates PendingContainers: 0
Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,077 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,111 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,112 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,113 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,327 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,328 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,362 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,363 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,364 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24
14:10:14,454 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC
Server handler 35 on default port 8030): Updates PendingContainers: 4
Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,454 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC
Server handler 35 on default port 8030): Updates PendingContainers: 0
Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,578 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,579 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,614 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,615 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,829 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:14,864 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24
14:10:14,874 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC
Server handler 42 on default port 8030): Updates PendingContainers: 0
Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,080 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,081 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:15,115 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers:
1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0,
containerToUpdate=null} for: appattempt_1719234929152_0004_000001
2024-06-24 14:10:17,931 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC
Server handler 3 on default port 8030): checking for deactivate of application
:application_1719234929152_0004
2024-06-24 14:10:20,743 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo
(SchedulerEventDispatcher:Event Processor): Application
application_1719234929152_0004 requests cleared {code}
*RCA*
It looks like the issue is happening due to asynchronous nature of AMRMClient
response communication. The newly allocated containers from Yarn is send as
part of the next request from the client causing additional additional. Please
check the following example. As per the above mentioned logs
# At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn,
since this is the first request, yarn had not allocated any containers and send
0 containers as response.
# At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which
suggests that before this time there was an empty request ask to Yarn which had
returned 8 containers as allocated in reponse. For this request was well the
Yarn returned 8 container as allocated in response.
# At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn
for which it returns 12 containers allocated (based on previous request) as
response.
# Since all the container request is exhausted, AM sends empty request to yarn
which returns 4 containers allocated (based on previous request)
# So in total 32 containers were allocated for 20 container request. The
problem here is AM (client) is being notified about the allocated container
request in the next heartbeat/request causing inconsistency.
*Solution*
Normalize/update the AM container request to consdier the allocated containers.
This can be done in the respective scheduler side i.e newContainerAsk =
AMContainerAsk - AllocatedContainers.
*PS: I could see similar behavior with Apache Tez.*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]