Syed Shameerur Rahman created YARN-11697:
--------------------------------------------
Summary: Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
Key: YARN-11697
URL: https://issues.apache.org/jira/browse/YARN-11697
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.1
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
In Hadoop 3.2.1, the ResourceManager (RM) restarts frequently with the
following exception:
{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_000001 Alloc: <memory:0, vCores:0> does not exist in queue [root.tier2.livy, demand=<memory:10826752, vCores:2101>, running=<memory:99328, vCores:17>, share=<memory:6201984, vCores:0>, w=1.0]
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:750)
{code}
The exception looks similar to the one reported in YARN-5136, but there still
appear to be some edge cases that YARN-5136 does not cover.
1. On a deeper look, I could see that, as mentioned in the comment here, if
calls to moveApplication and removeApplicationAttempt for the same attempt are
processed in short succession, the application attempt will still contain a
queue reference but will already have been removed from the queue's list of
applications.
2. This can happen when
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
removes the appAttempt from the queue and
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
also tries to remove the same appAttempt from the queue.
3. On further checking, I could see that
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
takes the writeLock on the appAttempt before proceeding, whereas
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
does not appear to take any write lock, which can result in a race condition
if the same appAttempt is being processed by both.
4. Additionally, as mentioned in the comment here, when such a scenario occurs
we should ideally not take down the RM.
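
To make points 1 to 4 concrete, here is a minimal, self-contained sketch of one
possible direction. It is not the actual patch and does not use the real YARN
classes: SketchAttempt, SketchQueue, SketchScheduler and RaceSketchDemo are
hypothetical stand-ins invented for illustration. The assumptions are that both
the move and the removal take the attempt's write lock, and that a missing
attempt is reported through a return value instead of an
IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins; NOT the real FSAppAttempt, FSLeafQueue
// or FairScheduler classes.
class SketchAttempt {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    SketchQueue queue; // guarded by lock
}

class SketchQueue {
    private final List<SketchAttempt> apps = new ArrayList<>();

    synchronized void addApp(SketchAttempt attempt) {
        apps.add(attempt);
    }

    // Assumption: report a missing attempt via the return value
    // rather than throwing, so the caller decides how to react.
    synchronized boolean removeApp(SketchAttempt attempt) {
        return apps.remove(attempt);
    }

    synchronized boolean contains(SketchAttempt attempt) {
        return apps.contains(attempt);
    }
}

class SketchScheduler {
    void addApplicationAttempt(SketchAttempt attempt, SketchQueue queue) {
        attempt.lock.writeLock().lock();
        try {
            queue.addApp(attempt);
            attempt.queue = queue;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    void moveApplication(SketchAttempt attempt, SketchQueue target) {
        attempt.lock.writeLock().lock();
        try {
            SketchQueue source = attempt.queue;
            if (source == null || !source.removeApp(attempt)) {
                return; // attempt was already removed; nothing to move
            }
            target.addApp(attempt);
            attempt.queue = target;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    // Takes the same write lock as moveApplication, so the queue
    // reference and the queue's app list cannot be seen out of sync.
    boolean removeApplicationAttempt(SketchAttempt attempt) {
        attempt.lock.writeLock().lock();
        try {
            SketchQueue current = attempt.queue;
            attempt.queue = null;
            if (current == null || !current.removeApp(attempt)) {
                System.err.println("Attempt not in queue; skipping removal");
                return false;
            }
            return true;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }
}

class RaceSketchDemo {
    // Runs one move and one removal concurrently; returns true when
    // the attempt ends up in neither queue.
    static boolean raceOnce() {
        SketchScheduler scheduler = new SketchScheduler();
        SketchQueue source = new SketchQueue();
        SketchQueue target = new SketchQueue();
        SketchAttempt attempt = new SketchAttempt();
        scheduler.addApplicationAttempt(attempt, source);

        Thread mover = new Thread(() -> scheduler.moveApplication(attempt, target));
        Thread remover = new Thread(() -> scheduler.removeApplicationAttempt(attempt));
        mover.start();
        remover.start();
        try {
            mover.join();
            remover.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !source.contains(attempt) && !target.contains(attempt);
    }

    public static void main(String[] args) {
        System.out.println("attempt fully removed: " + raceOnce());
    }
}
```

Whichever thread wins the lock, the attempt ends up in neither queue and no
exception escapes to the event dispatcher. Whether the real fix should log and
continue or handle the missing attempt differently is left open here.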