arjunmohnot commented on code in PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#discussion_r1763997947
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##########
@@ -1608,9 +1608,19 @@ protected void serviceStart() throws Exception {
// Non HA case, start after RM services are started.
if (!this.rmContext.isHAEnabled()) {
transitionToActive();
+
+ // Refresh node state at the service startup to reflect the unregistered
+ // nodemanagers as LOST if the tracking for unregistered nodes flag is
enabled.
+ // For HA setup, refreshNodes is already being called during the
transition.
+ Configuration yarnConf = getConfig();
+ if (yarnConf.getBoolean(
+ YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+ YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+ this.rmContext.getNodesListManager().refreshNodes(yarnConf);
Review Comment:
Hey @zeekling, thanks for your question! After reviewing potential edge
cases and comparing the existing implementation, here’s a summary of the
different scenarios:
### Unregistered Lost Node Definition
- A node is marked as _LOST_ when listed in the _"include"_ file but not
registered in the ResourceManager's active node context, and is also not part
of the _"exclude"_ file during startup or HA failover. The unregistered lost
node is indicated by a port value of -2.
### **Case 1**: Node Marked as LOST, Heartbeat Received
- When a node is marked as **LOST**, a lost event is dispatched, adding the
node to the **active** and **inactive** node maps of the RMContext.
- If the node sends a heartbeat afterward, the transition method in
`RMNodeImpl` checks for the same hostname of that node with **port -2** (LOST
state).
- If found, the nodeID is removed from the `rmContext` and re-registered
with the desired port of the NM.
- It also decrements the LOST node counter and increments the ACTIVE node
counter, ensuring a clean state of transitions.
### **Case 2**: **Rare Scenario** - Race Condition
- A race condition may occur if the **ResourceTrackerService** starts before
the RM starts processing the unregistered lost nodes, and the NodeManager (NM)
sends its heartbeat **quickly** in parallel.
- **Example**:
- When fetching nodes from `rmContext`, an NM (say **host1**) may not
initially be present in the context.
- Before this operation completes, **host1** may send a heartbeat and
get registered with a valid port.
- Meanwhile, the RM could still attempt to mark the same NM (host1) as
LOST with port -2 as it was not registered while querying the context,
resulting in two entries for the same host: one as ACTIVE and another as LOST.
### Details on Case-2
- I reviewed the code and found that for a node to register, the
`ResourceTracker` service must start during service startup. In HA mode, nodes
only register once the RM becomes active.
- The current implementation for HA calls the `refreshNodes` function before
the `transitionToActive` method, which rules out the race condition for HA
setup since all unregistered nodes are dispatched first.
For standalone setups, there was a slight oversight. However, during
testing, I did not encounter or replicate this issue, as the NM heartbeat can
take time to register while the nodes were marked as LOST beforehand. With the
recent changes, I am making the `refreshNodes` operation before the service
starts. This change ensures that unregistered nodes are consistently marked as
LOST with **port -2** first. Only afterwards NMs can register themselves with a
proper port during heartbeat reception when the `ResourceTrackerService` starts
the server (Triggered during RM service start).
### Conclusion
The updated change should guarantee that the nodes are properly marked as
LOST **before** any heartbeats are processed. This eliminates the chance of
falsely reporting nodes as LOST. I’ve validated this behavior through logs,
which show that the lost event is dispatched first, and then the heartbeats
from NMs received after service start.
Let me know if this clears things up! 😊
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]