[PR] YARN-11730. Mark unreported nodes as LOST on RM Startup/HA failover [hadoop]

via GitHub Mon, 16 Sep 2024 13:02:18 -0700


arjunmohnot opened a new pull request, #7049:
URL: https://github.com/apache/hadoop/pull/7049


   
   ### Description of PR
   #### 1. Overview
   When the ResourceManager starts, nodes listed in the "include" file are not 
reported until their corresponding NodeManagers send their first heartbeat. 
In contrast, nodes in the "exclude" file appear immediately in the 
"Decommissioned Hosts" section with a port value of -1.
   
   #### 2. Challenges
   1. **Untracked NodeManagers**: After a ResourceManager HA failover or a 
standalone RM restart, some nodes may never report back even though they are 
listed in the _"include"_ file. These nodes neither appear in the _LOST_ state 
nor show up in the RM's JMX metrics, so they are effectively untracked and hard 
to monitor. By contrast, HDFS marks datanodes in the comparable situation as 
_"DEAD"_.
   2. **Monitoring Gaps**: Nodes in the "include" file are not visible until 
they send their first heartbeat, which affects real-time cluster monitoring 
that depends on the cluster metrics sink.
   3. **Operational Impact**: Unreported nodes cause operational difficulties, 
particularly in automated workflows such as OS Upgrade Automation (OSUA) and 
node recovery automation, which need workarounds to determine the accurate 
status of nodes that don't report.
   
   #### 3. Proposed Solution
   To address these issues, the code automatically assigns the **_LOST_** state 
to nodes that are listed in the _"include"_ file, are not registered, and are 
not in the exclude file, at RM startup or during HA failover. Such a node is 
given a special port value of **-2**, marking it as LOST but not yet reported. 
Once a heartbeat is received for that node, it transitions from LOST to 
RUNNING, UNHEALTHY, or whichever other state applies.
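
The selection rule described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `UnreportedNodeSketch`, the method `findUnreported`, and the use of plain string sets are hypothetical and are not taken from the patch, which works on the RM's own node structures. Only the rule itself (included, not excluded, not registered) and the placeholder port -2 come from the description.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative model only; these names are hypothetical, not from the patch. */
final class UnreportedNodeSketch {
    /** Placeholder port for nodes marked LOST before they ever report (from the PR text). */
    static final int UNREPORTED_PORT = -2;

    /**
     * Returns placeholder node IDs ("host:-2") for every included host that is
     * neither in the exclude list nor already registered with the RM.
     */
    static Set<String> findUnreported(Set<String> include,
                                      Set<String> exclude,
                                      Set<String> registeredHosts) {
        Set<String> lost = new LinkedHashSet<>();
        for (String host : include) {
            if (!exclude.contains(host) && !registeredHosts.contains(host)) {
                lost.add(host + ":" + UNREPORTED_PORT);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        Set<String> include = new LinkedHashSet<>(Arrays.asList(
                "nodemanager1", "nodemanager2", "nodemanager3", "nodemanager4"));
        Set<String> exclude = Collections.singleton("nodemanager4");
        Set<String> registered = Collections.singleton("nodemanager3");
        // Prints [nodemanager1:-2, nodemanager2:-2]
        System.out.println(findUnreported(include, exclude, registered));
    }
}
```

In this model an excluded host is never marked LOST, which matches the description: excluded hosts are already shown as decommissioned with port -1.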
   
   #### 4. Key Implementation Points
   1. **Mark Unreported Nodes as LOST**:
      - **Class Modified**: `NodesListManager`
      - **Method**: `refreshHostsReader`
      - **Functionality**:
        - Automatically marks nodes listed in the **"include"** file as 
**LOST** if they are not part of the RM active node context.
        - For non-HA setups, this process is triggered during **RM service 
startup**, ensuring unregistered nodes are initially set to **LOST**.
        - Port value **-2** indicates that the node is untracked.
   
   2. **Handle Node Heartbeat and Transition**:
       - **Class Modified**: `RMNodeImpl`
       - **Method**: State transition method
       - **Functionality**:
         - Upon receiving the first heartbeat from a node, the system checks 
whether the node exists in the **LOST** state (that is, whether a nodeID with 
port -2 exists for that host) by looking it up in `getInactiveRMNodes()`.
         - If the node is found in the **LOST** state:
           - Remove the node from the inactive node list.
           - Remove the node from the active node list to register it with a 
new nodeID having its required port.
           - Maintain the hostname in the RM context for proper host tracking.
           - Decrement the count of **LOST** nodes.
           - Re-register the node with the new nodeID and transition it back to 
the active node set, ensuring it recovers gracefully from the **LOST** state.
         - This logic ensures a smooth transition for nodes from **NEW** to 
**LOST** and back to active upon heartbeat reception.
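
The heartbeat handling steps listed above can be sketched as a small self-contained model. The class `ReRegistrationSketch`, its fields, and its methods are hypothetical stand-ins chosen for illustration; the actual change lives in `RMNodeImpl` and uses the RM context, which is not reproduced here. The sketch only assumes what the description states: a placeholder nodeID with port -2, removal from the inactive and active lists, a decremented LOST count, and re-registration under the real port.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative model only; these names are hypothetical, not from the patch. */
final class ReRegistrationSketch {
    static final int UNREPORTED_PORT = -2;

    // nodeID ("host:port") -> state, standing in for the RM's inactive/active node maps.
    final Map<String, String> inactiveNodes = new HashMap<>();
    final Map<String, String> activeNodes = new HashMap<>();
    int lostCount = 0;

    /** Marks a host that has not reported yet as LOST under the placeholder port. */
    void markLost(String host) {
        inactiveNodes.put(host + ":" + UNREPORTED_PORT, "LOST");
        lostCount++;
    }

    /** Handles the first heartbeat from a host and returns its new nodeID. */
    String onFirstHeartbeat(String host, int port) {
        String placeholder = host + ":" + UNREPORTED_PORT;
        if ("LOST".equals(inactiveNodes.get(placeholder))) {
            inactiveNodes.remove(placeholder); // drop the placeholder from the inactive list
            activeNodes.remove(placeholder);   // and from the active list, if present
            lostCount--;                       // the host is no longer counted as LOST
        }
        String nodeId = host + ":" + port;     // same hostname, real registered port
        activeNodes.put(nodeId, "RUNNING");
        return nodeId;
    }
}
```

For example, calling `markLost("nodemanager1")` followed by `onFirstHeartbeat("nodemanager1", 38065)` leaves `lostCount` at 0 and a single active entry keyed `nodemanager1:38065`.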
   
   #### 5. Flow Diagram
     ```text
       +---------------------------+
       |  RM Startup / HA Failover |
       +---------------------------+
                |
                v
       Check Nodes in RM Context
                |
                +-----------------------------+
                |                             |
       Not Registered & Not in         Registered or in
       Exclude File                    Exclude File
                |                             |
                v                             v
        Mark Node as LOST (port -2)   Node processed normally
                |
                v
        Wait for Heartbeat
                |
                v
        Receive Heartbeat
                |
                v
       Node State Check
                |
                +------------------------------------v
                |                                    |
       Previous NodeID Removed                Same Hostname 
       With Port -2                           Still Remains in the RM Context
                |                                    |
                |                                    |
                v                                    |
          [No Further Transition]                    |
                                                     |
                                                     |
              Handle Node Re-Registration -----------+
                            |
                            v
              New NodeID (Same Hostname + New Registered Port)
                            |
                            v
              +-----------------------+
              |   Transition to       |
              |   ACTIVE/RUNNING      |
              +-----------------------+
   
     ```
   
   #### 6. Additional Considerations
   - **Feature Flag Control**: This feature can be enabled/disabled via a 
configuration flag (default: False).
     ```xml
     <property>
       <name>yarn.resourcemanager.enable-tracking-for-unregistered-nodes</name>
       <value>false</value>
     </property>
     ```
   
   ### How was this patch tested?
   - The implementation has been tested in multiple environments, in both 
non-HA and HA setups. A Docker-based test 
[setup](https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main) was 
created to reproduce the behavior, and a bare-metal HA setup with ZooKeeper was 
used to validate failover between the HA1 and HA2 RM nodes.
   - Added the required unit test cases, which pass with the modified code.
     ```bash
     [INFO] -------------------------------------------------------
     [INFO]  T E S T S
     [INFO] -------------------------------------------------------
     [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
     [INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 80.975 s - in org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
     [INFO]
     [INFO] Results:
     [INFO]
     [INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0
     [INFO]
     [INFO] ------------------------------------------------------------------------
     [INFO] BUILD SUCCESS
     [INFO] ------------------------------------------------------------------------
     [INFO] Total time:  02:10 min
     [INFO] Finished at: 2024-09-16T20:31:18+05:30
     [INFO] ------------------------------------------------------------------------
     ```
   - Build and code compilation also succeeded.
     ```bash
     [INFO] ------------------------------------------------------------------------
     [INFO] BUILD SUCCESS
     [INFO] ------------------------------------------------------------------------
     [INFO] Total time:  29:27 min
     [INFO] Finished at: 2024-09-16T21:25:07+05:30
     [INFO] ------------------------------------------------------------------------
     ```
   - Reference Logs:
     - Event dispatch to mark the qualified unregistered nodes as LOST:
       ```bash
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:427) - Lost node: nodemanager2
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:427) - Lost node: nodemanager1
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:427) - Lost node: nodemanager3
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:491) - Successfully dispatched LOST event and deactivated node: nodemanager2, Node ID: nodemanager2:-2
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:491) - Successfully dispatched LOST event and deactivated node: nodemanager1, Node ID: nodemanager1:-2
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:491) - Successfully dispatched LOST event and deactivated node: nodemanager3, Node ID: nodemanager3:-2
       2024-09-16 17:16:28,550 INFO  resourcemanager.NodesListManager (NodesListManager.java:438) - Successfully marked unregistered nodes as LOST
       2024-09-16 17:16:28,552 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:1232) - Deactivating Node nodemanager2:-2 as it is now LOST
       2024-09-16 17:16:29,278 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager2:-2 Node Transitioned from NEW to LOST
       2024-09-16 17:16:29,278 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:1232) - Deactivating Node nodemanager1:-2 as it is now LOST
       2024-09-16 17:16:30,068 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager1:-2 Node Transitioned from NEW to LOST
       2024-09-16 17:16:30,069 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:1232) - Deactivating Node nodemanager3:-2 as it is now LOST
       2024-09-16 17:16:30,778 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager3:-2 Node Transitioned from NEW to LOST
       ```
     - Once a heartbeat is received, the RM re-registers each host and moves 
it to the RUNNING state:
       ```bash
       2024-09-16 17:17:12,672 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager1:38065 Node Transitioned from NEW to RUNNING
       2024-09-16 17:17:12,673 INFO  capacity.CapacityScheduler (CapacityScheduler.java:2249) - Added node nodemanager1:38065 clusterResource: <memory:16384, vCores:16>
       2024-09-16 17:17:12,732 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager2:40107 Node Transitioned from NEW to RUNNING
       2024-09-16 17:17:12,734 INFO  capacity.CapacityScheduler (CapacityScheduler.java:2249) - Added node nodemanager2:40107 clusterResource: <memory:24576, vCores:24>
       2024-09-16 17:17:12,262 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:785) - nodemanager3:39725 Node Transitioned from NEW to RUNNING
       2024-09-16 17:17:12,265 INFO  capacity.CapacityScheduler (CapacityScheduler.java:2249) - Added node nodemanager3:39725 clusterResource: <memory:8192, vCores:8>
       ```
   
   - Reference screenshots and video showing the LOST nodes and their 
transition to active after NM re-registration (note that the NMs were 
automatically marked as LOST even though none of them had sent an initial 
heartbeat at RM startup):
     <img width="1726" alt="Unregistered lost nodes" 
src="https://github.com/user-attachments/assets/796df3b5-78b1-4ce8-98d4-f65ef23d5e13">
     <img width="1724" alt="Nodes transitioned from lost to active" 
src="https://github.com/user-attachments/assets/0b995e92-d47b-43f1-8398-69de781bf5b7">
   
   
     
https://github.com/user-attachments/assets/c8bbe1c7-ddb6-4ba9-bdf3-1d92be908f51
   
     
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

