[jira] [Created] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts

Arjun Mohnot (Jira) Mon, 16 Sep 2024 10:33:04 -0700

Arjun Mohnot created YARN-11730:
-----------------------------------

             Summary: Resourcemanager node reporting enhancement for 
unregistered hosts
                 Key: YARN-11730
                 URL: https://issues.apache.org/jira/browse/YARN-11730
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager, yarn
    Affects Versions: 3.4.0
         Environment: Tested on multiple environments:


A. Docker Environment:
 * Base OS: *Ubuntu 20.04*
 * *Java 8* installed from OpenJDK.
 * Docker image includes Hadoop binaries, user configurations, and ports for 
YARN services.
 * Verified behavior using a Hadoop snapshot in a containerized environment.
 * Performed Namenode formatting and validated service interactions through 
exposed ports.
 * Repo reference: 
[arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]

B. Bare-metal Distributed Setup (RedHat Linux):
 * Running *Java 8* in a High-Availability (HA) configuration with *ZooKeeper* 
as the locking mechanism.
 * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM 
nodes, including state retention and proper node state transitions.
 * Verified node state transitions during RM failover, ensuring nodes moved 
between LOST, ACTIVE, and other states as expected.
            Reporter: Arjun Mohnot
             Fix For: 3.5.0


h3. Issue Overview

When the ResourceManager (RM) starts, nodes listed in the _"include"_ file are 
not reported until their corresponding NodeManagers (NMs) send their first 
heartbeat. Nodes in the _"exclude"_ file, however, are reflected immediately 
in the _"Decommissioned Hosts"_ section with a port value of -1.

This design creates several challenges:
 * {*}Untracked NodeManagers{*}: During ResourceManager HA failover or a 
standalone RM restart, some nodes may not report back even though they are 
listed in the _"include"_ file. These nodes neither appear in the _LOST_ state 
nor are represented in the RM's JMX metrics, which leaves them untracked and 
makes their status difficult to monitor. HDFS, by comparison, marks nodes in a 
similar situation as {_}"DEAD"{_}.

 * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until 
they send their first heartbeat. This delay hampers real-time cluster 
monitoring, because the RM's total node count does not include these nodes 
until they report.
 * {*}Operational Impact{*}: These unreported nodes cause difficulties for 
automated workflows such as OS Upgrade Automation (OSUA) and node recovery 
automation, where validation depends on nodes being reflected in JMX as 
{_}LOST{_}, {_}UNHEALTHY{_}, {_}DECOMMISSIONED{_}, etc. Determining the 
accurate status of nodes that never report requires ad hoc workarounds.

h3. Proposed Solution

To address these issues, we propose automatically assigning the _LOST_ state, 
at RM startup or HA failover, to any node that is listed in the _"include"_ 
file but has not registered. Such a node can be marked with a special port 
value of {_}-2{_}, signaling that it is considered LOST but has not yet 
reported. When a heartbeat is received for that 
{color:#de350b}nodeID{color}, the node transitions from _LOST_ to 
{_}RUNNING{_}, {_}UNHEALTHY{_}, or whichever other state applies.
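
As a rough illustration of the idea (not the actual patch), the sketch below 
seeds every include-file host that has no registered NodeManager as LOST with 
the placeholder port -2. The class and method names 
({{UnregisteredHostSketch}}, {{seedLostNodes}}) and the plain sets standing in 
for the RM context are hypothetical; only the -2 convention and the LOST state 
come from this proposal.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, self-contained model; it does not use real YARN classes.
class UnregisteredHostSketch {
    // Placeholder port proposed in this issue for hosts that never registered.
    static final int UNREGISTERED_PORT = -2;

    enum State { RUNNING, LOST }

    /** Simplified stand-in for a node entry tracked by the RM. */
    static final class Node {
        final String host;
        final int port;
        final State state;

        Node(String host, int port, State state) {
            this.host = host;
            this.port = port;
            this.state = state;
        }
    }

    /**
     * Builds LOST placeholder entries for every included host that has no
     * registered node. Hosts that are already registered are left alone.
     */
    static Map<String, Node> seedLostNodes(Set<String> includedHosts,
                                           Set<String> registeredHosts) {
        Map<String, Node> inactive = new LinkedHashMap<>();
        for (String host : includedHosts) {
            if (!registeredHosts.contains(host)) {
                inactive.put(host, new Node(host, UNREGISTERED_PORT, State.LOST));
            }
        }
        return inactive;
    }

    public static void main(String[] args) {
        Set<String> included = new TreeSet<>(Arrays.asList("nm1", "nm2", "nm3"));
        Set<String> registered = new TreeSet<>(Arrays.asList("nm2"));
        for (Node n : seedLostNodes(included, registered).values()) {
            System.out.println(n.host + ":" + n.port + " " + n.state);
        }
    }
}
```

In this model a monitoring script could tell "never registered" hosts apart 
from decommissioned ones (port -1) by the port value alone.
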
h3. Key implementation points
 * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file that are not 
part of the RM active node context should be automatically marked as 
{_}LOST{_}. This can be achieved by modifying the 
{color:#de350b}refreshHostsReader{color} method of _NodesListManager_, which 
is invoked during failover and manual node refresh operations. The logic 
should ensure that all unregistered nodes are moved to the _LOST_ state, with 
port _-2_ indicating that the node is untracked.

 * For non-HA setups, this process can be triggered during RM service startup 
to mark nodes as _LOST_ initially; they then transition to their actual state 
as heartbeats are received.

 * Handle Node Heartbeat and Transition: When a node sends its first heartbeat, 
the system should verify if the node is listed in 
{color:#de350b}getInactiveRMNodes(){color}. If the node exists in the _LOST_ 
state, the RM should remove it from the inactive list, decrement the _LOST_ 
node count, and handle the transition back to the active node set.

 * This logic can be placed in the state transition handling within 
{color:#de350b}RMNodeImpl.java{color}, ensuring that nodes can transition from 
_NEW_ to _LOST_ and recover gracefully from the _LOST_ state upon receiving a 
heartbeat.
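
The heartbeat handling described in the points above can be sketched roughly 
as follows. This is a simplified model under stated assumptions, not the real 
{{RMNodeImpl}} state machine: the class {{HeartbeatTransitionSketch}}, its 
maps, and its counter are hypothetical stand-ins for the RM's inactive node 
list, active node set, and LOST-node metric.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, self-contained model; it does not use real YARN classes.
class HeartbeatTransitionSketch {
    enum State { RUNNING, UNHEALTHY, LOST }

    // Stand-in for the RM's inactive-node map, keyed by host name.
    private final Map<String, State> inactiveNodes = new HashMap<>();
    // Stand-in for the RM's active-node map.
    private final Map<String, State> activeNodes = new HashMap<>();
    // Stand-in for the LOST-node count exposed through metrics.
    private int lostCount = 0;

    /** Startup or failover: mark an unregistered host as LOST exactly once. */
    void markLost(String host) {
        if (activeNodes.containsKey(host)) {
            return; // already registered, nothing to do
        }
        State previous = inactiveNodes.put(host, State.LOST);
        if (previous != State.LOST) {
            lostCount++;
        }
    }

    /** Heartbeat: leave the inactive list, fix the counter, become active. */
    void onHeartbeat(String host, boolean healthy) {
        State previous = inactiveNodes.remove(host);
        if (previous == State.LOST) {
            lostCount--;
        }
        activeNodes.put(host, healthy ? State.RUNNING : State.UNHEALTHY);
    }

    int getLostCount() {
        return lostCount;
    }

    /** Returns the tracked state, or null if the host is unknown. */
    State stateOf(String host) {
        State active = activeNodes.get(host);
        return active != null ? active : inactiveNodes.get(host);
    }

    public static void main(String[] args) {
        HeartbeatTransitionSketch rm = new HeartbeatTransitionSketch();
        rm.markLost("nm1");
        rm.markLost("nm2");
        rm.onHeartbeat("nm1", true);
        System.out.println("nm1=" + rm.stateOf("nm1")
            + " nm2=" + rm.stateOf("nm2")
            + " lost=" + rm.getLostCount());
    }
}
```

Guarding the counter on the previous state keeps the LOST count consistent 
when a refresh runs more than once or when a host that was never marked LOST 
sends a heartbeat.
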

h3. Benefits
 * {*}Improved Cluster Monitoring{*}: Automatically assigning a _LOST_ state to 
nodes listed in the _"include"_ file but not reporting ensures that every node 
in the cluster has a well-defined state ({_}ACTIVE{_}, {_}LOST{_}, 
{_}DECOMMISSIONED{_}, {_}UNHEALTHY, etc{_}). This eliminates any potential gaps 
in cluster node visibility and simplifies operational monitoring.

 * {*}Better Recovery Management{*}: By marking unreported nodes as {_}LOST{_}, 
automation can quickly identify which nodes require attention during recovery 
efforts to restore cluster health. This prevents confusion between unreachable 
nodes and untracked nodes, improving recovery accuracy.

 * {*}Enhanced Cluster Stability{*}: This approach improves overall stability 
by preventing nodes from slipping into an untracked or unknown state. It 
guarantees that the system remains aware of all nodes, reducing issues during 
RM failover or restart scenarios.

h3. Additional Considerations
 * Feature Flag Control: This feature can be enabled or disabled via a 
configuration flag, allowing users to adjust behavior based on their 
requirements. It is disabled by default ({_}False{_}).
 * Validation: The approach has been tested on non-HA and HA setups, and a 
Docker-based 
[setup|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] has been 
created to replicate the behavior. Unit test cases have been added to validate 
the code behavior. A demo 
[video|https://drive.google.com/file/d/1okiPe7uMNVMRUnNYtz-B8Igf8FMGr-SJ/view?usp=sharing]
 is available for this change.
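
The feature-flag gating mentioned above might look like the minimal sketch 
below. The property key is purely hypothetical (this issue does not name the 
flag), and {{java.util.Properties}} stands in for the Hadoop configuration 
object; the only behavior taken from the proposal is that the feature is off 
unless explicitly enabled.

```java
import java.util.Properties;

// Hypothetical, self-contained model; it does not use real YARN classes.
class FeatureFlagSketch {
    // Hypothetical key, chosen for illustration only.
    static final String FLAG_KEY = "example.rm.mark-unregistered-hosts-lost.enabled";

    /** Returns true only when the flag is explicitly set to "true". */
    static boolean isEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(FLAG_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default: " + isEnabled(conf));
        conf.setProperty(FLAG_KEY, "true");
        System.out.println("after enabling: " + isEnabled(conf));
    }
}
```
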

 

Any thoughts/suggestions/feedback are welcome!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

