[
https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan resolved YARN-11730.
-------------------------------
Hadoop Flags: Reviewed
Resolution: Fixed
> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
> Key: YARN-11730
> URL: https://issues.apache.org/jira/browse/YARN-11730
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, yarn
> Affects Versions: 3.4.0
> Environment: Tested on multiple environments:
> A. *Docker Environment:*
> * Base OS: *Ubuntu 20.04*
> * *Java 8* installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for
> YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed NameNode formatting and validated service interactions through
> exposed ports.
> * Repo reference:
> [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]
> B. *Bare-metal Distributed Setup (RedHat Linux):*
> * Running *Java 8* in a High-Availability (HA) configuration with
> *ZooKeeper* as the locking mechanism.
> * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM
> node, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved
> between LOST, ACTIVE, and other states as expected.
> Reporter: Arjun Mohnot
> Assignee: Arjun Mohnot
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the _"include"_ file
> are not reported until their corresponding NodeManagers (NMs) send their
> first heartbeat. Nodes in the _"exclude"_ file, however, are instantly
> reflected in the _"Decommissioned Hosts"_ section with a port value of -1.
> This design creates several challenges:
> * {*}Untracked NodeManagers{*}: During ResourceManager HA failover or a
> standalone RM restart, some nodes may not report back even though they are
> listed in the _"include"_ file. These nodes neither appear in the _LOST_
> state nor are represented in the RM's JMX metrics, leaving them in an
> untracked state that is difficult to monitor. HDFS, by contrast, marks
> comparable nodes as {_}"DEAD"{_}.
> * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until
> they send their first heartbeat. This delay hurts real-time cluster
> monitoring: the ResourceManager's view of the total node count is incomplete
> until every node has reported.
> * {*}Operational Impact{*}: These unreported nodes cause operational
> difficulties, particularly in automated workflows such as OS Upgrade
> Automation (OSUA) and node recovery automation, where validation depends on
> nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, or
> {_}DECOMMISSIONED{_}. Determining the actual status of nodes that never
> report requires fragile workarounds.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the _LOST_ state
> at RM startup or HA failover to any node that is listed in the _"include"_
> file but is neither registered nor present in the exclude file. This can be
> done by marking the node with a special port value of {_}-2{_}, signaling
> that the node is considered LOST but has not yet reported. Whenever a
> heartbeat is received for that {color:#de350b}nodeID{color}, the node is
> transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or whatever
> state is required.
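The proposed lifecycle can be sketched as a tiny, self-contained state model (plain Java, no Hadoop dependencies; the class, field, and method names and the {{UNTRACKED_PORT}} constant are illustrative, not the actual RM code):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the proposal: include-file hosts that never registered
// are seeded as LOST with sentinel port -2, and a later heartbeat moves them
// back to RUNNING with their real port.
class LostNodeSketch {
    enum NodeState { LOST, RUNNING, UNHEALTHY }
    static final int UNTRACKED_PORT = -2; // sentinel: LOST but never reported

    static class Node {
        NodeState state;
        int port;
        Node(NodeState state, int port) { this.state = state; this.port = port; }
    }

    final Map<String, Node> nodes = new HashMap<>();

    // At RM startup / HA failover: seed every unregistered include-file host
    // as LOST so it has a well-defined state from the start.
    void seedUnregistered(Iterable<String> includeHosts) {
        for (String host : includeHosts) {
            nodes.putIfAbsent(host, new Node(NodeState.LOST, UNTRACKED_PORT));
        }
    }

    // First heartbeat: replace the sentinel entry with a live RUNNING node.
    void onHeartbeat(String host, int realPort) {
        Node n = nodes.get(host);
        if (n != null && n.state == NodeState.LOST && n.port == UNTRACKED_PORT) {
            n.state = NodeState.RUNNING;
            n.port = realPort;
        }
    }
}
```

A node that never heartbeats simply stays in the seeded _LOST_/-2 entry, which is exactly what makes it visible to monitoring.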
> h3. Key implementation points
> * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file that are not
> part of the RM's active node context should be automatically marked as
> {_}LOST{_}. This can be achieved by modifying the _NodesListManager_ in the
> {color:#de350b}refreshHostsReader{color} method, which is invoked during
> failover and manual node refresh operations. The logic should ensure that
> all unregistered nodes are moved to the _LOST_ state, with port _-2_
> indicating that the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup
> to mark nodes as _LOST_ initially, and they will gradually transition to
> their desired state when the heartbeat is received.
> * Handle Node Heartbeat and Transition: When a node sends its first
> heartbeat, the system should check whether the node is listed in
> {color:#de350b}getInactiveRMNodes(){color}. If the node exists there in the
> _LOST_ state, the RM should remove it from the inactive list, decrement the
> _LOST_ node count, and transition it back into the active node set.
> * This logic can be placed in the state transition methods of
> {color:#de350b}RMNodeImpl.java{color}, ensuring that nodes transition from
> _NEW_ to _LOST_ and recover gracefully from _LOST_ upon receiving their
> heartbeat.
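The refresh-time selection described in the first bullet reduces to a simple set computation. A hedged sketch (the helper name is invented for illustration; in the real patch this logic would live in {{NodesListManager#refreshHostsReader}}):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: which hosts should be seeded as LOST during a host-list refresh?
// Answer: include-file hosts that are neither excluded nor already registered.
class RefreshSketch {
    static Set<String> computeUntracked(Set<String> include,
                                        Set<String> exclude,
                                        Set<String> registered) {
        Set<String> untracked = new HashSet<>(include); // work on a copy
        untracked.removeAll(exclude);    // excluded hosts stay Decommissioned
        untracked.removeAll(registered); // registered hosts are already tracked
        return untracked;                // these get state LOST, port -2
    }
}
```

Keeping the computation purely set-based makes it cheap to rerun on every failover or manual {{rmadmin -refreshNodes}} invocation.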
> h3. Benefits
> * {*}Improved Cluster Monitoring{*}: Automatically assigning a _LOST_ state
> to nodes that are listed in the _"include"_ file but not reporting ensures
> that every node in the cluster has a well-defined state ({_}ACTIVE{_},
> {_}LOST{_}, {_}DECOMMISSIONED{_}, {_}UNHEALTHY{_}, etc.). This eliminates
> gaps in cluster node visibility and simplifies operational monitoring.
> * {*}Better Recovery Management{*}: By marking unreported nodes as
> {_}LOST{_}, automation can quickly identify which nodes require attention
> during recovery efforts to restore cluster health. This prevents confusion
> between unreachable nodes and untracked nodes, improving recovery accuracy.
> * {*}Enhanced Cluster Stability{*}: This approach improves overall stability
> by preventing nodes from slipping into an untracked or unknown state. It
> guarantees that the system remains aware of all nodes, reducing issues during
> RM failover or restart scenarios.
> h3. Additional Considerations
> * Feature Flag Control: This feature can be enabled or disabled via a
> configuration flag, allowing users to adjust the behavior to their
> requirements. It is disabled ({_}false{_}) by default.
> * Validation: The approach has been tested on both non-HA and HA setups, and
> a Docker-based
> [setup|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] has been
> created to replicate the behavior. Unit tests have been added to validate
> the change, and a demo
> [video|https://drive.google.com/file/d/1okiPe7uMNVMRUnNYtz-B8Igf8FMGr-SJ/view?usp=sharing]
> is available.
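The feature flag would be set in {{yarn-site.xml}} like any other RM boolean. The property name below is purely hypothetical, used only to show the shape of the configuration; the actual key is defined by the linked pull request:

```xml
<property>
  <!-- Hypothetical key; see the linked PR for the real property name. -->
  <name>yarn.resourcemanager.mark-unregistered-include-hosts-lost.enabled</name>
  <value>false</value> <!-- disabled by default, per the description above -->
</property>
```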
>
> Any thoughts/suggestions/feedback are welcome!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)