Aswin M Prabhu created YARN-11718:
-------------------------------------

             Summary: Provide config option to not shutdown NM if it is 
decommissioned
                 Key: YARN-11718
                 URL: https://issues.apache.org/jira/browse/YARN-11718
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
            Reporter: Aswin M Prabhu


Currently, an NM cannot be started if it is marked as decommissioned on the RM 
(in the exclude list), because the RM sends a SHUTDOWN signal when the NM tries 
to send a heartbeat after starting up:

 
{code:java}
    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId)) {
      String message =
          "Disallowed NodeManager from  " + host
              + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }
{code}
This couples the start/stop operations of the NM service very tightly to its 
state in the RM, making it difficult to manage large fleets of NMs 
independently of the RM.

For example, with such an option, after an OS upgrade on an NM host we would be 
able to start the NM, recommission it, and then check its state without 
worrying about the order of the start and recommission operations. This matters 
most when we don't control the start operation, which is the case in large 
companies where the start operation is part of the OS upgrade pipeline.

The patch will look something like this:
{code:java}
    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId) &&
        !this.noNMShutdownForInvalidNodes) {
      String message =
          "Disallowed NodeManager from  " + host
              + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }
{code}
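To make the effect of the extra condition concrete, here is a minimal standalone sketch of the guard. It is not the actual patch. The class name, the property key, and the use of java.util.Properties in place of the Hadoop Configuration object are all assumptions made for illustration, and NodeAction here is a simplified stand-in for the YARN enum. Only the field name noNMShutdownForInvalidNodes comes from the snippet above.

```java
import java.util.Properties;

public class RegistrationGuardSketch {

    /** Simplified stand-in for YARN's NodeAction; only two values are modeled. */
    public enum NodeAction { NORMAL, SHUTDOWN }

    // Hypothetical property key; the issue does not name the config option.
    public static final String NO_SHUTDOWN_KEY =
        "example.rm.no-nm-shutdown-for-invalid-nodes";

    private final boolean noNMShutdownForInvalidNodes;

    public RegistrationGuardSketch(Properties conf) {
        // Default to false so the existing SHUTDOWN behavior is kept unless
        // an operator explicitly opts in.
        this.noNMShutdownForInvalidNodes =
            Boolean.parseBoolean(conf.getProperty(NO_SHUTDOWN_KEY, "false"));
    }

    /**
     * Mirrors the three-part condition from the proposed patch. In the real
     * RM, passing this guard only means the later checks still run; NORMAL is
     * returned here just to keep the sketch small.
     */
    public NodeAction decide(boolean isValidNode, boolean isDecommissioning) {
        if (!isValidNode && !isDecommissioning && !noNMShutdownForInvalidNodes) {
            return NodeAction.SHUTDOWN;
        }
        return NodeAction.NORMAL;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        Properties optedIn = new Properties();
        optedIn.setProperty(NO_SHUTDOWN_KEY, "true");

        // An excluded node that is not in the decommissioning state.
        System.out.println(new RegistrationGuardSketch(defaults).decide(false, false));
        System.out.println(new RegistrationGuardSketch(optedIn).decide(false, false));
    }
}
```

With the flag left at its default, the excluded node is told to shut down as it is today. With the flag set, the same node passes the guard, so the NM process can stay up until it is recommissioned.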
 


