Gargi Jaiswal created HDDS-14725:
------------------------------------

             Summary: CLI commands querying SCM appear to hang when SCM is down 
due to silent client retries
                 Key: HDDS-14725
                 URL: https://issues.apache.org/jira/browse/HDDS-14725
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Gargi Jaiswal
            Assignee: Gargi Jaiswal


When all SCM instances are down or unreachable, CLI commands that query SCM 
(e.g. {{ozone admin datanode list}}, {{decommission}}, {{diskbalancer}}, 
{{usageinfo}}, {{maintenance}}, etc.) appear to hang for roughly 
*10–15 minutes* before failing.

This is due to SCM client retry configuration:

 
{code:java}
hdds.scmclient.rpc.timeout = 15m
hdds.scmclient.max.retry.timeout = 10m
hdds.scmclient.retry.interval = 2s
hdds.scmclient.max.retry = 15
{code}
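As a rough aid to reading these values, the sketch below works through the arithmetic only. It does not model how the SCM client actually combines the settings; the class and method names are hypothetical. If the retry count were capped at 15, the sleeps alone would total about 30 seconds, while a 10-minute window holds 300 sleeps of 2 seconds, so a 10–15 minute wait would have to come from the timeout settings or from time spent inside each attempt. That last point is an inference from the numbers, not a statement about Ozone internals.

```java
import java.time.Duration;

// Hypothetical helper: plain arithmetic on the values listed above,
// not code taken from Ozone.
class ScmRetryBudget {
    // Values copied from the configuration in this issue.
    static final Duration RETRY_INTERVAL = Duration.ofSeconds(2);
    static final Duration MAX_RETRY_TIMEOUT = Duration.ofMinutes(10);
    static final int MAX_RETRY = 15;

    /** Total time spent sleeping when each of the given retries waits one interval. */
    static Duration sleepTimeFor(int retries, Duration interval) {
        return interval.multipliedBy(retries);
    }

    /** Number of fixed-interval sleeps that fit inside a total retry window. */
    static long retriesWithin(Duration window, Duration interval) {
        return window.toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        // 15 retries x 2s
        System.out.println(sleepTimeFor(MAX_RETRY, RETRY_INTERVAL).getSeconds() + "s"); // prints "30s"
        // 10m / 2s
        System.out.println(retriesWithin(MAX_RETRY_TIMEOUT, RETRY_INTERVAL));           // prints "300"
    }
}
```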
 

Retries happen internally, but the CLI shows no feedback during this period, 
so the command appears to be stuck until it finally reports an error after 
about 15 minutes.

In contrast, in HDFS (e.g. when NameNodes are down), retry attempts are logged 
immediately via {{RetryInvocationHandler}}, providing clear user feedback.

*Proposed Fix:*

Surface the retry logs from {{SCMFailoverProxyProviderBase}} in the CLI 
output on {{stderr}}.

Example:

Current behaviour with all SCMs down:

{{bash-5.1$ ozone admin datanode list
<----------- appears stuck for about 15 minutes with no CLI output ----------->}}
Proposed behaviour for all commands querying SCM:

{{bash-5.1$ ozone admin datanode list
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 1 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 2 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 3 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 4 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 5 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 6 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 7 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 8 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 9 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 10 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 11 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 12 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm2 after 13 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm3 after 14 failover attempt(s). 
Trying to failover after sleeping for 2000ms.
UnknownHostException: Invalid host name, while invoking 
StorageContainerLocationProtocolPB over scm1 after 15 failover attempt(s). 
Trying to failover after sleeping for 2000ms.}}
{{Invalid host name: local host is: "om1/172.18.0.4"; destination host is: 
"scm1":9860; java.net.UnknownHostException: Invalid host name: local host is: 
"om1/172.18.0.4"; destination host is: "scm1":9860; 
java.net.UnknownHostException; For more details see: 
http://wiki.apache.org/hadoop/UnknownHost; For more details see: 
http://wiki.apache.org/hadoop/UnknownHost }}
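
The shape of the proposed output can be illustrated with a minimal sketch of a failover loop that prints one line to stderr per failed attempt. This is not the {{SCMFailoverProxyProviderBase}} implementation: the class {{RetryFeedbackSketch}}, the {{ScmCall}} interface, and the round-robin node selection are all assumptions made for illustration. Only the message wording is taken from the example above.

```java
import java.io.PrintStream;
import java.net.UnknownHostException;
import java.util.List;

// Hypothetical sketch; none of these names come from the Ozone code base.
class RetryFeedbackSketch {

    /** Stand-in for an RPC call against a single SCM node. */
    interface ScmCall<T> {
        T invoke(String scmNodeId) throws Exception;
    }

    /** Builds one retry line in the shape of the proposed CLI output. */
    static String retryMessage(Exception cause, String protocol, String nodeId,
                               int attempt, long sleepMillis) {
        return String.format(
            "%s: %s, while invoking %s over %s after %d failover attempt(s). "
                + "Trying to failover after sleeping for %dms.",
            cause.getClass().getSimpleName(), cause.getMessage(),
            protocol, nodeId, attempt, sleepMillis);
    }

    /** Tries the call up to maxRetry times, reporting every failure on err. */
    static <T> T callWithFeedback(ScmCall<T> call, List<String> nodeIds, String protocol,
                                  int maxRetry, long sleepMillis, PrintStream err)
            throws Exception {
        if (maxRetry < 1 || nodeIds.isEmpty()) {
            throw new IllegalArgumentException("need at least one retry and one node");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetry; attempt++) {
            // Assumed round-robin failover across the configured nodes.
            String nodeId = nodeIds.get(attempt % nodeIds.size());
            try {
                return call.invoke(nodeId);
            } catch (Exception e) {
                last = e;
                // Immediate feedback instead of a silent wait.
                err.println(retryMessage(e, protocol, nodeId, attempt, sleepMillis));
                Thread.sleep(sleepMillis);
            }
        }
        throw last; // all attempts failed: surface the final error
    }

    public static void main(String[] args) {
        try {
            callWithFeedback(
                node -> { throw new UnknownHostException("Invalid host name"); },
                List.of("scm1", "scm2", "scm3"),
                "StorageContainerLocationProtocolPB", 3, 10L, System.err);
        } catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Passing the {{PrintStream}} in as a parameter keeps the loop testable and makes the choice of stderr explicit, so retry messages stay out of any stdout output that scripts may parse.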

 


