[jira] [Created] (HDFS-16601) Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

ZanderXu (Jira) Fri, 27 May 2022 21:29:09 -0700

ZanderXu created HDFS-16601:
-------------------------------

             Summary: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try
                 Key: HDFS-16601
                 URL: https://issues.apache.org/jira/browse/HDFS-16601
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: ZanderXu
            Assignee: ZanderXu



In our production environment, we hit a bug with the following stack trace:

{code:java}
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
        at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
        at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
{code}

The root cause is that DFSClient cannot detect an exception thrown by 
TransferBlock during PipelineRecovery. If TransferBlock fails, the DFSClient 
retries with every datanode in the cluster and then fails.
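
The failure mode described above can be illustrated with a small, 
self-contained model. This is only a sketch and not the DataStreamer code: the 
class PipelineRecoverySketch, the method replaceDatanode, and the transferOk 
predicate are hypothetical names, and the sketch assumes that each failed 
transfer simply causes the candidate target to be excluded before the next 
candidate is tried.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Simplified model of the retry behaviour described in this issue.
// All names here are hypothetical; this is NOT the HDFS client code.
class PipelineRecoverySketch {

    /**
     * Tries to find a replacement datanode for the pipeline.
     *
     * @param cluster    all datanodes in the (modelled) cluster
     * @param pipeline   datanodes already in the pipeline
     * @param transferOk stand-in for the block transfer: returns true if the
     *                   transfer to the given candidate succeeds
     * @param tried      output list recording every candidate that was tried
     */
    static String replaceDatanode(List<String> cluster,
                                  List<String> pipeline,
                                  Predicate<String> transferOk,
                                  List<String> tried) throws IOException {
        // Nodes already in the pipeline are never chosen as replacements.
        Set<String> excluded = new LinkedHashSet<>(pipeline);
        while (true) {
            // Pick the first datanode that has not been excluded yet.
            String candidate = null;
            for (String dn : cluster) {
                if (!excluded.contains(dn)) {
                    candidate = dn;
                    break;
                }
            }
            if (candidate == null) {
                // Every datanode has been tried or excluded.
                throw new IOException("Failed to replace a bad datanode on the "
                        + "existing pipeline due to no more good datanodes "
                        + "being available to try.");
            }
            tried.add(candidate);
            if (transferOk.test(candidate)) {
                return candidate;
            }
            // Assumption of this sketch: the failure is blamed on the
            // candidate, so it is excluded and the loop moves on.
            excluded.add(candidate);
        }
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("dn1", "dn2", "dn3", "dn4", "dn5");
        List<String> pipeline = List.of("dn1", "dn2");
        List<String> tried = new ArrayList<>();
        try {
            // Every transfer fails, whichever candidate is chosen.
            replaceDatanode(cluster, pipeline, dn -> false, tried);
        } catch (IOException e) {
            System.out.println("tried " + tried + ", then: " + e.getMessage());
        }
    }
}
```

In this model, when the transfer fails for a reason unrelated to the chosen 
candidate, the loop walks through dn3, dn4 and dn5 and only then raises the 
exception, which matches the symptom in the stack trace above.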




