Benoit Sigoure created HDFS-17722:
-------------------------------------

             Summary: DataNode stuck in decommissioning on standby NameNode
                 Key: HDFS-17722
                 URL: https://issues.apache.org/jira/browse/HDFS-17722
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.3.6
            Reporter: Benoit Sigoure

When decommissioning a DataNode in our cluster, we observed a situation where the active NameNode had marked the DataNode as decommissioned, but the standby NameNode kept it in the decommissioning state indefinitely (we waited 8 hours) because it considered one block under-replicated (note: for this path the target replication factor is 2x).

The standby NameNode kept logging this in a loop:

{{2025-01-31 12:02:35,963 INFO BlockStateChange: Block: blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 1, Is Open File: false, Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false}}

Looking at the fsck report for this block, the active NameNode was reporting the following:

{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is DECOMMISSIONED
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is HEALTHY
{code}

Whereas the standby reports:

{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is DECOMMISSIONING
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is HEALTHY
{code}

The file itself is listed with a replication factor of 2:

{code:java}
hadoop@namenode-0:/$ hdfs dfs -ls /path/to/file
-rw-r--r--   2 hbase supergroup 32453388896 2025-01-02 16:15 /path/to/file
{code}

After restarting the standby NameNode, the problem disappeared and the DataNode in question transitioned to the decommissioned state as expected.
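
For anyone trying to spot the same divergence on their own cluster, here is a minimal sketch that compares the replica counters of two fsck block reports. The counter values are copied from the active and standby reports in this issue; the names {{ACTIVE}}, {{STANDBY}}, {{parse_counters}} and {{diff_counters}} are hypothetical helpers introduced for illustration and are not part of any Hadoop tooling. It assumes the "No. of ... Replica: N" line format shown above.

```python
import re

# Replica counters copied from the fsck reports in this issue.
ACTIVE = """\
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
"""

STANDBY = """\
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
"""

# Matches lines such as "No. of live Replica: 2" (format assumed from the report above).
COUNTER = re.compile(r"^No\. of (.+?) Replica: (\d+)$", re.MULTILINE)


def parse_counters(report):
    """Return a dict mapping counter name (e.g. 'live') to its integer value."""
    return {name: int(value) for name, value in COUNTER.findall(report)}


def diff_counters(active, standby):
    """Return {counter: (active_value, standby_value)} for counters that differ."""
    a = parse_counters(active)
    s = parse_counters(standby)
    return {
        name: (a.get(name), s.get(name))
        for name in sorted(a.keys() | s.keys())
        if a.get(name) != s.get(name)
    }


if __name__ == "__main__":
    for name, (a, s) in diff_counters(ACTIVE, STANDBY).items():
        print(f"{name}: active={a} standby={s}")
```

With the values from this issue, the counters that differ are live, excess, decommissioned and decommissioning: the standby counts one of the healthy replicas as excess rather than live, and still counts the replica on the DataNode being decommissioned as decommissioning.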