[jira] [Created] (HDFS-17722) DataNode stuck in decommissioning on standby NameNode

Benoit Sigoure (Jira) Fri, 31 Jan 2025 08:21:06 -0800

Benoit Sigoure created HDFS-17722:
-------------------------------------

             Summary: DataNode stuck in decommissioning on standby NameNode
                 Key: HDFS-17722
                 URL: https://issues.apache.org/jira/browse/HDFS-17722
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.3.6
            Reporter: Benoit Sigoure



While decommissioning a DataNode in our cluster, we observed that the active 
NameNode had marked the DataNode as decommissioned, but the standby NameNode 
kept it in the decommissioning state indefinitely (we waited 8 hours) because 
it considered one block under-replicated. The target replication factor for 
this path is 2. The standby NameNode kept logging the following in a loop:

{{2025-01-31 12:02:35,963 INFO BlockStateChange: Block: 
blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, corrupt 
replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
maintenance replicas: 0, live entering maintenance replicas: 0, replicas on 
stale nodes: 0, readonly replicas: 0, excess replicas: 1, Is Open File: false, 
Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 
10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, Is current datanode 
decommissioning: true, Is current datanode entering maintenance: false}}
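
To make the counters in that log line easier to read, here is a minimal Python sketch that splits the line into its numeric fields and lists the DataNodes holding the block. The function names `parse_replica_counts` and `hosting_datanodes` are hypothetical helpers written for this illustration, not part of Hadoop, and the parsing assumes the exact field layout of the line quoted above.

```python
# Log line from the standby NameNode, rejoined into a single string.
LOG_LINE = (
    "2025-01-31 12:02:35,963 INFO BlockStateChange: Block: "
    "blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, "
    "corrupt replicas: 0, decommissioned replicas: 0, "
    "decommissioning replicas: 1, maintenance replicas: 0, "
    "live entering maintenance replicas: 0, replicas on stale nodes: 0, "
    "readonly replicas: 0, excess replicas: 1, Is Open File: false, "
    "Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 "
    "10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, "
    "Is current datanode decommissioning: true, "
    "Is current datanode entering maintenance: false"
)


def parse_replica_counts(line):
    """Collect every 'name: <integer>' field that follows 'Block: '."""
    _, _, tail = line.partition("Block: ")
    counts = {}
    for field in tail.split(", "):
        name, sep, value = field.partition(": ")
        value = value.strip()
        # Skip the block id, booleans, and host:port lists.
        if sep and value.isdigit():
            counts[name.strip().lower()] = int(value)
    return counts


def hosting_datanodes(line):
    """Return the host:port entries listed as holding the block."""
    _, _, tail = line.partition("Datanodes having this block: ")
    nodes, _, _ = tail.partition(", Current Datanode")
    return nodes.split()


if __name__ == "__main__":
    counts = parse_replica_counts(LOG_LINE)
    nodes = hosting_datanodes(LOG_LINE)
    print("datanodes holding the block:", len(nodes))
    for name in ("expected replicas", "live replicas",
                 "decommissioning replicas", "excess replicas"):
        print(f"{name}: {counts[name]}")
```

Read this way, the standby sees three DataNodes holding the block but counts them as 1 live, 1 decommissioning, and 1 excess replica, which leaves it one live replica short of the expected 2.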

In the fsck report for this block, the active NameNode reported the 
following:
{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
DECOMMISSIONED
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
HEALTHY
{code}
Whereas the standby NameNode reported:
{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
DECOMMISSIONING
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
HEALTHY
{code}
For reference, the file listing shows a replication factor of 2:
{code:java}
hadoop@namenode-0:/$ hdfs dfs -ls /path/to/file
-rw-r--r-- 2 hbase supergroup 32453388896 2025-01-02 16:15 /path/to/file
{code}
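
The two fsck reports above can be compared mechanically. The following Python sketch is only an illustration: `parse_fsck_counts` and `diff_counts` are hypothetical helpers, and the regular expression assumes the `No. of <name> Replica: <n>` layout shown in the reports. It extracts the counters from each report and prints the ones on which the active and standby NameNodes disagree.

```python
import re

# Counter lines copied from the active NameNode's fsck report.
ACTIVE_REPORT = """\
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
"""

# Counter lines copied from the standby NameNode's fsck report.
STANDBY_REPORT = """\
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
"""

COUNTER_RE = re.compile(r"^No\. of (.+?) Replica: (\d+)$", re.MULTILINE)


def parse_fsck_counts(report):
    """Map each counter name (e.g. 'live') to its integer value."""
    return {m.group(1): int(m.group(2)) for m in COUNTER_RE.finditer(report)}


def diff_counts(active, standby):
    """Return {name: (active_value, standby_value)} for differing counters."""
    return {
        name: (active[name], standby[name])
        for name in active
        if name in standby and active[name] != standby[name]
    }


if __name__ == "__main__":
    active = parse_fsck_counts(ACTIVE_REPORT)
    standby = parse_fsck_counts(STANDBY_REPORT)
    for name, (a, s) in diff_counts(active, standby).items():
        print(f"{name}: active={a} standby={s}")
```

The counters that differ are live (2 vs. 1), excess (0 vs. 1), decommissioned (1 vs. 0), and decommissioning (0 vs. 1). The expected, stale, and corrupted counts match on both NameNodes.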
After restarting the standby NameNode, the problem disappeared: the DataNode 
in question transitioned to the decommissioned state as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
