shanyu zhao created HDFS-15451:
----------------------------------

             Summary: Restarting name node stuck in safe mode when using provided storage
                 Key: HDFS-15451
                 URL: https://issues.apache.org/jira/browse/HDFS-15451
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.1.3, 3.2.1
            Reporter: shanyu zhao


When HDFS provided storage is used (dfs.namenode.provided.enabled=true), 
restarting the NameNode sometimes leaves it stuck in safe mode.

The problem is that the DataNodes send their block reports to the NameNode 
successfully, but the NameNode does not process them properly, so HDFS remains 
in safe mode because of missing blocks.
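As a rough illustration of why a few unprocessed reports are enough to block startup, here is a sketch that models only the block-ratio condition for leaving safe mode. Assumptions: the threshold is 0.999 (the hdfs-default.xml default of dfs.namenode.safemode.threshold-pct), a block counts as "safe" once enough replicas have been reported, and the rounding of the threshold is a simplification. The real NameNode also checks minimum live DataNodes and a safe-mode extension period. SafeModeSketch and canLeaveSafeMode are hypothetical names, not Hadoop APIs.

```java
/**
 * Simplified model of the NameNode safe-mode block threshold check.
 * Hypothetical class: only the block-ratio condition is modeled.
 */
public final class SafeModeSketch {

    // Default value of dfs.namenode.safemode.threshold-pct in hdfs-default.xml.
    public static final double DEFAULT_THRESHOLD_PCT = 0.999;

    /**
     * Returns true when enough blocks have been reported to leave safe mode.
     *
     * @param safeBlocks   blocks with enough reported replicas to count as safe
     * @param totalBlocks  total blocks recorded in the namespace
     * @param thresholdPct fraction of blocks that must be safe
     */
    public static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks,
                                           double thresholdPct) {
        if (totalBlocks == 0) {
            return true; // nothing to wait for
        }
        // Rounding up is an assumption of this sketch.
        long needed = (long) Math.ceil(totalBlocks * thresholdPct);
        return safeBlocks >= needed;
    }

    public static void main(String[] args) {
        long total = 10_000;
        // Every DataNode's reports were processed: safe mode can be left.
        System.out.println(canLeaveSafeMode(10_000, total, DEFAULT_THRESHOLD_PCT));
        // One DataNode's reports were never processed, so its 2,000 blocks never count.
        System.out.println(canLeaveSafeMode(8_000, total, DEFAULT_THRESHOLD_PCT));
    }
}
```

In this model, with 2,000 of 10,000 blocks unreported the safe fraction is 0.8, far below 0.999, so the NameNode stays in safe mode no matter how long it waits.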

The NameNode log shows the following sequence for one specific DataNode:

{code}
2020-07-01 19:46:41,997 INFO blockmanagement.BlockReportLeaseManager: Registered DN af19d9e0-7b9b-45e0-9aa6-b2f404098084 (10.244.6.131:9866).
2020-07-01 19:46:42,012 DEBUG blockmanagement.BlockReportLeaseManager: Created a new BR lease 0x476aaae689ebbc01 for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084.  numPending = 4
2020-07-01 19:46:42,340 INFO BlockStateChange: BLOCK* processReport 0xcc610f42d0218cd9: discarded non-initial block report from DatanodeRegistration(10.244.6.131:9866, datanodeUuid=af19d9e0-7b9b-45e0-9aa6-b2f404098084, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-f49d3421-e04f-40b9-89ef-cf4fee73ad6a;nsid=497894240;c=1572548424451) because namenode still in startup phase
2020-07-01 19:46:42,648 WARN blockmanagement.BlockReportLeaseManager: BR lease 0x476aaae689ebbc01 is not valid for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084, because the DN is not in the pending set.
{code}

The root cause is in BlockManager's block report processing: while the NameNode 
is still in the startup phase, a report for a storage whose 
storageInfo.getBlockReportCount() > 0 is discarded, and the DataNode's lease is 
removed:
{code}
blockReportLeaseManager.removeLease(node);
{code}
This happens because every DataNode reports a DS-PROVIDED storage alongside its 
other storages (such as DISK). All DS-PROVIDED storages point to the same 
storageInfo, so when a second DataNode sends a block report that includes 
DS-PROVIDED, that storageInfo already has blockReportCount > 0. The report is 
discarded and the lease for that DataNode is removed, so processing of any later 
block report from the node fails in checkLease() with the message "BR lease ... 
is not valid".
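To make the sequence concrete, the following is a toy Java model of it. It assumes heavily simplified stand-ins: StorageInfo, LeaseManager, lookup() and processReport() here are hypothetical and are not Hadoop's actual classes or signatures, and the normal lease release after a completed full block report is not modeled. It shows how a single shared DS-PROVIDED StorageInfo lets the first DataNode's report raise blockReportCount, so the second DataNode's DS-PROVIDED report is discarded and its lease removed, after which its remaining reports fail the lease check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the failure; all names are hypothetical simplifications. */
public final class ProvidedLeaseSketch {

    /** Per-storage state; only the block report counter is modeled. */
    static final class StorageInfo {
        int blockReportCount = 0;
    }

    /** Minimal stand-in for a lease manager: the set of DNs holding a lease. */
    static final class LeaseManager {
        private final Set<String> pending = new HashSet<>();
        void createLease(String dn) { pending.add(dn); }
        void removeLease(String dn) { pending.remove(dn); }
        boolean checkLease(String dn) { return pending.contains(dn); }
    }

    final LeaseManager leases = new LeaseManager();
    // Regular storages get one StorageInfo per DataNode and storage ID...
    final Map<String, StorageInfo> storages = new HashMap<>();
    // ...but every DataNode's DS-PROVIDED report resolves to this single object.
    final StorageInfo sharedProvided = new StorageInfo();

    StorageInfo lookup(String dn, String storageId) {
        if (storageId.equals("DS-PROVIDED")) {
            return sharedProvided;
        }
        return storages.computeIfAbsent(dn + "/" + storageId, k -> new StorageInfo());
    }

    /** Returns true if the report was processed, false if discarded or rejected. */
    boolean processReport(String dn, String storageId, boolean startupPhase) {
        if (!leases.checkLease(dn)) {
            return false; // corresponds to "BR lease ... is not valid"
        }
        StorageInfo info = lookup(dn, storageId);
        if (startupPhase && info.blockReportCount > 0) {
            // "discarded non-initial block report": skip it and drop the lease.
            leases.removeLease(dn);
            return false;
        }
        info.blockReportCount++;
        return true;
    }

    public static void main(String[] args) {
        ProvidedLeaseSketch nn = new ProvidedLeaseSketch();
        nn.leases.createLease("dn1");
        nn.leases.createLease("dn2");

        System.out.println(nn.processReport("dn1", "DS-PROVIDED", true)); // true
        System.out.println(nn.processReport("dn1", "DS-DISK-1", true));   // true
        // dn2's DS-PROVIDED hits the shared StorageInfo, already counted once.
        System.out.println(nn.processReport("dn2", "DS-PROVIDED", true)); // false
        // dn2's DISK report is now rejected because its lease is gone.
        System.out.println(nn.processReport("dn2", "DS-DISK-1", true));   // false
    }
}
```

In this model the blocks on dn2's DISK storage are never counted, which is the "missing blocks" condition that keeps the NameNode in safe mode.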



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
