[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoqiao He resolved HDFS-15451.
--------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

Thanks [~shanyu] for your report and contributions. Committed this PR to trunk.
Thanks [~virajith] for your reviews! Linked to https://github.com/apache/hadoop/pull/2119.

> Restarting name node stuck in safe mode when using provided storage
> -------------------------------------------------------------------
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.1, 3.1.3
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.4.0
>
>
> When HDFS provided storage is used (dfs.namenode.provided.enabled=true),
> restarting the name node sometimes leaves it stuck in safe mode.
> The problem is that the data node sends its block report to the name node
> successfully, but the name node does not process the report properly, so
> HDFS remains in safe mode due to missing blocks.
> Looking at the name node log, this is the sequence of log entries for a
> specific data node:
> {code}
> 2020-07-01 19:46:41,997 INFO blockmanagement.BlockReportLeaseManager: Registered DN af19d9e0-7b9b-45e0-9aa6-b2f404098084 (10.244.6.131:9866).
> 2020-07-01 19:46:42,012 DEBUG blockmanagement.BlockReportLeaseManager: Created a new BR lease 0x476aaae689ebbc01 for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084. numPending = 4
> 2020-07-01 19:46:42,340 INFO BlockStateChange: BLOCK* processReport 0xcc610f42d0218cd9: discarded non-initial block report from DatanodeRegistration(10.244.6.131:9866, datanodeUuid=af19d9e0-7b9b-45e0-9aa6-b2f404098084, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-f49d3421-e04f-40b9-89ef-cf4fee73ad6a;nsid=497894240;c=1572548424451) because namenode still in startup phase
> 2020-07-01 19:46:42,648 WARN blockmanagement.BlockReportLeaseManager: BR lease 0x476aaae689ebbc01 is not valid for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084, because the DN is not in the pending set.
> {code}
> The root cause is that when BlockManager processes a report, it skips
> processing when storageInfo.getBlockReportCount() > 0 and removes the lease:
> {code}
> blockReportLeaseManager.removeLease(node)
> {code}
> This happens because every data node reports a DS-PROVIDED storage along
> with its other storages (such as DISK storage). All DS-PROVIDED storages
> actually point to the same storageInfo, so the second data node that sends
> a block report with DS-PROVIDED sees blockReportCount > 0. The lease for
> that data node is then removed, and processing of subsequent block reports
> from this node fails at checkLease() with the message "BR lease is not valid".
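
The failure sequence described in the report can be reproduced with a small toy model. This is only a sketch under stated assumptions and is not Hadoop code: the class names (ProvidedStorageLeaseSketch, NameNodeModel, StorageInfo), the storage ID "DS-DISK", the data node names, and the order in which storages are reported are all hypothetical. The real BlockManager and BlockReportLeaseManager are considerably more involved. The model keeps only the two facts the report relies on: all DS-PROVIDED storages resolve to one shared record, and a report against a record whose count is already above zero is discarded and the lease removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the lease handling described in HDFS-15451; not Hadoop code. */
final class ProvidedStorageLeaseSketch {

    /** Hypothetical stand-in for a storage record that counts processed reports. */
    static final class StorageInfo {
        int blockReportCount = 0;
    }

    /** Hypothetical stand-in for the name node during startup. */
    static final class NameNodeModel {
        // Data nodes that currently hold a valid block report lease.
        private final Set<String> pendingLeases = new HashSet<>();
        // One record per non-provided storage, keyed by "dataNode/storageId".
        private final Map<String, StorageInfo> localStorages = new HashMap<>();
        // Every data node's DS-PROVIDED storage resolves to this single record.
        private final StorageInfo providedStorage = new StorageInfo();

        void grantLease(String dataNode) {
            pendingLeases.add(dataNode);
        }

        private StorageInfo lookup(String dataNode, String storageId) {
            if (storageId.equals("DS-PROVIDED")) {
                return providedStorage;
            }
            return localStorages.computeIfAbsent(
                    dataNode + "/" + storageId, key -> new StorageInfo());
        }

        /** Returns a short description of what happened to the report. */
        String processReport(String dataNode, String storageId) {
            if (!pendingLeases.contains(dataNode)) {
                return "lease not valid";
            }
            StorageInfo storage = lookup(dataNode, storageId);
            if (storage.blockReportCount > 0) {
                // Looks like a non-initial report: skip it and drop the lease.
                pendingLeases.remove(dataNode);
                return "discarded, lease removed";
            }
            storage.blockReportCount++;
            return "processed";
        }
    }

    /** Two data nodes each report a provided storage, then a disk storage. */
    static List<String> simulate() {
        NameNodeModel nameNode = new NameNodeModel();
        List<String> outcomes = new ArrayList<>();
        for (String dataNode : new String[] {"dn1", "dn2"}) {
            nameNode.grantLease(dataNode);
            for (String storageId : new String[] {"DS-PROVIDED", "DS-DISK"}) {
                outcomes.add(dataNode + " " + storageId + ": "
                        + nameNode.processReport(dataNode, storageId));
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model the first data node's reports are both processed. The second data node's DS-PROVIDED report is discarded because the shared record already has a count of 1, and its disk report is then rejected because the lease is gone. Those unrecorded disk blocks correspond to the missing blocks that keep the name node in safe mode.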