Xiaoqiao He created HDFS-15746: ---------------------------------- Summary: Standby NameNode crash when replay editlog Key: HDFS-15746 URL: https://issues.apache.org/jira/browse/HDFS-15746 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Xiaoqiao He Assignee: Xiaoqiao He
Standby NameNode meet NPE and crash when replay editlog, After dig log and source code, Not found the root cause. But some information may be useful for this case. a. before SBN crash, ANN do one lease recovery. {code:java} 2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is in progress. RecoveryId = 21696709510 for block blk_*_21658833701 {code} b. then one Datanode Volumn failed which manage one replica of blk_*_21658833701 after lease recovery. c. after half one hour, SBN crash because NPE as following. {code:java} 2020-12-23 13:13:36,703 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3, mtime=1608698268201, atime=1608343529481, blockSize=268435456, blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=$txid] java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455) at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297) 2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader java.nio.channels.ClosedChannelException at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034) 2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156 size 58762255 2020-12-23 13:13:36,704 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN. java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297) Caused by: java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455) at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244) ... 12 more {code} Not very clear about the relation between [lease recovery/volumn failed/sbn crash], but I think we should catch null when remove stale Replicas to avoid this fatal. Our production version is 2.*, and IMO this issue also exist at trunk. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org