Tao Yang created HDFS-15105: ------------------------------- Summary: Standby NN exits and fails to restart due to edit log corruption Key: HDFS-15105 URL: https://issues.apache.org/jira/browse/HDFS-15105 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.8.0 Reporter: Tao Yang
We found a issue that Standby NN exited and failed to restart until we resolved the edit log corruption. Error logs: {noformat} java.io.IOException: Mismatched block IDs or generation stamps, attempting to replace block blk_74288647857_73526148211 with blk_74288647857_73526377369 as block # 15/17 of /maindump/mainv10/dump_online/lasttable/20200105015500/part-319 at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1019) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:431) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:234) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:885) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:866) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:234) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:342) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:295) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:312) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:308) {noformat} Related edit log transactions of the same file: {noformat} 1. TXID=444341628498 time=1578251449632 OP_UPDATE_BLOCKS blocks: ... blk_74288647857_73526148211 blk_74454090866_73526215536 2. TXID=444342382774 time=1578251520740 OP_REASSIGN_LEASE 3. TXID=444342401216 time=1578251522779 OP_CLOSE blocks: ... blk_74288647857_73526377369 blk_74454090866_73526374095 4. TXID=444342401394 OP_SET_GENSTAMP_V2 generate stamp: 73526377369 !!!! this stamp is generated but already used in the previous edit log 5. TXID=444342401395 time=1578251522835 (03:12:02,835) OP_TRUNCATE 6. TXID=444342402176 time=1578251523246 (03:12:03,246) OP_CLOSE blocks: ... blk_74288647857_73526377369 {noformat} According to the edit logs, it's wield to see that stamp(73526377369) was generated in transaction 4 but already used in transaction 3, and for transaction 3 there should be only the last block changed but in fact the last two blocks are both changed. This problem might be produced in a complex scenario that truncate operation immediately followed the recover-lease operation for the same file. A suspicious point is that between creation and being written for transaction 3, stamp of the second last block was updated when committing block synchronization caused by the truncate operation. Related calling stack is as follows: {noformat} NameNodeRpcServer#commitBlockSynchronization FSNamesystem#commitBlockSynchronization // update last block if(!copyTruncate) { storedBlock.setGenerationStamp(newgenerationstamp); //updated now!!! storedBlock.setNumBytes(newlength); } {noformat} Any comments are welcome. Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org