ZanderXu created HDFS-16987:
-------------------------------

             Summary: HA Failover may cause some corrupted blocks
                 Key: HDFS-16987
                 URL: https://issues.apache.org/jira/browse/HDFS-16987
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: ZanderXu
            Assignee: ZanderXu

In our production environment, an HA failover produced some newly corrupted blocks, which caused some jobs to fail. We traced the problem to a bug in the processing of all pending DN messages when active services start.

Steps to reproduce:
# Suppose NN1 is Active and NN2 is Standby. The Active works well and the Standby is unstable.
# Timing 1: the client creates a file, writes some data, and closes it.
# Timing 2: the client appends to the file, writes some data, and closes it.
# Timing 3: the Standby replays the edits for the second close of the file.
# Timing 4: the Standby processes the blockReceivedAndDeleted of the first create operation.
# Timing 5: the Standby processes the blockReceivedAndDeleted of the second append operation.
# Timing 6: an admin switches the active NameNode from NN1 to NN2.
# Timing 7: the client fails to append more data to the file, with the following error:

{code:java}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
	at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
{code}
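
The timeline in the reproduction steps can be sketched with a small toy model. This is not HDFS code and is only one plausible reading of the report. The class and method names (PendingDnMessageSketch, ToyStandby, receiveReport, becomeActive) are hypothetical, the generation stamp 1001 for the first write is assumed (only 1002 appears in the error message), and the queue-then-apply behavior is an assumption about how pending DN messages might be handled, not a confirmed description of the NameNode.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the reported timeline; NOT HDFS code, all names are hypothetical. */
public class PendingDnMessageSketch {

    /** Hypothetical stand-in for one replica report sent by a DataNode. */
    static final class ReplicaReport {
        final long blockId;
        final long genStamp;

        ReplicaReport(long blockId, long genStamp) {
            this.blockId = blockId;
            this.genStamp = genStamp;
        }
    }

    /** Hypothetical stand-in for a Standby NameNode tracking a single replica. */
    static final class ToyStandby {
        long storedGenStamp;            // generation stamp learned from replayed edits
        boolean replicaRecorded;        // an up-to-date replica has been reported
        boolean replicaMarkedCorrupt;   // the replica was flagged as corrupt
        final Deque<ReplicaReport> pending = new ArrayDeque<>();

        void replayEdit(long genStamp) {
            storedGenStamp = genStamp;
        }

        /** Assumption: while Standby, a mismatching report is queued, not applied. */
        void receiveReport(ReplicaReport report) {
            if (report.genStamp != storedGenStamp) {
                pending.add(report);
            } else {
                replicaRecorded = true;
            }
        }

        /** Assumption: on failover, queued reports are applied without checking newer reports. */
        void becomeActive() {
            while (!pending.isEmpty()) {
                ReplicaReport report = pending.poll();
                if (report.genStamp < storedGenStamp) {
                    replicaMarkedCorrupt = true;   // stale report wins over the newer one
                } else if (report.genStamp == storedGenStamp) {
                    replicaRecorded = true;
                }
            }
        }
    }

    /** Runs Timings 3 to 6 and returns whether the replica ended up marked corrupt. */
    static boolean simulate() {
        long blockId = 1073741825L;
        ToyStandby nn2 = new ToyStandby();
        nn2.replayEdit(1002);                                 // Timing 3: edits of the second close
        nn2.receiveReport(new ReplicaReport(blockId, 1001));  // Timing 4: report from the create (assumed stamp)
        nn2.receiveReport(new ReplicaReport(blockId, 1002));  // Timing 5: report from the append
        nn2.becomeActive();                                   // Timing 6: failover to NN2
        return nn2.replicaMarkedCorrupt;
    }

    public static void main(String[] args) {
        System.out.println("replica marked corrupt after failover: " + simulate());
    }
}
```

Under these assumptions, the stale report queued at Timing 4 is applied at Timing 6 even though a newer report for the same replica arrived at Timing 5, which would leave the last block looking under-replicated when the client appends at Timing 7.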