ZanderXu created HDFS-16987:
-------------------------------

             Summary: HA Failover may cause some corrupted blocks
                 Key: HDFS-16987
                 URL: https://issues.apache.org/jira/browse/HDFS-16987
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: ZanderXu
            Assignee: ZanderXu


In our production environment, an HA failover produced some new corrupted 
blocks, which caused some jobs to fail.

We traced it down and found a bug in the processing of all pending DN messages 
when starting active services.

The steps to reproduce are as follows:
 # Suppose NN1 is Active and NN2 is Standby. The Active works well and the 
Standby is unstable.
 # Timing 1: the client creates a file, writes some data and closes it.
 # Timing 2: the client appends to this file, writes some data and closes it.
 # Timing 3: the Standby replays the edits for the second close of this file.
 # Timing 4: the Standby processes the blockReceivedAndDeleted of the first 
(create) operation.
 # Timing 5: the Standby processes the blockReceivedAndDeleted of the second 
(append) operation.
 # Timing 6: the admin switches the active namenode from NN1 to NN2.
 # Timing 7: the client fails to append some data to this file:

{code:java}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
{code}
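
For anyone reading the trace: the block name in the exception message has the shape blk_<blockId>_<generationStamp>, so blk_1073741825_1002 refers to block ID 1073741825 at generation stamp 1002. The small helper below only splits such a name so the two parts can be compared across logs. It is an illustrative sketch, not Hadoop code; the class name BlockName and its fields are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy helper (hypothetical, not part of Hadoop): splits a block name taken from a log line. */
final class BlockName {
    // Shape taken from the exception message above: blk_<blockId>_<generationStamp>.
    private static final Pattern NAME = Pattern.compile("blk_(-?\\d+)_(\\d+)");

    final long blockId;
    final long generationStamp;

    private BlockName(long blockId, long generationStamp) {
        this.blockId = blockId;
        this.generationStamp = generationStamp;
    }

    /** Parses a full block name; rejects anything that does not match the expected shape. */
    static BlockName parse(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected block name: " + name);
        }
        return new BlockName(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)));
    }
}
```

Calling BlockName.parse("blk_1073741825_1002") yields blockId 1073741825 and generationStamp 1002, which may help when matching the block in the exception against the block reports mentioned in the steps above.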


