Rushabh S Shah created HDFS-9558:
------------------------------------

             Summary: Replication requests always blame the source datanode in case of a ChecksumException
                 Key: HDFS-9558
                 URL: https://issues.apache.org/jira/browse/HDFS-9558
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Rushabh S Shah
Replication requests from a datanode (e.g. after a rack failure event) always blame the source datanode if any of the downstream nodes encounters a ChecksumException.

We saw this case recently in our cluster. We lost 7 nodes in a rack. There was only one replica of the block (say on dnA). The namenode asked dnA to replicate the block to dnB and dnC.
{noformat}
2015-12-13 21:09:41,798 [DataNode: heartbeating to NN:8020] INFO datanode.DataNode: DatanodeRegistration(dnA, datanodeUuid=bc1f183d-b74a-49c9-ab1a-d1d496ab77e9, infoPort=1006, infoSecurePort=0, ipcPort=8020, storageInfo=lv=-56;cid=CID-e7f736ac-158e-446e-9091-7e66f3cddf3c;nsid=358250775;c=1428471998571) Starting thread to transfer BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 to dnB:1004 dnC:1004
{noformat}
All the packets going out of dnB's network interface were getting corrupted. So dnC received a corrupt block and reported a bad block (attributed to dnA) to the namenode. Following are the logs from dnC:
{noformat}
2015-12-13 21:09:43,444 [DataXceiver for client at /dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] WARN datanode.DataNode: Checksum error in block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from /dnB:34879
org.apache.hadoop.fs.ChecksumException: Checksum error: at 58368 exp: -1657951272 got: 856104973
	at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
	at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:416)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:550)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:853)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2015-12-13 21:09:43,445 [DataXceiver for client at dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] INFO datanode.DataNode: report corrupt BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from datanode dnA:1004 to namenode
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
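The root of the misattribution is that a checksum mismatch only tells the receiver *that* the data is corrupt, not *which hop* corrupted it. A minimal standalone sketch of per-chunk CRC32 verification (plain java.util.zip, not HDFS code; all names here are illustrative, not from the DataNode source):

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC32 checksum over a data chunk, as the source (dnA) would
    // before forwarding the chunk down the pipeline.
    static long checksumOf(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    // A receiver verifies the chunk against the checksum forwarded with it.
    // A mismatch proves the data is corrupt, but carries no information about
    // whether dnA, the wire, or an intermediate node (dnB) corrupted it.
    static boolean verify(byte[] chunk, long expected) {
        return checksumOf(chunk) == expected;
    }

    public static void main(String[] args) {
        byte[] chunk = "block data".getBytes();
        long sum = checksumOf(chunk);              // computed at the source
        System.out.println(verify(chunk, sum));    // true: chunk intact

        chunk[0] ^= 0x1;                           // corruption in transit, e.g. at dnB's NIC
        System.out.println(verify(chunk, sum));    // false: dnC detects corruption,
                                                   // but cannot tell which hop caused it
    }
}
```

Because the verification result is hop-agnostic, the bad-block report in the log above defaults to blaming the first datanode in the pipeline (dnA) even when the corruption occurred downstream.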