PengZhang created HDFS-4660: ------------------------------- Summary: Duplicated checksum on DN in a recovered pipeline Key: HDFS-4660 URL: https://issues.apache.org/jira/browse/HDFS-4660 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.0.3-alpha Reporter: PengZhang Priority: Critical
pipeline DN1 DN2 DN3 stop DN2 Add a node DN4 in 2nd position in pipeline DN1 DN4 DN3 recover RBW DN4 after recover rbw 2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004 2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW getNumBytes() = 134144 getBytesOnDisk() = 134144 getVisibleLength()= 134144 end at chunk (134144/512=262) DN3 after recover rbw 2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW getNumBytes() = 134028 getBytesOnDisk() = 134028 getVisibleLength()= 134028 client send packet after recover pipeline offset=133632 len=1008 DN4 after flush 2013-04-01 21:02:31,779 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1063 // meta end position should be floor(134640/512)*4 + 7 == 1059, but now it is 1063. DN3 after flush 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, lastPacketInBlock=false, offsetInBlock=134640, ackEnqueueNanoTime=8817026136871545) 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing meta file offset of block BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 1055 to 1051 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1059 After checking meta on DN4, I found checksum of chunk 262 is duplicated, but data not. Later after block was finalized, DN4's scanner detected bad block, and then reported it to NM. NM send a command to delete this block, and replicate this block from other DN in pipeline to satisfy duplication num. I think this is because in BlockReceiver it skips data bytes already written, but not skips checksum bytes already written. And function adjustCrcFilePosition is only used for last non-completed chunk, but not for this situation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira