Ratandeep Ratti created HDFS-6681:
-------------------------------------

             Summary: TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky and sometimes gets stuck in infinite loops
                 Key: HDFS-6681
                 URL: https://issues.apache.org/jira/browse/HDFS-6681
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Ratandeep Ratti




This test case contains three infinite loops, each of which breaks only when a 
certain condition is satisfied.

The 1st loop waits for the live replica count to drop to one. It assumes this 
will happen because the test has just corrupted a replica on one of the 
datanodes (the test uses a replication factor of 2). One scenario in which this 
loop never breaks: the Namenode invalidates the corrupt replica, schedules a 
replication command, and the new copied replica is added, all before the test 
gets a chance to observe the intermediate live-replica count of one.

The 2nd loop waits for the live replica count to return to two. It assumes this 
will happen (in some time) because the first loop has broken, implying there is 
a single live replica, so it should only be a matter of time before the 
Namenode schedules a replication command to copy the replica to another 
datanode. One scenario in which this loop never breaks: the Namenode schedules 
the new replica on the same datanode where the block was corrupted. That 
destination datanode will not copy the block, complaining that it already has 
the (corrupted) replica in the create state. The result is that the Namenode 
has scheduled a copy, the block sits in the Namenode's pending replication 
queue, and it is never removed from that queue because the Namenode never 
receives a report from the datanode that the block was 'added'.

Note: The block can be moved from the 'pending replication' queue to the 
'needed replication' queue once the pending timeout (5 minutes) expires, and 
the Namenode then actively tries to schedule replication for blocks in the 
'needed replication' queue. This can eventually cause the 2nd loop to break, 
but only after more than 5 minutes have elapsed.

3rd loop: This loop checks that there are no corrupt replicas. I don't see a 
scenario in which this loop can spin forever, since once the live replica 
count returns to normal (2), the corrupted replica will be removed.

I guess increasing the heartbeat interval, so that the test has enough time to 
check the condition in loop 1 before a datanode reports a successful copy, 
should help avoid the race in loop 1. For loop 2, I guess we can reduce the 
timeout after which a block is moved from the pending replication queue to the 
needed replication queue.
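Independently of the timing tweaks above, each unbounded `while` loop could be replaced with a bounded poll so that a missed condition fails the test with a timeout instead of hanging it. Below is a minimal, self-contained sketch of that pattern (similar in spirit to Hadoop's GenericTestUtils.waitFor); the class and method names here are illustrative, not the actual test code.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Sketch of a bounded-wait helper: polls a condition at a fixed
// interval and throws TimeoutException instead of spinning forever.
public class BoundedWait {
  public static void waitFor(BooleanSupplier condition,
                             long checkEveryMillis,
                             long waitForMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    // Keep polling until the condition holds or the deadline passes.
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException(
            "Condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }
}
```

With this pattern, the scenarios described above (e.g. the replica count never reaching the expected value) would surface as a TimeoutException with a clear message rather than a build that hangs until the surefire timeout.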




--
This message was sent by Atlassian JIRA
(v6.2#6252)
