Extend UnderReplicatedBlocks queue to give blocks whose existing copies are all 
on a single rack priority over multi-rack blocks
--------------------------------------------------------------------------------------------------------------------------------

                 Key: HDFS-2472
                 URL: https://issues.apache.org/jira/browse/HDFS-2472
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: data-node
            Reporter: Steve Loughran
            Assignee: Steve Loughran
            Priority: Minor


Google's Availability in Globally Distributed Storage Systems paper 
[http://research.google.com/pubs/archive/36737.pdf]  shows that storage node 
failures are often correlated with other failures on the same rack. This could 
be due to rack-local failures: switches, power, etc, or operations actions 
(take down a whole rack for a rolling OS upgrade). Whatever the root cause, the 
paper argues (section 5.2) that rack-aware placement would increase the MTTF of 
a single block by a factor of three (typically). Some decisions can be made a 
block placement time, but that would be a separate issue. 

Here I propose giving priority to blocks that are under replicated and where 
all blocks are on the same rack above those blocks that are under-replicated 
and the remaining blocks are on different racks.

# Provided the failure does not take down the entire rack in one go (e.g. 
switch failure), this policy would decrease the time in which all blocks would 
be on a single rack, so reduce the consequences of a rack failure. 
# On a single-rack system, all under-replicated blocks would go into this 
queue, so the state would effectively be that of today's system.


This may make the demand for off-rack bandwidth ramp up immediately, because 
priority will be given to blocks that must be replicated off rack. I am not 
sure that it will, however, as multi-rack replication policies would generate 
the same amount of traffic to re-replicate the blocks anyway. 

The main barrier to implementing this feature is that currently the 
UnderReplicatedBlocks queues do not get provided with any information about 
block locality. We'd need to change the add() method to take information about 
where the current replicas are, and the BlockManager would have to provide some 
information as to whether the block was rack-local, not-rack local or unknown; 
the latter because the PendingReplicationBlocks structure does not know where 
things come from. "unknown" items would have to be given the same priority as 
rack-only blocks (pessimistic) or the same priority as multi-rack blocks 
(optimistic). I would be biased towards the pessimistic approach as it would 
ensure that on single-rack systems there would be no obvious change in 
behaviour. On a multi-rack system it would give priority to timed out 
PendingReplication blocks ahead of multi-rack under-replicated blocks. That may 
be a good thing in itself


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to