Extend UnderReplicatedBlocks queue to give blocks whose existing copies are all on a single rack priority over multi-rack blocks --------------------------------------------------------------------------------------------------------------------------------
Key: HDFS-2472 URL: https://issues.apache.org/jira/browse/HDFS-2472 Project: Hadoop HDFS Issue Type: Improvement Components: data-node Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Google's Availability in Globally Distributed Storage Systems paper [http://research.google.com/pubs/archive/36737.pdf] shows that storage node failures are often correlated with other failures on the same rack. This could be due to rack-local failures: switches, power, etc, or operations actions (take down a whole rack for a rolling OS upgrade). Whatever the root cause, the paper argues (section 5.2) that rack-aware placement would increase the MTTF of a single block by a factor of three (typically). Some decisions can be made a block placement time, but that would be a separate issue. Here I propose giving priority to blocks that are under replicated and where all blocks are on the same rack above those blocks that are under-replicated and the remaining blocks are on different racks. # Provided the failure does not take down the entire rack in one go (e.g. switch failure), this policy would decrease the time in which all blocks would be on a single rack, so reduce the consequences of a rack failure. # On a single-rack system, all under-replicated blocks would go into this queue, so the state would effectively be that of today's system. This may make the demand for off-rack bandwidth ramp up immediately, because priority will be given to blocks that must be replicated off rack. I am not sure that it will, however, as multi-rack replication policies would generate the same amount of traffic to re-replicate the blocks anyway. The main barrier to implementing this feature is that currently the UnderReplicatedBlocks queues do not get provided with any information about block locality. We'd need to change the add() method to take information about where the current replicas are, and the BlockManager would have to provide some information as to whether the block was rack-local, not-rack local or unknown; the latter because the PendingReplicationBlocks structure does not know where things come from. "unknown" items would have to be given the same priority as rack-only blocks (pessimistic) or the same priority as multi-rack blocks (optimistic). I would be biased towards the pessimistic approach as it would ensure that on single-rack systems there would be no obvious change in behaviour. On a multi-rack system it would give priority to timed out PendingReplication blocks ahead of multi-rack under-replicated blocks. That may be a good thing in itself -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira