HDFS-15 scalability improvements
--------------------------------

                 Key: HDFS-1851
                 URL: https://issues.apache.org/jira/browse/HDFS-1851
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: name-node
    Affects Versions: 0.21.1, 0.22.0, 0.23.0
            Reporter: Eli Collins


While discussing HDFS-15 with Todd he made the observation that 
BlockManager#isReplicationNeeded is expense at scale because it iterates over 
the list of all datanodes. If this method is in fact called frequently enough 
to be measurable, there are some places we can probably help. 

There are a couple of places where we can remove calls to isReplicationNeeded:
* In processPendingReplications, we can unconditionally add the block to 
neededReplications (w/o checking isReplicationNeeded) because a block wouldn't 
be in pendingReplications (which is where the timed out blocks come from) if it 
didn't need replication.
* BlockManager#blockHasEnoughRacks could be modified to only call 
getNumberOfRacks conditionally for the replFactor=1 case.

There are some simple improvements we could make to the method itself, eg:
* Allocate the hash set with an initial size of 2 rather than 0 because we'd 
expect a block to be available on 2 racks. ArrayList(3) might be a better 
option here since n is small. 
* It also doesn't need to check rackSet.contains before calling rackSet.add 
since rackSet is a set. 
* Keep a count numUniqueRacks (if (rackSet.add(name)) numUniqueRacks++) rather 
than call Set#size.

A couple other things I noticed while looking at the code:
* The logic in isReplicationInProgress that checks curReplicas < 
curExpectedReplicas right after isNeededReplication (a more stringent check) 
needs validating/a comment. I think the intent is to only count the block as 
under replicated only if it has an insufficient total # replicas (vs not being 
on enough racks) because we do not want to prevent the given DN from being 
decomissioned just because its blocks are not replicated across enough racks 
(which may not even be possible if there's 1 rack and a topology script is 
configured). Ie a DN could never be comissioned if a topology script is enabled 
and there's just 1 rack. Needs a test.
* A better name for shouldCheckForEnoughRacks would be isMultiRack, and 
blockNeedsReplication instead of isNeededReplication
* We should warn if a topology script is configured and only 1 rack is used, 
because blocks will never leave neededReplications in this case all blocks will 
always stay in the neededReplications queue, which is correct but may slow 
things down or consume substantial much memory.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to