HDFS-15 scalability improvements -------------------------------- Key: HDFS-1851 URL: https://issues.apache.org/jira/browse/HDFS-1851 Project: Hadoop HDFS Issue Type: Improvement Components: name-node Affects Versions: 0.21.1, 0.22.0, 0.23.0 Reporter: Eli Collins
While discussing HDFS-15 with Todd he made the observation that BlockManager#isReplicationNeeded is expense at scale because it iterates over the list of all datanodes. If this method is in fact called frequently enough to be measurable, there are some places we can probably help. There are a couple of places where we can remove calls to isReplicationNeeded: * In processPendingReplications, we can unconditionally add the block to neededReplications (w/o checking isReplicationNeeded) because a block wouldn't be in pendingReplications (which is where the timed out blocks come from) if it didn't need replication. * BlockManager#blockHasEnoughRacks could be modified to only call getNumberOfRacks conditionally for the replFactor=1 case. There are some simple improvements we could make to the method itself, eg: * Allocate the hash set with an initial size of 2 rather than 0 because we'd expect a block to be available on 2 racks. ArrayList(3) might be a better option here since n is small. * It also doesn't need to check rackSet.contains before calling rackSet.add since rackSet is a set. * Keep a count numUniqueRacks (if (rackSet.add(name)) numUniqueRacks++) rather than call Set#size. A couple other things I noticed while looking at the code: * The logic in isReplicationInProgress that checks curReplicas < curExpectedReplicas right after isNeededReplication (a more stringent check) needs validating/a comment. I think the intent is to only count the block as under replicated only if it has an insufficient total # replicas (vs not being on enough racks) because we do not want to prevent the given DN from being decomissioned just because its blocks are not replicated across enough racks (which may not even be possible if there's 1 rack and a topology script is configured). Ie a DN could never be comissioned if a topology script is enabled and there's just 1 rack. Needs a test. * A better name for shouldCheckForEnoughRacks would be isMultiRack, and blockNeedsReplication instead of isNeededReplication * We should warn if a topology script is configured and only 1 rack is used, because blocks will never leave neededReplications in this case all blocks will always stay in the neededReplications queue, which is correct but may slow things down or consume substantial much memory. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira