Ming Ma created HDFS-7128:
-----------------------------

             Summary: Decommission slows way down when it gets towards the end
                 Key: HDFS-7128
                 URL: https://issues.apache.org/jira/browse/HDFS-7128
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


When we decommission nodes across different racks, the decommission process 
becomes very slow towards the end and makes hardly any progress. The problem is 
that some blocks have their replicas on 3 decomm-in-progress DNs, and the way 
replications are scheduled causes unnecessary delay. Here is the analysis.

When BlockManager schedules the replication work from neededReplication, it 
first needs to pick the source node for replication via chooseSourceDatanode. 
The core policies to pick the source node are:

1. Prefer decomm-in-progress node.

2. Only pick the nodes whose outstanding replication counts are below 
thresholds dfs.namenode.replication.max-streams or 
dfs.namenode.replication.max-streams-hard-limit, based on the replication 
priority.
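
The two policies above can be sketched roughly as follows. This is a simplified 
model, not the actual chooseSourceDatanode code: the Node class, the 
chooseSource method and the numeric limits are hypothetical, and the sketch 
assumes that the hard limit applies to highest-priority work while the lower 
limit applies to everything else.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the source-node policy; not the real HDFS classes.
class SourceSelectionSketch {
    // Hypothetical stand-in for a datanode descriptor.
    static final class Node {
        final String name;
        final boolean decommissioning;
        final int blocksToBeReplicated; // outstanding replication count

        Node(String name, boolean decommissioning, int blocksToBeReplicated) {
            this.name = name;
            this.decommissioning = decommissioning;
            this.blocksToBeReplicated = blocksToBeReplicated;
        }
    }

    // Returns the chosen source, or null if every replica holder is at or
    // over its threshold.
    static Node chooseSource(List<Node> replicaHolders, boolean highestPriority,
                             int maxStreams, int maxStreamsHardLimit) {
        // Assumption: highest-priority work is bounded only by the hard limit.
        int limit = highestPriority ? maxStreamsHardLimit : maxStreams;
        Node fallback = null;
        for (Node n : replicaHolders) {
            if (n.blocksToBeReplicated >= limit) {
                continue; // policy 2: too many outstanding replications
            }
            if (n.decommissioning) {
                return n; // policy 1: prefer a decomm-in-progress node
            }
            if (fallback == null) {
                fallback = n; // simplification: first eligible live node
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Node a = new Node("A", true, 0);
        Node b = new Node("B", false, 0);
        // Illustrative limits: 2 (soft) and 4 (hard).
        System.out.println(chooseSource(Arrays.asList(b, a), false, 2, 4).name); // prints "A"

        Node busyA = new Node("A", true, 4000);
        System.out.println(chooseSource(Arrays.asList(busyA, b), false, 2, 4).name); // prints "B"
    }
}
```

When every replica holder is over its limit, the sketch returns null, which 
corresponds to the situation where no source node can be chosen.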


When we decommission nodes:

1. All blocks on the decommissioning nodes are added to neededReplication.

2. BM picks X blocks from neededReplication in each iteration. X is based on 
the cluster size and a configurable multiplier. So if the cluster has 2000 
nodes, X will be around 4000.

3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up 
being chosen as the source node for all 4000 blocks. The outstanding 
replication thresholds don't kick in because of the implementation of 
BlockManager.computeReplicationWorkForBlocks: 
node.getNumberOfBlocksToBeReplicated() remains zero, since 
node.addBlockToBeReplicated is called only after the source node iteration.

{noformat}
...
      synchronized (neededReplications) {
        for (int priority = 0; priority < blocksToReplicate.size(); priority++) {
...
chooseSourceDatanode
...
        }


      for(ReplicationWork rw : work){
...
          rw.srcNode.addBlockToBeReplicated(block, targets);
...
      }
{noformat}
 
4. So several decomm-in-progress nodes A, B, C each end up with a 
node.getNumberOfBlocksToBeReplicated() of 4000.

5. If we assume each node can replicate 5 blocks per minute, it is going to 
take 800 minutes to finish replicating these blocks.

6. The pending replication timeout kicks in after 5 minutes. The items are 
removed from the pending replication queue and added back to neededReplication. 
The replications are then handled by other source nodes of these blocks. 
But the blocks still remain in A's, B's and C's pending replication queues, 
DatanodeDescriptor.replicateBlocks, so A, B, C continue replicating these 
blocks, even though they might already have been replicated by other DNs after 
the replication timeout.

7. Suppose some block's replicas exist only on A, B, C, and the block is at the 
end of A's pending replication queue. Even though the block's replication times 
out, no source node can be chosen, given that A, B, C all have high pending 
replication counts. So we have to wait until A drains its pending replication 
queue. Meanwhile, the other items in A's pending replication queue have already 
been taken care of by other nodes and are no longer under-replicated.
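
The effect described in steps 2 to 5 can be reproduced with a toy model. This 
is only a sketch under stated assumptions: the class and method names are 
hypothetical, the multiplier of 2 is inferred from the "2000 nodes, around 
4000 blocks" figure above, and the threshold of 2 is an illustrative value. The 
"counter updated during selection" variant is included purely for contrast and 
is not meant as the fix for this issue.

```java
// Toy model of one scheduling iteration for a single source node A.
class DeferredCounterSketch {
    // Returns how many of `blocks` blocks get A as their source.
    // `outstanding` models node.getNumberOfBlocksToBeReplicated().
    static int blocksAssignedToA(int blocks, int maxStreams,
                                 boolean updateCounterDuringSelection) {
        int outstanding = 0;
        int assigned = 0;
        for (int i = 0; i < blocks; i++) {
            if (outstanding < maxStreams) { // threshold check at selection time
                assigned++;
                if (updateCounterDuringSelection) {
                    outstanding++; // contrast case: counter visible to next check
                }
                // Otherwise the counter stays at zero for the whole pass, as
                // when addBlockToBeReplicated runs only after the selection loop.
            }
        }
        return assigned;
    }

    // Minutes needed to drain a queue at a fixed replication rate (rounded up).
    static int minutesToDrain(int queued, int blocksPerMinute) {
        return (queued + blocksPerMinute - 1) / blocksPerMinute;
    }

    public static void main(String[] args) {
        int x = 2000 * 2; // cluster size times assumed multiplier
        int deferred = blocksAssignedToA(x, 2, false);
        int eager = blocksAssignedToA(x, 2, true);
        System.out.println(deferred);                    // prints 4000
        System.out.println(eager);                       // prints 2
        System.out.println(minutesToDrain(deferred, 5)); // prints 800
    }
}
```

With the deferred update, the threshold check never sees a non-zero counter, so 
all 4000 blocks land on A and take 800 minutes to drain at 5 blocks per minute.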



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
