Stephen O'Donnell created HDFS-14861:
----------------------------------------

             Summary: Reset LowRedundancyBlocks Iterator periodically
                 Key: HDFS-14861
                 URL: https://issues.apache.org/jira/browse/HDFS-14861
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
    Affects Versions: 3.3.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


When the namenode needs to schedule blocks for reconstruction, the blocks are 
placed into the neededReconstruction object in the BlockManager. This is an 
instance of LowRedundancyBlocks, which maintains a list of priority queues 
where the blocks are held until they are scheduled for reconstruction / 
replication.
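
For context, the structure is essentially an ordered list of priority queues, 
with priority 0 the most urgent. A minimal sketch of that shape in Java (the 
class name, number of levels and set implementation here are illustrative, 
not the actual Hadoop source):

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of a priority-queue list like LowRedundancyBlocks.
// Priority 0 holds the most urgent blocks (e.g. a single live replica).
class PriorityQueues<B> {
  private final List<Set<B>> queues = new ArrayList<>();

  PriorityQueues(int levels) {
    for (int i = 0; i < levels; i++) {
      queues.add(new LinkedHashSet<>()); // preserves insertion order
    }
  }

  void add(B block, int priority) {
    queues.get(priority).add(block);
  }

  void remove(B block, int priority) {
    queues.get(priority).remove(block);
  }
}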

Every 3 seconds, by default, a number of blocks are retrieved from 
LowRedundancyBlocks. The method LowRedundancyBlocks.chooseLowRedundancyBlocks() 
is used to retrieve the next set of blocks using a bookmarked iterator. Each 
call to this method moves the iterator forward. The number of blocks retrieved 
is governed by the formula:

number_of_live_nodes * dfs.namenode.replication.work.multiplier.per.iteration
(the multiplier defaults to 2)
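
As a worked example (the cluster size below is hypothetical):

// With 100 live datanodes and the default multiplier of 2, each
// 3-second redundancy cycle asks LowRedundancyBlocks for 200 blocks.
int liveDatanodes = 100;  // hypothetical cluster size
int workMultiplier = 2;   // dfs.namenode.replication.work.multiplier.per.iteration
int blocksToProcess = liveDatanodes * workMultiplier;  // = 200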

The namenode then attempts to schedule those blocks on datanodes, but each 
datanode has a limit on how many blocks can be queued against it (controlled 
by dfs.namenode.replication.max-streams), so not all of the retrieved blocks 
may get scheduled. Blocks may also be skipped for other availability reasons, 
e.g. no valid source replica being available.
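
A hedged sketch of that throttle (the Scheduler class and the pendingOnSource 
callback are illustrative, not the real BlockManager API):

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrates why some retrieved blocks are skipped: a block is only
// scheduled if its source node has fewer than maxStreams transfers
// already queued (dfs.namenode.replication.max-streams, default 2).
class Scheduler<B> {
  private final int maxStreams;

  Scheduler(int maxStreams) {
    this.maxStreams = maxStreams;
  }

  // Returns the blocks that could not be scheduled this round; they
  // stay in the queue, but behind the bookmark.
  List<B> schedule(List<B> chosen, ToIntFunction<B> pendingOnSource) {
    List<B> skipped = new ArrayList<>();
    for (B block : chosen) {
      if (pendingOnSource.applyAsInt(block) >= maxStreams) {
        skipped.add(block);
      } else {
        // ... issue reconstruction work for this block
      }
    }
    return skipped;
  }
}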

As the iterator in chooseLowRedundancyBlocks() always moves forward, the blocks 
which were not scheduled are not retried until the end of the queue is reached 
and the iterator is reset.
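
To make that concrete, here is a simplified sketch of a forward-only 
bookmarked cursor (BookmarkedQueue is illustrative, not the real 
chooseLowRedundancyBlocks() implementation):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// A cursor that resumes where the previous call stopped and only
// returns to the head once it has walked the whole queue. Blocks the
// cursor has already passed are not seen again until that reset.
class BookmarkedQueue<B> {
  private final List<B> queue;
  private Iterator<B> bookmark;

  BookmarkedQueue(List<B> queue) {
    this.queue = queue;
    this.bookmark = queue.iterator();
  }

  // Fetch up to n blocks, continuing from the bookmark.
  List<B> chooseNext(int n) {
    List<B> out = new ArrayList<>();
    while (out.size() < n) {
      if (!bookmark.hasNext()) {
        resetBookmark(); // end of the pass reached
        break;
      }
      out.add(bookmark.next());
    }
    return out;
  }

  void resetBookmark() {
    bookmark = queue.iterator();
  }
}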

If the replication queue is very large (e.g. several nodes are being 
decommissioned), or if blocks are continuously being added to the queue (e.g. 
when nodes decommission using the proposal in HDFS-14854), it may take a very 
long time for the iterator to be reset to the start.

As a result, a few blocks belonging to a node that is decommissioning or 
entering maintenance mode can be left behind, and it may take many hours or 
even days for them to be retried, which can stop decommission from completing.

With this Jira, I would like to suggest we reset the iterator after a 
configurable number of calls to chooseLowRedundancyBlocks(), so that any 
left-behind blocks are retried.
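
Building on the BookmarkedQueue sketch above, a minimal sketch of the idea 
(the threshold and a configuration key to drive it, e.g. something like 
dfs.namenode.redundancy.queue.restart.iterations, are assumptions for 
illustration, not a committed design):

// Count the calls and force the bookmark back to the head every N
// calls, so blocks skipped earlier in the pass are retried without
// waiting for a full walk of the queue.
class ResettableChooser<B> {
  private final BookmarkedQueue<B> queue;
  private final int resetEveryNCalls; // hypothetical, configurable
  private int callsSinceReset = 0;

  ResettableChooser(BookmarkedQueue<B> queue, int resetEveryNCalls) {
    this.queue = queue;
    this.resetEveryNCalls = resetEveryNCalls;
  }

  java.util.List<B> choose(int blocksToProcess) {
    if (++callsSinceReset >= resetEveryNCalls) {
      queue.resetBookmark();
      callsSinceReset = 0;
    }
    return queue.chooseNext(blocksToProcess);
  }
}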



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
