Consider redesign of block report processing
--------------------------------------------

                 Key: HDFS-1667
                 URL: https://issues.apache.org/jira/browse/HDFS-1667
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: name-node
    Affects Versions: 0.22.0
            Reporter: Matt Foley
            Assignee: Matt Foley


The current implementation of block reporting has the datanode send the entire 
list of blocks resident on that node in every report, hourly.  
BlockManager.processReport in the namenode runs each block report through 
reportDiff() first, to build four linked lists of replicas for different 
dispositions, then processes each list.  During that process, every block 
belonging to the node (even the unchanged blocks) are removed and re-linked in 
that node's blocklist.  The entire process happens under the global 
FSNamesystem write lock, so there is essentially zero concurrency.  It takes 
about 90 milliseconds to process a single 50,000-replica block report in the 
namenode, during which no other read or write activities can occur.

There are several opportunities for improvement in this design.  In order of 
probable benefit, they include:
1. Change the datanode to send a differential report.  This saves the namenode 
from having to do most of the work in reportDiff(), and avoids the need to 
re-link all the unchanged blocks during the "diff" calculation.
2. Keep the linked lists of "to do" work, but modify reportDiff() so that it 
can be done under a read lock instead of the write lock.  Then only the 
processing of the lists needs to be under the write lock.  Since the number of 
blocks changed is usually much smaller than the number unchanged, this should 
improve concurrency.
3. Eliminate the linked lists and just immediately process each changed block 
as it is read from the block report.  The work on HDFS-1295 indicates that this 
might save a large fraction of the total block processing time at scale, due to 
the much smaller number of objects created and garbage collected during 
processing of hundreds of millions of blocks.
4. As a sidelight to #3, remove linked list use from BlockManager.addBlock().  
It currently uses linked lists as an argument to processReportedBlock() even 
though addBlock() only processes a single replica on each call.  This should be 
replaced with a call to immediately process the block instead of enqueuing it.



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to