Improved usability around node decommissioning and block replication on 
dfshealth.jsp
-------------------------------------------------------------------------------------

                 Key: HDFS-2849
                 URL: https://issues.apache.org/jira/browse/HDFS-2849
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: documentation, name-node
    Affects Versions: 0.20.2
            Reporter: Jeff Bean


When you do this:

    - Decommission a single node.
    - The under-replicated count reports all of the node's blocks.
    - Stop the decommission.
    - The under-replicated count drops slowly and heads to 0.

This is expected HDFS behavior, but while it is happening, utilities like
dfshealth.jsp and fsck report high numbers of under-replicated blocks, and the
node is not on the dead/decommissioned nodes list. It is therefore unclear to
novice administrators and HDFS newbies whether or not this is a failure
condition that needs administrative attention.

Administrators find themselves constantly having to explain the
under-replication number when they could be doing better things with their
time. They also constantly get alarms that can be disregarded, raising
fears of a "cry wolf" problem in which a real issue gets lost in the noise.

A direct quote from such an administrator:

"When a datanode fails, it's not considered a 'decommissioning', so it does not 
show up in that list, it just simply kicks on the underrep and we have to hunt 
through the LIVE list and attempt to find out which node caused the issue. 
Obviously, we (the community) are not being told on the DEAD list when a node 
appears (why this information has to be withheld has always been an issue with 
me, how hard is it to put a date field in the DEAD list?)"

Nevertheless, we should have more information about a dying node instead of
seeing a jump in the underrep count from 0 to millions with no obvious
reason. Perhaps add another column saying 'DYING NODE'; anything would help.
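
To make the request concrete, here is a minimal Python sketch of the kind of
breakdown that would help: instead of one under-replicated total, the count is
split by likely cause. This is a toy model, not NameNode code. The Replica
class, the node_state values ("live", "decommissioning", "dead"), the cause
labels, and the target replication of 3 are all assumptions made for
illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Replica:
    node: str
    # Hypothetical state labels for this sketch: "live", "decommissioning", "dead".
    node_state: str


def classify_block(replicas: List[Replica], target: int = 3) -> Optional[str]:
    """Return a likely cause of under-replication, or None if the block is healthy."""
    live = [r for r in replicas if r.node_state == "live"]
    if len(live) >= target:
        return None  # enough live replicas, nothing to report
    states = {r.node_state for r in replicas}
    if "dead" in states:
        return "dead node"
    if "decommissioning" in states:
        return "decommissioning"
    return "other"


def summarize(blocks: Dict[str, List[Replica]], target: int = 3) -> Dict[str, int]:
    """Count under-replicated blocks per cause instead of one opaque total."""
    counts: Counter = Counter()
    for replicas in blocks.values():
        cause = classify_block(replicas, target)
        if cause is not None:
            counts[cause] += 1
    return dict(counts)


# Small made-up example: one block touched by a decommission, one healthy,
# one that lost a replica to a dead node.
example = {
    "blk_1": [Replica("a", "live"), Replica("b", "live"), Replica("c", "decommissioning")],
    "blk_2": [Replica("a", "live"), Replica("b", "live"), Replica("d", "live")],
    "blk_3": [Replica("a", "live"), Replica("d", "live"), Replica("e", "dead")],
}
print(summarize(example))
```

With a breakdown like this, a jump caused by a decommission would be
distinguishable at a glance from one caused by a failed datanode.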


        
