Wei-Chiu Chuang created HDFS-10777:
--------------------------------------

             Summary: DataNode should report & remove volume failures if DU 
cannot access files
                 Key: HDFS-10777
                 URL: https://issues.apache.org/jira/browse/HDFS-10777
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang


HADOOP-12973 refactored DU and made it pluggable. The refactoring has a 
side effect: if DU encounters an exception, the exception is caught, logged, 
and ignored. This essentially fixes HDFS-9908 (in which runaway exceptions 
prevented DataNodes from handshaking with NameNodes).
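The swallowed-exception pattern described above can be sketched roughly as 
follows (class and method names here are hypothetical placeholders, not the 
actual HADOOP-12973 code):

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical sketch of the post-HADOOP-12973 behavior: the DU
// implementation catches any I/O error, logs it, and carries on, so a
// failing disk never surfaces beyond the log.
class DiskUsageSketch {
  private static final Logger LOG = Logger.getLogger("DU");
  private long cachedUsed = 0;

  long refresh() {
    try {
      cachedUsed = computeUsed(); // may throw if the disk is bad
    } catch (IOException e) {
      // The exception is logged and ignored -- the caller never learns
      // that the volume may have failed.
      LOG.warning("Could not refresh disk usage: " + e.getMessage());
    }
    return cachedUsed;
  }

  private long computeUsed() throws IOException {
    // Placeholder for a real "du"-style scan of the volume.
    return 0L;
  }
}
```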

However, this "fix" is not good: if the disk is bad, the DataNode takes no 
immediate action other than logging the exception. The existing 
{{FsDatasetSpi#checkDataDir}} has been reduced to blindly checking a small 
number of directories. When a disk goes bad, often only a few files are bad 
initially, so checking only a small number of directories makes it easy to 
overlook the degraded disk.

I propose that, in addition to logging the exception, the DataNode should 
proactively verify that the files are inaccessible, remove the volume, and 
make the failure visible via JMX, so that administrators can spot it through 
their monitoring systems.
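The proposed handling could look roughly like the sketch below (all names 
are hypothetical illustrations, not actual HDFS APIs): on a DU failure, 
re-verify accessibility of the volume, record the failure, and expose the 
failed-volume list so a JMX bean can publish it.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: when DU reports an exception, probe the volume
// directly; if it is unreadable, mark it failed and keep the failure
// visible for a JMX attribute that monitoring systems can poll.
class VolumeFailureSketch {
  private final List<String> failedVolumes = new ArrayList<>();

  void onDuException(File volumeDir, IOException cause) {
    if (!isAccessible(volumeDir)) {
      failedVolumes.add(volumeDir.getPath());
      // A real implementation would also remove the volume from the
      // active set so the NameNode can re-replicate its blocks.
    }
  }

  private boolean isAccessible(File dir) {
    // Minimal accessibility probe: the directory must still be
    // readable and listable.
    return dir.canRead() && dir.list() != null;
  }

  // Exposed (e.g. as a JMX attribute) so admins can spot failures.
  List<String> getFailedVolumes() {
    return new ArrayList<>(failedVolumes);
  }
}
```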

A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
