Nikola Vujic created HDFS-5846:
----------------------------------

             Summary: Assigning DEFAULT_RACK in resolveNetworkLocation method 
can break data resiliency
                 Key: HDFS-5846
                 URL: https://issues.apache.org/jira/browse/HDFS-5846
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Nikola Vujic
            Assignee: Nikola Vujic


Method CachedDNSToSwitchMapping::resolve() can return NULL, which requires 
careful handling. NULL can be returned in two cases:
• An error occurs during topology script execution (for example, the script 
crashes).
• The script returns a number of values other than the number expected.
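
The two cases above can be modelled with a small sketch. This is not the 
actual CachedDNSToSwitchMapping code: the class name NullOnFailureResolver 
and the use of a Function as a stand-in for the topology script are 
assumptions made only for illustration.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical model of a resolver that signals failure by returning null.
class NullOnFailureResolver {
    // Assumed stand-in for the topology script: host names in, rack paths out.
    private final Function<List<String>, List<String>> script;

    NullOnFailureResolver(Function<List<String>, List<String>> script) {
        this.script = script;
    }

    List<String> resolve(List<String> names) {
        List<String> racks;
        try {
            racks = script.apply(names);
        } catch (RuntimeException e) {
            // Case 1: the script failed during execution.
            return null;
        }
        if (racks == null || racks.size() != names.size()) {
            // Case 2: the script returned an unexpected number of values.
            return null;
        }
        return racks;
    }
}
```

In both cases the caller sees only a null and cannot tell from the return 
value which failure occurred.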

The critical handling is in the DN registration code, which is responsible 
for assigning proper topology paths to all registered datanodes. The existing 
code handles this NULL in the following way ({{resolveNetworkLocation}} 
method):
{code}
    // resolve its network location
    List<String> rName = dnsToSwitchMapping.resolve(names);
    String networkLocation;
    if (rName == null) {
      LOG.error("The resolve call returned null! Using " + 
          NetworkTopology.DEFAULT_RACK + " for host " + names);
      networkLocation = NetworkTopology.DEFAULT_RACK;
    } else {
      networkLocation = rName.get(0);
    }
    return networkLocation;
{code}

The line of code that assigns the default rack:
{code} networkLocation = NetworkTopology.DEFAULT_RACK; {code} 
can cause a serious problem. If we somehow get NULL, the default rack is 
assigned as the DN's network location and the DN's registration finishes 
successfully. Under these circumstances, we are able to load data into a 
cluster that is working with a wrong topology. A wrong topology means that 
fault domains are not honored. 

For the end user, this means that two data replicas can end up in the same 
fault domain, and a single failure can cause the loss of two or more replicas. 
The cluster would be in an inconsistent state without being aware of it, and 
everything would appear to work normally. Practically the only way to notice 
that something went wrong is to look in the log for the error:
{code}
LOG.error("The resolve call returned null! Using " + 
NetworkTopology.DEFAULT_RACK + " for host " + names);
{code}
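
One possible alternative, sketched below, is to fail the lookup instead of 
silently falling back to the default rack, so that the caller can refuse the 
registration. This is only an illustration of the idea, not a proposed patch: 
the names StrictRackResolver, Mapping and UnresolvedRackException are 
hypothetical, and the interface merely mirrors the resolve(names) call shown 
earlier.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical fail-fast variant of resolveNetworkLocation.
class StrictRackResolver {
    // Assumed minimal mapping interface; mirrors the resolve(names) call above.
    interface Mapping {
        List<String> resolve(List<String> names);
    }

    // Hypothetical exception raised when no rack can be determined for a host.
    static class UnresolvedRackException extends Exception {
        UnresolvedRackException(String message) {
            super(message);
        }
    }

    private final Mapping mapping;

    StrictRackResolver(Mapping mapping) {
        this.mapping = mapping;
    }

    String resolveNetworkLocation(String host) throws UnresolvedRackException {
        List<String> rName = mapping.resolve(Collections.singletonList(host));
        if (rName == null || rName.isEmpty()) {
            // Surface the failure to the caller rather than assigning a default rack.
            throw new UnresolvedRackException(
                "Could not resolve network location for host " + host);
        }
        return rName.get(0);
    }
}
```

With this approach a failed resolution becomes visible at registration time 
rather than only in the log. The trade-off is that a datanode cannot join 
while the topology script is broken.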
 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
