Nikola Vujic created HDFS-5846:
----------------------------------
Summary: Assigning DEFAULT_RACK in resolveNetworkLocation method
can break data resiliency
Key: HDFS-5846
URL: https://issues.apache.org/jira/browse/HDFS-5846
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Nikola Vujic
Assignee: Nikola Vujic
Method CachedDNSToSwitchMapping::resolve() can return NULL, which requires
careful handling. NULL can be returned in two cases:
• An error occurred during topology script execution (the script crashes).
• The script returns a wrong number of values (other than expected).
The critical handling is in the DN registration code, which is responsible
for assigning proper topology paths to all registered datanodes. The existing
code handles this NULL value in the following way
({{resolveNetworkLocation}} method):
{code}
// resolve its network location
List<String> rName = dnsToSwitchMapping.resolve(names);
String networkLocation;
if (rName == null) {
  LOG.error("The resolve call returned null! Using " +
      NetworkTopology.DEFAULT_RACK + " for host " + names);
  networkLocation = NetworkTopology.DEFAULT_RACK;
} else {
  networkLocation = rName.get(0);
}
return networkLocation;
{code}
The line of code that assigns the default rack:
{code} networkLocation = NetworkTopology.DEFAULT_RACK; {code}
can cause a serious problem. If resolve() returns NULL for any reason, the
default rack is assigned as the DN's network location and the DN's
registration finishes successfully. Under these circumstances, we are able to
load data into a cluster that is working with a wrong topology. A wrong
topology means that fault domains are not honored.
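One way to avoid the silent fallback is to fail the registration when the rack cannot be resolved. The following is a minimal sketch of that idea, not the actual HDFS code: the {{Resolver}} interface, the {{UnresolvedTopologyException}} class, and the {{rejectUnresolved}} flag are hypothetical names introduced here for illustration, and {{DEFAULT_RACK}} is a local stand-in for {{NetworkTopology.DEFAULT_RACK}}.

```java
import java.util.Collections;
import java.util.List;

public class RackResolutionSketch {
    /** Local stand-in for NetworkTopology.DEFAULT_RACK. */
    static final String DEFAULT_RACK = "/default-rack";

    /** Hypothetical stand-in for the DNS-to-switch mapping; may return null on failure. */
    interface Resolver {
        List<String> resolve(List<String> names);
    }

    /** Hypothetical exception signalling that no rack could be determined. */
    static class UnresolvedTopologyException extends RuntimeException {
        UnresolvedTopologyException(String msg) {
            super(msg);
        }
    }

    /**
     * Resolves the rack for one host. When resolution fails and
     * rejectUnresolved is true, the caller gets an exception instead of
     * a silently assigned default rack.
     */
    static String resolveNetworkLocation(Resolver resolver, String host,
                                         boolean rejectUnresolved) {
        List<String> names = Collections.singletonList(host);
        List<String> rName = resolver.resolve(names);
        if (rName == null || rName.isEmpty()) {
            if (rejectUnresolved) {
                throw new UnresolvedTopologyException(
                    "Unresolved topology mapping for host " + host);
            }
            // Legacy behaviour described in this issue: fall back to the default rack.
            return DEFAULT_RACK;
        }
        return rName.get(0);
    }
}
```

With {{rejectUnresolved}} set to false the sketch behaves like the existing code; with it set to true a failed resolution surfaces as an error at registration time rather than as a log line.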
For the end user, it means that two data replicas can end up in the same fault
domain, and a single failure can cause the loss of two or more replicas. The
cluster would be in an inconsistent state without being aware of it, and
everything would appear to work as if nothing were wrong. The only practical
way to notice the problem is to look in the log for the error:
{code}
LOG.error("The resolve call returned null! Using " +
    NetworkTopology.DEFAULT_RACK + " for host " + names);
{code}
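The effect on fault domains can be illustrated with a small sketch. It is not HDFS code: the class, the method name {{distinctRacks}}, and the host and rack names are made up for illustration. It counts the racks covered by a set of replicas, once using the topology the cluster registered and once using the physical layout.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FaultDomainSketch {
    /**
     * Counts how many distinct racks the given replica hosts cover,
     * according to the supplied host-to-rack view.
     */
    static int distinctRacks(List<String> replicaHosts,
                             Map<String, String> hostToRack) {
        Set<String> racks = new HashSet<>();
        for (String host : replicaHosts) {
            racks.add(hostToRack.get(host));
        }
        return racks.size();
    }
}
```

Suppose dn1 and dn2 both sit physically in /rack1, but resolution failed for dn2, so it was registered under /default-rack. For replicas on dn1 and dn2, the registered view reports two racks while the physical view reports one, so a failure of /rack1 takes out both replicas.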