Jian Fang created HDFS-8693:
-------------------------------
Summary: refreshNamenodes does not support adding a new standby to
a running DN
Key: HDFS-8693
URL: https://issues.apache.org/jira/browse/HDFS-8693
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Jian Fang
Priority: Critical
After replacing one name node with a new one on a Hadoop 2.6.0 cluster with HA
support, I ran the following command to refresh the name node list on the data
nodes so that I would not need to restart them:
$ hdfs dfsadmin -refreshNamenodes datanode-host:port
However, I got the following error:
refreshNamenodes: HA does not currently support adding a new standby to a
running DN. Please do a rolling restart of DNs to reconfigure the list of NNs.
I checked the 2.6.0 code and the error was thrown by the following code
snippet, which led me to this JIRA.
  void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
    Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
    for (BPServiceActor actor : bpServices) {
      oldAddrs.add(actor.getNNSocketAddress());
    }
    Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);

    if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
      // Keep things simple for now -- we can implement this at a later date.
      throw new IOException(
          "HA does not currently support adding a new standby to a running DN. " +
          "Please do a rolling restart of DNs to reconfigure the list of NNs.");
    }
  }
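
For illustration, the check above can be reproduced outside Hadoop with plain
java.util sets. The sketch below is not Hadoop code: the class NNListDiff, its
method names, and the host names and port are all hypothetical. It only shows
that replacing one name node yields a non-empty "added" set and a non-empty
"removed" set, so the symmetric difference is non-empty and the refresh is
rejected. Computing the two differences separately is one conceivable starting
point for an incremental refresh; that is an assumption, not a description of
any existing implementation.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: mirrors the set comparison
// that refreshNNList performs, using only the JDK.
public class NNListDiff {

    // Addresses present in newAddrs but not in oldAddrs.
    public static Set<InetSocketAddress> added(Set<InetSocketAddress> oldAddrs,
                                               Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(newAddrs);
        result.removeAll(oldAddrs);
        return result;
    }

    // Addresses present in oldAddrs but not in newAddrs.
    public static Set<InetSocketAddress> removed(Set<InetSocketAddress> oldAddrs,
                                                 Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(oldAddrs);
        result.removeAll(newAddrs);
        return result;
    }

    // True when the symmetric difference is non-empty, i.e. the case in
    // which the 2.6.0 snippet above throws an IOException.
    public static boolean wouldBeRejected(Set<InetSocketAddress> oldAddrs,
                                          Set<InetSocketAddress> newAddrs) {
        return !added(oldAddrs, newAddrs).isEmpty()
            || !removed(oldAddrs, newAddrs).isEmpty();
    }

    public static void main(String[] args) {
        // Made-up host names and port; createUnresolved avoids DNS lookups.
        InetSocketAddress nn1 = InetSocketAddress.createUnresolved("nn1.example.com", 8020);
        InetSocketAddress nn2 = InetSocketAddress.createUnresolved("nn2.example.com", 8020);
        InetSocketAddress nn3 = InetSocketAddress.createUnresolved("nn3.example.com", 8020);

        // nn2 has been replaced by nn3.
        Set<InetSocketAddress> oldAddrs = new HashSet<>(Arrays.asList(nn1, nn2));
        Set<InetSocketAddress> newAddrs = new HashSet<>(Arrays.asList(nn1, nn3));

        System.out.println("added:    " + added(oldAddrs, newAddrs));
        System.out.println("removed:  " + removed(oldAddrs, newAddrs));
        System.out.println("rejected: " + wouldBeRejected(oldAddrs, newAddrs));
    }
}
```

With an unchanged list, both sets are empty and wouldBeRejected returns false,
which matches the only case the 2.6.0 code accepts.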
It looks like the refreshNamenodes command is an incomplete feature.
Unfortunately, bringing up a new name node on a replacement instance is
critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without
this support, the HA feature cannot really be used. I also observed that the
new standby name node on the replacement instance can get stuck in safe mode
because no data nodes check in with it. Even with a rolling restart, it may
take quite some time to restart all data nodes in a big cluster, for example
one with 4000 data nodes. Restarting DNs is also far too intrusive and is not a
preferable operation in production. It also increases the chance of a double
failure, because the standby name node is not ready for a failover if the
current active name node fails.