Exceptions in DataXceiver#run can result in a zombie datanode
--------------------------------------------------------------
Key: HDFS-2182
URL: https://issues.apache.org/jira/browse/HDFS-2182
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Reporter: Eli Collins
Fix For: 0.23.0
DataXceiver#run currently swallows all exceptions. It should instead propagate
them up to DataXceiverServer#run, which can then decide whether the exception
should be tolerated or the daemon should exit. An IOException should be
tolerated, as it is today, because it's likely just an issue with a particular
thread or an intermittent failure, but e.g. a java.lang.Error should not be.
This came up in the following bug I'm seeing on a test cluster: if, e.g., a
NoClassDefFoundError is thrown in DataXceiver#run (because the host jars were
replaced out from underneath it, it ran out of file descriptors, etc.), we end
up with a datanode that is alive but always fails because it can't create any
DataXceiver threads. In this case the datanode should shut itself down rather
than continue to run.
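
A rough sketch of the policy described above, not the actual patch. The class
XceiverErrorPolicy, the Work interface, and the serve/classify methods are all
hypothetical names made up for illustration; the real change would live in
DataXceiver and DataXceiverServer. The issue only names IOException and
java.lang.Error, so tolerating every other exception here is an assumption.

```java
import java.io.IOException;

public class XceiverErrorPolicy {
    /** Hypothetical decision made by the server loop for a failed unit of work. */
    enum Action { TOLERATE, SHUT_DOWN }

    /** Hypothetical stand-in for the body of one DataXceiver#run invocation. */
    interface Work {
        void run() throws Exception;
    }

    /**
     * An Error (e.g. NoClassDefFoundError) is treated as fatal.
     * IOException is tolerated, as the issue asks. Tolerating all other
     * exceptions is an assumption of this sketch.
     */
    static Action classify(Throwable t) {
        if (t instanceof Error) {
            return Action.SHUT_DOWN;
        }
        return Action.TOLERATE;
    }

    /**
     * Stand-in for the DataXceiverServer#run loop: failures are propagated
     * here instead of being swallowed by the worker. Returns false if the
     * loop stopped because of a fatal error, true if all work was attempted.
     */
    static boolean serve(Work... work) {
        for (Work w : work) {
            try {
                w.run();
            } catch (Throwable t) {
                if (classify(t) == Action.SHUT_DOWN) {
                    return false; // the daemon should exit rather than limp along
                }
                // Tolerated failure: keep serving other requests.
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean survivedIoe = serve(() -> { throw new IOException("disk hiccup"); });
        boolean survivedError = serve(() -> { throw new NoClassDefFoundError("Foo"); });
        System.out.println("after IOException, still serving: " + survivedIoe);
        System.out.println("after NoClassDefFoundError, still serving: " + survivedError);
    }
}
```

The point of the design is that the decision is made in one place, in the
server loop, rather than in each worker thread.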