[ https://issues.apache.org/jira/browse/HDFS-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo Nicholas Sze resolved HDFS-2420.
---------------------------------------
    Resolution: Not a Problem

I guess that this is not a problem anymore. Please feel free to reopen this if I am wrong.

Resolving ...

> improve handling of datanode timeouts
> -------------------------------------
>
>                 Key: HDFS-2420
>                 URL: https://issues.apache.org/jira/browse/HDFS-2420
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ron Bodkin
>
> If a datanode ever times out on a heart beat, it gets marked dead
> permanently. I am finding that on AWS this is a periodic occurrence, i.e.,
> datanodes time out although the datanode process is still alive. The current
> solution to this is to kill and restart each such process independently.
> It would be good if there were more retry logic (e.g., blacklisting the nodes
> but try heartbeats for a longer period before determining they are apparently
> dead). It would also be good if refreshNodes would check and attempt to
> recover timed out data nodes.

--
This message was sent by Atlassian JIRA
(v6.2#6252)