[ https://issues.apache.org/jira/browse/HDFS-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chen Zhang reopened HDFS-14652:
-------------------------------

missed properties in core-default.xml

> HealthMonitor connection retry times should be configurable
> -----------------------------------------------------------
>
>                 Key: HDFS-14652
>                 URL: https://issues.apache.org/jira/browse/HDFS-14652
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14652-001.patch, HDFS-14652-002.patch
>
>
> On our production HDFS cluster, burst requests from some clients filled the
> kernel TCP queue on the NameNode's host. Because the value of
> "net.ipv4.tcp_syn_retries" in our environment is 1, after 3 seconds the
> ZooKeeper HealthMonitor got a connection error like this:
> {code:java}
> WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to
> monitor health of NameNode at nn_host_name/ip_address:port: Call From
> zkfc_host_name/ip to nn_host_name:port failed on connection exception:
> java.net.ConnectException: Connection timed out; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
> This error caused a failover and affected the availability of that cluster.
> We fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries
> to 6.
> While working on this issue, we found that the connection retry count
> (ipc.client.connect.max.retries) of the health monitor is hard-coded as 1. I
> think it should be configurable, so that if we don't want the health monitor
> to be so sensitive, we can change its behavior by changing this
> configuration.
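
To illustrate the request in the description, here is a minimal, self-contained sketch of a health probe whose connection retry count comes from configuration instead of being hard-coded. It is not the attached patch and does not use Hadoop's classes. The class name, the property key "example.health-monitor.connect.max.retries" and the retry semantics (one initial attempt plus up to N retries) are assumptions made for illustration only. The default of 1 mirrors the hard-coded value mentioned in the description.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Properties;

class ConnectRetrySketch {
    // Hypothetical key name, for illustration only; it is not taken from the issue.
    static final String MAX_RETRIES_KEY = "example.health-monitor.connect.max.retries";
    // The issue says the value was hard-coded as 1, so 1 is kept as the default.
    static final int MAX_RETRIES_DEFAULT = 1;

    /** A health probe that may fail at the transport level. */
    interface HealthCheck {
        void run() throws IOException;
    }

    /** Reads the retry count from the configuration, falling back to the default. */
    static int maxRetries(Properties conf) {
        String raw = conf.getProperty(MAX_RETRIES_KEY);
        if (raw == null) {
            return MAX_RETRIES_DEFAULT;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(MAX_RETRIES_KEY + " must be >= 0");
        }
        return value;
    }

    /**
     * Assumed semantics: one initial attempt plus up to maxRetries retries.
     * Returns the number of attempts used, or -1 if every attempt failed.
     */
    static int attemptsUntilHealthy(HealthCheck check, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            try {
                check.run();
                return attempt;
            } catch (IOException e) {
                // Connection failure: loop again if attempts remain.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_RETRIES_KEY, "3");
        int[] calls = {0};
        // Simulated probe: the first two connection attempts time out.
        HealthCheck flaky = () -> {
            if (++calls[0] <= 2) {
                throw new ConnectException("Connection timed out");
            }
        };
        int attempts = attemptsUntilHealthy(flaky, maxRetries(conf));
        System.out.println(attempts == -1
            ? "unhealthy"
            : "healthy after " + attempts + " attempt(s)");
    }
}
```

In this sketch, a return value of -1 stands for the point at which the monitor would report the node as unhealthy. With the default of 1, two consecutive connection timeouts are enough to reach it, while a larger configured value tolerates a longer burst.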