Zhe Zhang created HADOOP-13657:
----------------------------------

Summary: IPC Reader thread could silently die and leave NameNode unresponsive
Key: HADOOP-13657
URL: https://issues.apache.org/jira/browse/HADOOP-13657
Project: Hadoop Common
Issue Type: Bug
Components: ipc
Reporter: Zhe Zhang
Priority: Critical

For each listening port, the IPC {{Server#Listener#Reader}} is a single thread in charge of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}.

We experienced an incident in which the {{Reader}} thread of an HDFS NameNode died from a runtime exception. With nothing left to drain it, the {{pendingConnections}} queue filled up and the NameNode port became inaccessible.

In our particular case, what killed the {{Reader}} was an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. In general, though, any other runtime exception could cause the same problem.

We should add logic to either make the {{Reader}} more robust against runtime exceptions, or at least treat such an exception as FATAL so that the NameNode can fail over to the standby and admins are alerted to the real issue.
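
The two options proposed above can be illustrated with a minimal, self-contained sketch. This is not Hadoop's actual {{Server#Listener#Reader}} code: the class {{ResilientReaderSketch}}, its nested {{Reader}}, the {{fatalHandler}} callback, and the use of plain {{Runnable}} items in place of {{Connection}} objects are all hypothetical names chosen for illustration. The sketch assumes that a per-item {{RuntimeException}} can be safely logged and skipped, and that anything else escaping the loop should be escalated.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch only; not Hadoop's Server.Listener.Reader. */
class ResilientReaderSketch {

  /** Single thread draining a bounded queue of pending work items. */
  static final class Reader implements Runnable {
    private final BlockingQueue<Runnable> pending;
    private final Consumer<Throwable> fatalHandler; // assumed escalation hook
    final AtomicInteger processed = new AtomicInteger();
    final AtomicInteger failed = new AtomicInteger();
    private volatile boolean running = true;

    Reader(BlockingQueue<Runnable> pending, Consumer<Throwable> fatalHandler) {
      this.pending = pending;
      this.fatalHandler = fatalHandler;
    }

    void stop() {
      running = false;
    }

    @Override
    public void run() {
      try {
        while (running) {
          // Poll with a timeout so the loop can notice stop().
          Runnable item = pending.poll(50, TimeUnit.MILLISECONDS);
          if (item == null) {
            continue;
          }
          try {
            item.run();
            processed.incrementAndGet();
          } catch (RuntimeException e) {
            // Option 1: record the failure and keep the thread alive.
            failed.incrementAndGet();
            System.err.println("Reader skipped a bad item: " + e);
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // normal shutdown path
      } catch (Throwable t) {
        // Option 2: anything unexpected (for example an Error) is fatal,
        // so hand it to the escalation hook instead of dying silently.
        fatalHandler.accept(t);
      }
    }
  }

  /** Feeds three items, one of which throws; returns {processed, failed}. */
  static int[] demo() {
    BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(100);
    Reader reader = new Reader(pending, t -> System.err.println("FATAL: " + t));
    Thread thread = new Thread(reader, "reader-sketch");
    thread.start();
    try {
      pending.put(() -> { });
      pending.put(() -> { throw new NullPointerException("simulated"); });
      pending.put(() -> { });
      // Wait (with a deadline) until all three items have been handled.
      long deadline = System.currentTimeMillis() + 5000;
      while (reader.processed.get() + reader.failed.get() < 3
          && System.currentTimeMillis() < deadline) {
        Thread.sleep(10);
      }
      reader.stop();
      thread.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    }
    return new int[] { reader.processed.get(), reader.failed.get() };
  }

  public static void main(String[] args) {
    int[] r = demo();
    System.out.println("processed=" + r[0] + " failed=" + r[1]);
  }
}
```

In this sketch the simulated NPE is counted as a failure and the items queued after it are still processed, so the bounded queue keeps draining. In a real server the {{fatalHandler}} would presumably terminate the process so that failover can happen; here it only logs, which is an assumption made to keep the example runnable.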