[ https://issues.apache.org/jira/browse/KAFKA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287713#comment-14287713 ]
Alexey Ozeritskiy commented on KAFKA-1804:
------------------------------------------

We last saw the bug while restarting the network switch on a cluster of 20 machines. The kafka-network-threads died on more than half of the machines. As a result, the cluster became unavailable.
We are trying to find the specific steps that reproduce the problem.

> Kafka network thread lacks top exception handler
> ------------------------------------------------
>
>                 Key: KAFKA-1804
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1804
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.2
>            Reporter: Oleg Golovin
>            Priority: Critical
>
> We have run into a problem where some Kafka network threads fail, so that
> jstack attached to the Kafka process shows fewer threads than we defined in
> our Kafka configuration. As a result, API requests handled by a failed thread
> get stuck unanswered.
> There were no error messages in the log regarding the thread failure.
> We examined the Kafka code and found that there is no top-level try-catch
> block in the network thread code, which could at least log possible errors.
> Could you add a top-level try-catch block to the network thread, which would
> recover the network thread in case of an exception?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
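
For illustration, the top-level try-catch requested in the issue description might look roughly like the sketch below. This is not Kafka's actual network thread code: `GuardedLoop`, `Step`, and `failureCount` are hypothetical names invented for the example, and `Step.runOnce()` merely stands in for one iteration of a network thread's work. The sketch assumes that logging the exception and continuing the loop is an acceptable way to recover.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper: runs a loop body repeatedly and keeps the thread
// alive when an iteration throws, logging the error instead of dying silently.
class GuardedLoop implements Runnable {

    // Hypothetical stand-in for one iteration of a network thread's work.
    // Returns false when the loop should stop.
    interface Step {
        boolean runOnce() throws Exception;
    }

    private static final Logger LOG = Logger.getLogger(GuardedLoop.class.getName());

    private final Step step;
    private final AtomicInteger failures = new AtomicInteger();

    GuardedLoop(Step step) {
        this.step = step;
    }

    // Number of iterations that ended with an exception.
    int failureCount() {
        return failures.get();
    }

    @Override
    public void run() {
        boolean keepGoing = true;
        while (keepGoing) {
            try {
                keepGoing = step.runOnce();
            } catch (Exception e) {
                // Top-level handler: record and log the failure, then continue
                // with the next iteration so the thread does not disappear.
                failures.incrementAndGet();
                LOG.log(Level.SEVERE,
                        "Uncaught exception in " + Thread.currentThread().getName()
                                + ", continuing", e);
            }
        }
    }
}
```

The wrapper would be started like any other `Runnable`, for example `new Thread(new GuardedLoop(step), "some-thread-name").start()`. The sketch deliberately catches `Exception` rather than `Throwable`, so errors such as `OutOfMemoryError` still terminate the thread; installing a handler with the standard `Thread.setUncaughtExceptionHandler` would at least log those as well.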