Aaditya Ramesh created KAFKA-2553:
-------------------------------------

             Summary: Kafka Consumer Hangs after Network Partition
                 Key: KAFKA-2553
                 URL: https://issues.apache.org/jira/browse/KAFKA-2553
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.8.1.1
         Environment: Amazon EC2, Ubuntu 12.04.
            Reporter: Aaditya Ramesh
            Assignee: Neha Narkhede
         Attachments: kafka_bug_report

We have a Kafka consumer on an EC2 instance in Ireland that fetches data from a Kafka cluster in a data center in the eastern United States. We sporadically encounter network partitions during which machines in one data center cannot ping any machine in the other (the pings always time out). Partitions of this kind are not unusual for inter-data center connections.

However, the Kafka consumer's connection to ZooKeeper never recovers after one of these network hiccups. Even once the network has recovered, a full process restart is required before the consumer begins consuming from the remote data center again.

The relevant code in ZookeeperConsumerConnector.scala catches all Throwables and does nothing with them. As a result, the process is never alerted to the failure, and no metric is exposed that we could use to detect or diagnose the hung state. We therefore patched the client code in our codebase to call System.exit(0) whenever this occurs, since a restart is better than failing silently.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
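
To illustrate the failure mode described in the report (a catch-all handler that discards the Throwable), here is a minimal sketch in Java of the alternative behavior the reporter asks for: log the error, count it so it can be surfaced as a metric, and hand control to a caller-supplied handler after repeated failures. The class GuardedTask, its methods, and the failure threshold are hypothetical names introduced for illustration. They are not part of Kafka, and this is not the code in ZookeeperConsumerConnector.scala (which is Scala) nor the reporter's actual patch.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not a Kafka API): runs a task and makes failures
 * visible instead of silently swallowing them.
 */
final class GuardedTask {
    // Total number of failed attempts; could be exported as a metric.
    private final AtomicLong failures = new AtomicLong();
    private final int maxConsecutiveFailures;
    private final Consumer<Throwable> onFatal;
    private int consecutiveFailures = 0;

    GuardedTask(int maxConsecutiveFailures, Consumer<Throwable> onFatal) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.onFatal = onFatal;
    }

    /** Runs the task once and returns true if it completed without throwing. */
    synchronized boolean runOnce(Runnable task) {
        try {
            task.run();
            consecutiveFailures = 0; // a success resets the streak
            return true;
        } catch (Throwable t) {
            // Unlike an empty catch block, record and report the failure.
            failures.incrementAndGet();
            consecutiveFailures++;
            System.err.println("task failed (" + consecutiveFailures
                    + " in a row): " + t);
            // Invoked on every failure once the threshold is reached.
            if (consecutiveFailures >= maxConsecutiveFailures) {
                onFatal.accept(t);
            }
            return false;
        }
    }

    /** Total failures observed so far. */
    long failureCount() {
        return failures.get();
    }
}
```

A caller that wants the restart behavior described in the report could construct it as `new GuardedTask(3, t -> System.exit(1))`. The non-zero exit status here is an assumption of this sketch, chosen so that a process supervisor can tell the exit apart from a clean shutdown; the report itself uses System.exit(0).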