[ https://issues.apache.org/jira/browse/KAFKA-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaditya Ramesh updated KAFKA-2553:
----------------------------------
    Attachment: kafka_bug_report

This is an example stack trace.

> Kafka Consumer Hangs after Network Partition
> --------------------------------------------
>
>                 Key: KAFKA-2553
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2553
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.8.1.1
>         Environment: Amazon EC2, Ubuntu 12.04.
>            Reporter: Aaditya Ramesh
>            Assignee: Neha Narkhede
>         Attachments: kafka_bug_report
>
>
> We have a Kafka consumer in an EC2 instance in Ireland that fetches data from
> a Kafka cluster in a datacenter in the eastern United States. We sporadically
> encounter network partitions during which we are unable to ping any machines
> between the two data centers (the ping always times out); this kind of
> partition is not unusual for inter-data center connections.
> However, the Kafka consumer's connection to Zookeeper never recovers after one
> of these network hiccups, and a full process restart is required before it
> can consume from the remote data center again once the network has recovered.
> The relevant code in ZookeeperConsumerConnector.scala catches all Throwables
> and does nothing with them, which not only fails to alert the process, but
> also exposes no metrics that we could use to diagnose such a hung state. We
> therefore patched the client code in our codebase to perform a
> System.exit(0) whenever this occurs, since a restart is better than failing
> silently.
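
For illustration only, the difference between swallowing every Throwable and failing fast might be sketched in Java roughly as follows. This is a hypothetical sketch, not the code in ZookeeperConsumerConnector.scala and not the reporter's patch: the names RebalanceGuard, attempt, maxFailures and onFatal are invented here, and the policy of giving up after a fixed number of consecutive failures is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; none of these names come from the Kafka codebase.
final class RebalanceGuard {
    // Consecutive failure count, readable by a metrics or alerting system.
    private final AtomicInteger failures = new AtomicInteger();
    private final int maxFailures;   // assumed threshold before giving up
    private final Runnable onFatal;  // e.g. () -> System.exit(1)

    RebalanceGuard(int maxFailures, Runnable onFatal) {
        this.maxFailures = maxFailures;
        this.onFatal = onFatal;
    }

    // Runs one attempt of the task and returns true if it succeeded.
    boolean attempt(Runnable task) {
        try {
            task.run();
            failures.set(0); // a success resets the consecutive count
            return true;
        } catch (Throwable t) {
            // Instead of silently dropping the error, record and report it.
            int n = failures.incrementAndGet();
            System.err.println("attempt failed (" + n + "/" + maxFailures + "): " + t);
            if (n >= maxFailures) {
                onFatal.run(); // escalate rather than hang forever
            }
            return false;
        }
    }

    int consecutiveFailures() {
        return failures.get();
    }
}
```

A caller could pass `() -> System.exit(1)` as onFatal to approximate the restart-on-failure behavior described in the report. The report uses exit code 0; the nonzero code here is an assumption, chosen so that a process supervisor can tell the exit was abnormal.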