We had a Kafka 0.9 consumer stuck in the epoll native call under the following circumstances.
1. It was started bootstrapped with a cluster of 3 brokers A, B and C with ids 1, 2, 3.
2. Change the assignment of the brokers to some topic partitions. Seek to the beginning of each topic partition.
3. NO poll calls were made at all.
4. Each of the brokers A, B and C was replaced one by one by three new brokers D, E and F with the same ids 1, 2, 3. The process of replacement was:
   a. Shut down broker A (has id 1).
   b. Bring up broker D (has id 1, i.e. same as A).
   c. Give it a minute or so and do the same with B and C (replaced by E and F).
5. So by this time none of the bootstrapped brokers were alive. They were all replaced. I can imagine that this would cause a problem with the new 0.9 consumer since it doesn't have a watch on the brokers directory in ZK any more.
6. Call poll finally on the consumer.

Expected result: Some kind of exception or just empty results, since none of the brokers in the bootstrap list are present any more.

Observed result: The poll call is just blocked in Kafka. Even though a timeout of 500ms was provided, it never returned. I am not sure why this would happen, but the same thing happened on 45 hosts so I am guessing this is pretty reproducible. This led to the thread just getting stuck. We ultimately had to kill -9 our processes to recover from this. Ideally a Kafka poll call with a given timeout should never block indefinitely.
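Until the cause is understood, the caller's wait can at least be bounded on the application side. The following is a minimal sketch of that idea using only the JDK, not a fix for the underlying problem. The names `BoundedCall` and `callWithTimeout` are hypothetical, and the sleeping lambda is only a stand-in for a poll call that never returns. Cancelling the task interrupts the worker thread, but whether a stuck Kafka poll actually unwinds on interrupt is an assumption that has not been verified here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Kafka: bounds how long the caller waits.
final class BoundedCall {

    // Runs a possibly-blocking call on a daemon worker thread and stops
    // waiting after timeoutMs, returning the fallback value instead.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> result = worker.submit(call);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; it may or may not unblock
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a poll call that never returns.
        String outcome = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "records";
        }, 500, "timed out");
        System.out.println(outcome);
    }
}
```

Since the Kafka consumer is documented as not safe for multi-threaded access, a real application would need to keep every consumer call on one dedicated thread rather than creating a new worker per call as this sketch does. The sketch only shows how to stop the calling thread from stalling.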
Here is the stack trace I was able to get:

   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.$$YJP$$epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x0000000504e58468> (a sun.nio.ch.Util$2)
        - locked <0x0000000504e58450> (a java.util.Collections$UnmodifiableSet)
        - locked <0x0000000504e029d8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sf.org.apache.kafka9.common.network.Selector.select(Selector.java:425)
        at sf.org.apache.kafka9.common.network.Selector.poll(Selector.java:254)
        at sf.org.apache.kafka9.clients.NetworkClient.poll(NetworkClient.java:256)
        at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
        at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
        at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
        at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
        at sf.org.apache.kafka9.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
        at sf.org.apache.kafka9.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
        at sf.org.apache.kafka9.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)

sf.org.apache.kafka9 is just our shaded jar, but this is the stock Kafka 0.9 consumer code.

Is this a known issue? Even though this happened under extraordinary circumstances (i.e. the entire bootstrap list was replaced), the blocking ended up stalling the entire thread this code was running on.

Thanks,
Rajiv