[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967705#comment-15967705 ]
Jun Rao commented on KAFKA-2729:
--------------------------------

Thanks for the additional info.

In both [~Ronghua Lin] and [~allenzhuyi]'s case, it seems ZK session expiration had happened. As I mentioned earlier in the jira, there is a known issue reported in KAFKA-3083 that when the controller's ZK session expires and it loses its controller-ship, it's possible for this zombie controller to continue updating ZK and/or sending LeaderAndIsrRequests to the brokers for a short period of time. When this happens, the broker may not have the most up-to-date information about leader and isr, which can lead to subsequent ZK failures when the isr needs to be updated.

It may take some time to have this issue fixed. In the interim, the workaround for this issue is to make sure ZK session expiration never happens. The first thing is to figure out what's causing the ZK session to expire. Two common causes are (1) long broker GC and (2) network glitches. For (1), one needs to tune the GC in the broker properly. For (2), one can look at the reported time that the ZK client can't hear from the ZK server and increase the ZK session expiration time accordingly.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The zookeeper
> cluster recovered, however we continued to see a large number of
> under-replicated partitions.
> Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered
> after a restart. Our own investigation yielded nothing, I was hoping you
> could shed some light on this issue. Possibly if it's related to:
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using
> 0.8.2.1.
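
For cause (2) in the comment above, the following is only a rough sketch of how one might read the reported silence out of a broker log and derive a candidate value for the broker's `zookeeper.session.timeout.ms`. The log-line pattern is an assumption about what the ZK client prints, the sample lines are made up for illustration, and the function names and the `headroom` factor are hypothetical; check all of them against your own logs and environment before relying on the result.

```python
import re

# Assumed shape of the ZK client's timeout message; verify against your broker logs.
TIMEOUT_RE = re.compile(r"have not heard from server in (\d+)ms")


def longest_silence_ms(log_lines):
    """Return the longest reported silence in milliseconds, or None if no line matches."""
    gaps = []
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            gaps.append(int(match.group(1)))
    return max(gaps) if gaps else None


def suggested_session_timeout_ms(log_lines, current_timeout_ms, headroom=2.0):
    """Suggest a session timeout: longest silence times a (hypothetical) headroom
    factor, never lower than the currently configured value."""
    longest = longest_silence_ms(log_lines)
    if longest is None:
        return current_timeout_ms
    return max(current_timeout_ms, int(longest * headroom))


if __name__ == "__main__":
    # Illustrative lines only, not taken from the reported clusters.
    sample = [
        "INFO Client session timed out, have not heard from server in 4001ms for sessionid 0x1",
        "INFO Client session timed out, have not heard from server in 7200ms for sessionid 0x1",
    ]
    print(suggested_session_timeout_ms(sample, current_timeout_ms=6000))  # 14400
```

A larger session timeout also delays detection of a genuinely dead broker, so the headroom factor is a trade-off rather than a recommendation. Long GC pauses, cause (1), are better addressed by GC tuning than by raising the timeout.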