[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891355#comment-15891355 ]
mjuarez commented on KAFKA-2729:
--------------------------------

We are also running into this problem in our staging cluster, running Kafka 0.10.0.1. Basically it looks like this happened yesterday:

{noformat}
[2017-02-28 18:41:33,513] INFO Client session timed out, have not heard from server in 7799ms for sessionid 0x159d7893eab0088, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
{noformat}

I'm attributing that to a transient network issue, since we haven't seen any other issues. A little over a minute later, we started seeing these errors:

{noformat}
[2017-02-28 18:42:45,739] INFO Partition [analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Shrinking ISR for partition [analyticsInfrastructure_KafkaAvroUserMessage,16] from 102,101,105 to 101 (kaf
[2017-02-28 18:42:45,751] INFO Partition [analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Cached zkVersion [94] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,751] INFO Partition [qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Shrinking ISR for partition [qa_exporter11_slingshot_salesforce_invoice,6] from 101,105,104 to 101 (kafka.clu
[2017-02-28 18:42:45,756] INFO Partition [qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,756] INFO Partition [GNRDEV_counters_singleCount,2] on broker 101: Shrinking ISR for partition [GNRDEV_counters_singleCount,2] from 101,105,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [GNRDEV_counters_singleCount,2] on broker 101: Cached zkVersion [334] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [sod-spins-spark-local,1] on broker 101: Shrinking ISR for partition [sod-spins-spark-local,1] from 101,103,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,1] on broker 101: Cached zkVersion [379] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,11] on broker 101: Shrinking ISR for partition [sod-spins-spark-local,11] from 102,101,105 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,767] INFO Partition [sod-spins-spark-local,11] on broker 101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{noformat}

The "current" server is 101. It thinks it's the leader for basically every partition on that node, but it's refusing to update the ISRs, because the cached zkVersion doesn't match the one in zookeeper. This is causing permanently under-replicated partitions: the server never catches up, since it doesn't think there's a problem. Also, the metadata reported by the 101 server to consumers indicates it thinks it's part of the ISR, but no other broker agrees.

Let me know if more logs/details would be helpful. I'll try to fix this by restarting the node, and hopefully that resolves the issue.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The zookeeper
> cluster recovered; however, we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66]
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This appeared for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue. Possibly it's related to
> https://issues.apache.org/jira/browse/KAFKA-1382; however, we're using
> 0.8.2.1.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
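
For anyone triaging the same symptom, here is a minimal sketch of a log scan that lists which partitions a broker has skipped ISR updates for. It only assumes the "Cached zkVersion [N] not equal to that in zookeeper" message format shown in the logs above, with each log entry on a single line. The function name `find_stale_partitions` is made up for illustration; this is not part of Kafka or its tooling.

```python
import re
from collections import defaultdict

# Matches the broker log message quoted in this thread, for example:
#   INFO Partition [sod-spins-spark-local,11] on broker 101: Cached zkVersion [237]
#   not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
# Assumption: the whole entry sits on one line, as in the 0.10.0.1 logs above.
SKIP_RE = re.compile(
    r"Partition \[(?P<topic>[^\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def find_stale_partitions(lines):
    """Return {broker_id: {(topic, partition): cached_zk_version}}.

    Only "skip updating ISR" entries are collected; the preceding
    "Shrinking ISR" lines are ignored. If a partition appears more than
    once, the last cached version seen wins.
    """
    stale = defaultdict(dict)
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            key = (match.group("topic"), int(match.group("partition")))
            stale[int(match.group("broker"))][key] = int(match.group("version"))
    return dict(stale)
```

Passing an open log file (for example `find_stale_partitions(open("server.log"))`, where the file name is an assumption about your broker's log location) gives a per-broker map of affected partitions, which may help confirm whether a restart actually cleared the condition.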