[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

mjuarez (JIRA) Wed, 01 Mar 2017 16:23:16 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891355#comment-15891355
 ]


mjuarez commented on KAFKA-2729:
--------------------------------

We are also hitting this problem in our staging cluster, which runs Kafka 
0.10.0.1.  It started yesterday with a ZooKeeper client session timeout: 

{noformat}
[2017-02-28 18:41:33,513] INFO Client session timed out, have not heard from 
server in 7799ms for sessionid 0x159d7893eab0088, closing socket connection and 
attempting reconnect (org.apache.zookeeper.ClientCnxn)
{noformat}

I'm attributing that to a transient network issue, since we haven't seen any 
other issues.  Just over a minute later, we started seeing these messages:

{noformat}
[2017-02-28 18:42:45,739] INFO Partition 
[analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Shrinking ISR 
for partition [analyticsInfrastructure_KafkaAvroUserMessage,16] from 
102,101,105 to 101 (kaf
[2017-02-28 18:42:45,751] INFO Partition 
[analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Cached 
zkVersion [94] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,751] INFO Partition 
[qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Shrinking ISR for 
partition [qa_exporter11_slingshot_salesforce_invoice,6] from 101,105,104 to 
101 (kafka.clu
[2017-02-28 18:42:45,756] INFO Partition 
[qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Cached zkVersion 
[237] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,756] INFO Partition [GNRDEV_counters_singleCount,2] on 
broker 101: Shrinking ISR for partition [GNRDEV_counters_singleCount,2] from 
101,105,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [GNRDEV_counters_singleCount,2] on 
broker 101: Cached zkVersion [334] not equal to that in zookeeper, skip 
updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [sod-spins-spark-local,1] on broker 
101: Shrinking ISR for partition [sod-spins-spark-local,1] from 101,103,104 to 
101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,1] on broker 
101: Cached zkVersion [379] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,11] on broker 
101: Shrinking ISR for partition [sod-spins-spark-local,11] from 102,101,105 to 
101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,767] INFO Partition [sod-spins-spark-local,11] on broker 
101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
{noformat}
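
In case it helps anyone check their own brokers, here is a rough Python sketch 
for listing which partitions logged the zkVersion mismatch.  It assumes each 
log entry sits on a single line in the real log file (the excerpts above are 
wrapped by email), and the helper name stuck_partitions is my own, not 
anything from Kafka.

```python
import re

# Matches the "Cached zkVersion" message as it appears in the excerpts above,
# assuming each entry is on a single line in the real log file.
PATTERN = re.compile(
    r"Partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] on broker (?P<broker>\d+): "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def stuck_partitions(lines):
    """Return {(topic, partition): (broker, zkVersion)} for each mismatch seen."""
    found = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            key = (m["topic"], int(m["partition"]))
            found[key] = (int(m["broker"]), int(m["version"]))
    return found


# Two entries taken from the log above, unwrapped onto single lines.
sample = [
    "[2017-02-28 18:42:45,761] INFO Partition [GNRDEV_counters_singleCount,2] on "
    "broker 101: Cached zkVersion [334] not equal to that in zookeeper, skip "
    "updating ISR (kafka.cluster.Partition)",
    "[2017-02-28 18:42:45,761] INFO Partition [sod-spins-spark-local,1] on broker "
    "101: Shrinking ISR for partition [sod-spins-spark-local,1] from 101,103,104 to "
    "101 (kafka.cluster.Partition)",
]
print(stuck_partitions(sample))
```

To run it against a real file, pass an open file handle instead of the sample 
list; only the mismatch lines are collected, the "Shrinking ISR" lines are 
ignored.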

The "current" server is 101.  It thinks it's the leader for basically every 
partition it hosts, but it refuses to update the ISRs because its cached 
zkVersion doesn't match the one in ZooKeeper.  This leaves those partitions 
permanently under-replicated: the broker never catches up, since it doesn't 
think there's a problem.  Also, the metadata that 101 reports to consumers 
indicates it thinks it's part of the ISR, but none of the other brokers 
agree.
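
To make the failure mode concrete, here is a minimal Python sketch of a 
versioned conditional write, which is how I understand the ISR update to be 
guarded.  It is an in-memory stand-in, not Kafka's code: VersionedStore and 
try_shrink_isr are hypothetical names, and the stored version 95 and ISR 
[102, 105] are made-up values (only the cached version 94 comes from the log).

```python
class VersionedStore:
    """In-memory stand-in for a znode with a data version (hypothetical)."""

    def __init__(self, data, version=0):
        self.data = data
        self.version = version

    def set_if_version(self, data, expected_version):
        # Conditional write: succeeds only if the caller's version is current.
        if expected_version != self.version:
            return False, self.version
        self.data = data
        self.version += 1
        return True, self.version


def try_shrink_isr(store, cached_version, new_isr):
    """Attempt an ISR update using the caller's cached version.

    Returns the version the caller holds afterwards. On a mismatch the
    update is skipped and the stale cached version is kept, which is the
    behaviour the log lines suggest.
    """
    ok, new_version = store.set_if_version(new_isr, cached_version)
    if ok:
        return new_version
    print(f"Cached version [{cached_version}] not equal to stored version, skip updating ISR")
    return cached_version


# Assumption: some other writer has already bumped the stored version to 95.
store = VersionedStore(data=[102, 105], version=95)
cached = 94  # broker 101's stale view
for _ in range(3):
    cached = try_shrink_isr(store, cached, [101])
```

Because nothing in the retry path refreshes the cached version, every attempt 
fails the same way, which would match a broker that stays stuck until it is 
restarted.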

Let me know if more logs or details would be helpful.  I'll try restarting the 
node and see whether that clears it.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of under-replicated partitions. The zookeeper 
> cluster recovered, but we continued to see a large number of 
> under-replicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only 
> recovered after a restart. Our own investigation yielded nothing, so I was 
> hoping you could shed some light on this issue. It is possibly related to 
> https://issues.apache.org/jira/browse/KAFKA-1382 , although we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
