Todd, the Kafka problems started when one of three ZooKeeper nodes was restarted.
On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <tpal...@gmail.com> wrote:

> Did you hit the problems in the Kafka brokers and consumers during the Zookeeper problem, or after you had already cleared it?
>
> For us, we decided to skip the leap second problem (even though we're supposedly on a version that doesn't have that bug) by shutting down ntpd everywhere and then allowing it to slowly adjust the time afterwards without sending the leap second.
>
> -Todd
>
> On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
>
> > Hi Kafka users,
> >
> > ZooKeeper in our staging environment was running on a very old Ubuntu version that was exposed to the "leap second causes spuriously high CPU usage" bug:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
> >
> > As a result, when the leap second arrived, the ZooKeeper CPU usage went up to 100% and stayed there. In response, we restarted one ZooKeeper process. The ZooKeeper restart unfortunately made the situation much worse, as we hit three different (possibly related) Kafka problems. We are using Kafka 0.8.2 brokers, consumers and producers.
> >
> > #1
> > One of our three brokers was kicked out of ISR for some (most but not all) partitions, and was continuously logging "Cached zkVersion [XX] not equal to that in zookeeper, skip updating ISR" over and over (until I eventually stopped this broker).
> >
> > INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > etc.
> >
> > Searching the users@kafka.apache.org archive and Googling for this log output gives me similar descriptions, but nothing that exactly describes this. It is very similar to the following, but without the "ERROR Conditional update of path ..." part:
> > https://www.mail-archive.com/users@kafka.apache.org/msg07044.html
> >
> > #2
> > The remaining two brokers were logging this every five seconds or so:
> > INFO conflict in /brokers/ids/xxx data: {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092} stored data: {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092} (kafka.utils.ZkUtils$)
> > INFO I wrote this conflicted ephemeral node [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}] at /brokers/ids/xxx a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> >
> > It sounds very much like we hit this bug:
> > https://issues.apache.org/jira/browse/KAFKA-1387
> >
> > #3
> > The most serious issue that resulted was that some consumer groups failed to claim all partitions. When using the ConsumerOffsetChecker, the owner of some partitions was listed as "none", the lag was constantly increasing, and it was clear that no consumers were processing these messages.
> >
> > It is exactly what Dave Hamilton is describing here, but from this email chain no one seems to know what caused it:
> > https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
> >
> > It may be reasonable to assume that the consumer rebalance failures we also saw have something to do with this. But why the rebalance failed is still unclear.
> >
> > ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> > kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4 retries
> >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
> >
> > I am curious to hear whether anyone else has had similar problems.
> >
> > And also whether anyone can say if these are all known bugs that are being tracked under some ticket number.
> >
> > Thanks,
> > Christofer
> >
> > P.S. Eventually, after ZooKeeper and Kafka broker and consumer restarts, everything returned to normal.
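
In case it helps anyone hitting the same thing: the "conflicted ephemeral node" state from #2 can be inspected directly in ZooKeeper. Below is a rough sketch using the kazoo Python client (the connect string is a placeholder, and it assumes the standard 0.8.2 registration path /brokers/ids); it just prints each registered broker id together with its data and the session id that owns the ephemeral node, which is the session the broker log message says differs from its current one.

# Rough sketch: list Kafka broker registrations and the ZooKeeper session that
# owns each ephemeral node. Assumes the 0.8.2 layout under /brokers/ids and the
# kazoo client; the connect string is a placeholder.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder hosts
zk.start()
try:
    for broker_id in sorted(zk.get_children("/brokers/ids")):
        data, stat = zk.get("/brokers/ids/" + broker_id)
        # ephemeralOwner is the session id that created the node. A registration
        # left behind by an older, not-yet-expired session is what the broker
        # reports as a "conflicted ephemeral node".
        print("broker %s owner-session=0x%x data=%s"
              % (broker_id, stat.ephemeralOwner, data.decode("utf-8")))
finally:
    zk.stop()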
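
And for #3, the owner "none" output from ConsumerOffsetChecker can be cross-checked against the raw /consumers/<group>/owners path. A minimal sketch along the same lines (group, topic and hosts are placeholders, and it assumes the ZooKeeper-based registration used by the 0.8.2 high-level consumer):

# Rough sketch: report which partitions of a topic currently have no owner
# registered for a consumer group. Assumes the 0.8.2 high-level consumer's
# ZooKeeper layout; group, topic and hosts are placeholders.
from kazoo.client import KazooClient

GROUP = "my-consumer-group"  # placeholder
TOPIC = "topic-x"            # placeholder

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder hosts
zk.start()
try:
    partitions = zk.get_children("/brokers/topics/%s/partitions" % TOPIC)
    for p in sorted(partitions, key=int):
        owner_path = "/consumers/%s/owners/%s/%s" % (GROUP, TOPIC, p)
        if zk.exists(owner_path):
            owner, _stat = zk.get(owner_path)
            print("partition %s owner=%s" % (p, owner.decode("utf-8")))
        else:
            # This is the case ConsumerOffsetChecker reports as owner "none".
            print("partition %s owner=none" % p)
finally:
    zk.stop()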