Hi Kafka users,

ZooKeeper in our staging environment was running on a very old Ubuntu
version that was affected by the "leap second causes spuriously high CPU
usage" bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285

As a result, when the leap second arrived, ZooKeeper CPU usage went up to
100% and stayed there. In response, we restarted one ZooKeeper process.
Unfortunately, the restart made the situation much worse, as we then hit
three different (and possibly related) Kafka problems. We are running
Kafka 0.8.2 brokers, consumers, and producers.
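
(For completeness, and only from memory, so please verify before relying
on it: the workaround that was circulating for this kernel bug was to
reset the system clock rather than restart services, e.g. running
date -s "$(date)" as root on the affected host.)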


#1
One of our three brokers was kicked out of the ISR for most (but not all)
of its partitions, and kept logging "Cached zkVersion [XX] not equal
to that in zookeeper, skip updating ISR" over and over, until I eventually
stopped that broker.

INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition
[topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition
[topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition
[topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to
that in zookeeper, skip updating ISR (kafka.cluster.Partition)
etc.
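
In case it is useful for comparing notes, the partitions with a shrunken
ISR can be listed with something along these lines (the ZooKeeper host is
a placeholder, and I am quoting the flags from memory, so please
double-check them against your version):

  bin/kafka-topics.sh --describe --under-replicated-partitions --zookeeper zk1:2181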

Searching the users@kafka.apache.org archive and Googling for this log
output turns up similar descriptions, but nothing that matches exactly.
The closest is the thread below, but we never saw the "ERROR Conditional
update of path ..." part.
https://www.mail-archive.com/users@kafka.apache.org/msg07044.html


#2
The remaining two brokers were logging the following every five seconds or so:
INFO conflict in /brokers/ids/xxx data:
{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
stored data:
{"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
(kafka.utils.ZkUtils$)
INFO I wrote this conflicted ephemeral node
[{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
at /brokers/ids/xxx a while back in a different session, hence I will
backoff for this node to be deleted by Zookeeper and retry
(kafka.utils.ZkUtils$)
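
The broker registration that this conflict refers to can be inspected
directly in ZooKeeper, roughly like this (broker id and host are
placeholders):

  bin/zkCli.sh -server zk1:2181
  ls /brokers/ids
  get /brokers/ids/1

Note that the "stored data" in the log above carries the older timestamp,
which fits the "wrote this ... in a different session" explanation: the
ephemeral node from the broker's previous session had not yet been expired
by ZooKeeper.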

It sounds very much like we hit this bug:
https://issues.apache.org/jira/browse/KAFKA-1387


#3
The most serious resulting issue was that some consumer groups failed to
claim all partitions. When running the ConsumerOffsetChecker, the owner of
some partitions was listed as "none", the lag on those partitions kept
increasing, and it was clear that no consumer was processing their messages.
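
For reference, the check was done with the stock offset checker, invoked
roughly like this (group name and ZooKeeper host are placeholders):

  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper zk1:2181 --group my-group

The Owner column for the stuck partitions showed "none" and the Lag column
kept growing.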

It is exactly what Dave Hamilton describes here, but in that thread no one
seems to know what caused it.
https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html

It seems reasonable to assume that the consumer rebalance failures we also
saw have something to do with this, but why the rebalances failed is still
unclear.

ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4 retries
        at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
        at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
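
As far as I can tell, the "4 retries" corresponds to the high-level
consumer's default rebalance.max.retries=4. One obvious mitigation, not
verified to help in this case, would be to give the consumers more room to
ride out ZooKeeper hiccups, e.g. something like this in the consumer
config (values are examples, not recommendations):

  rebalance.max.retries=10
  rebalance.backoff.ms=10000
  zookeeper.session.timeout.ms=30000

But that would only paper over whatever the underlying problem is.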


I am curious to hear whether anyone else has run into problems like these.

I would also like to know whether these are all known bugs that are being
tracked under some ticket number.


Thanks,
Christofer

P.S. Eventually, after restarting ZooKeeper, the Kafka brokers, and the
consumers, everything returned to normal.
