Did you hit the Kafka broker and consumer problems while the ZooKeeper problem was still ongoing, or only after you had already cleared it?
For us, we decided to sidestep the leap second problem (even though we're supposedly on a version that doesn't have that bug) by shutting down ntpd everywhere and then letting it slowly adjust the time afterwards, without the leap second ever being delivered.

-Todd

On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
> Hi Kafka users,
>
> ZooKeeper in our staging environment was running on a very old Ubuntu
> version that was exposed to the "leap second causes spuriously high CPU
> usage" bug.
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
>
> As a result, when the leap second arrived, the ZooKeeper CPU usage went up
> to 100% and stayed there. In response to this, we restarted one ZooKeeper
> process. The ZooKeeper restart unfortunately made the situation much worse,
> as we hit three different (possibly related) Kafka problems. We are using
> Kafka 0.8.2 brokers, consumers and producers.
>
> #1
> One of our three brokers was kicked out of the ISR for some (most but not
> all) partitions, and was continuously logging "Cached zkVersion [XX] not
> equal to that in zookeeper, skip updating ISR" over and over (until I
> eventually stopped this broker).
>
> INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition
> [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal to
> that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition
> [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal to
> that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition
> [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal to
> that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> etc.
>
> Searching the users@kafka.apache.org archive and Googling for this log
> output gives me similar descriptions, but nothing that exactly matches.
> It is very similar to this, but without the "ERROR Conditional update of
> path ..." part:
> https://www.mail-archive.com/users@kafka.apache.org/msg07044.html
>
> #2
> The remaining two brokers were logging this every five seconds or so:
>
> INFO conflict in /brokers/ids/xxx data:
> {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
> stored data:
> {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
> (kafka.utils.ZkUtils$)
> INFO I wrote this conflicted ephemeral node
> [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
> at /brokers/ids/xxx a while back in a different session, hence I will
> backoff for this node to be deleted by Zookeeper and retry
> (kafka.utils.ZkUtils$)
>
> It sounds very much like we hit this bug:
> https://issues.apache.org/jira/browse/KAFKA-1387
>
> #3
> The most serious issue that resulted was that some consumer groups failed
> to claim all partitions. When using the ConsumerOffsetChecker, the owner of
> some partitions was listed as "none", the lag was constantly increasing,
> and it was clear that no consumers were processing these messages.
>
> It is exactly what Dave Hamilton is describing here, but from that email
> chain no one seems to know what caused it:
> https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
>
> It may be reasonable to assume that the consumer rebalance failures we also
> saw have something to do with this, but why the rebalance failed is still
> unclear.
>
> ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4
> retries
> at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
>
> I am curious to hear whether anyone else has had similar problems.
>
> Also, can anyone say whether these are all known bugs that are being
> tracked under some ticket number?
>
> Thanks,
> Christofer
>
> P.S. Eventually, after ZooKeeper, Kafka broker, and consumer restarts,
> everything returned to normal.
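On the ConsumerRebalanceFailedException above: the "4 retries" is the old high-level consumer's rebalance.max.retries setting (default 4), with rebalance.backoff.ms between attempts. If the consumers only needed more time to ride out the ZooKeeper trouble, giving them more headroom is one mitigation. A minimal sketch of what that might look like with the 0.8.2 high-level consumer API; the group id, ZooKeeper connect string, and the specific values are placeholders, not something from Christofer's setup:

import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class RebalanceTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection settings -- replace with your own.
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
        props.put("group.id", "example-group");

        // Give the consumer more slack during ZooKeeper instability:
        // more rebalance attempts, a longer backoff between them, and a
        // longer ZooKeeper session timeout. Values are illustrative only.
        props.put("rebalance.max.retries", "10");          // default is 4
        props.put("rebalance.backoff.ms", "5000");         // default is 2000 ms
        props.put("zookeeper.session.timeout.ms", "10000");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // ... create message streams and consume as usual ...

        connector.shutdown();
    }
}

Whether longer retries would actually have helped here depends on what blocked the rebalance; it only buys time and does not fix an inconsistent state in ZooKeeper.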
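On the partitions whose owner showed up as "none" in ConsumerOffsetChecker: the old consumer records ownership in ZooKeeper under /consumers/<group>/owners/<topic>/<partition>, so unclaimed partitions can be spotted directly. A rough sketch using the plain ZooKeeper client, assuming the 0.8 znode layout; the connect string, group, and topic are placeholders:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class UnclaimedPartitionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders -- point these at your own cluster, group and topic.
        String zkConnect = "zk1:2181,zk2:2181,zk3:2181";
        String group = "example-group";
        String topic = "topic-x";

        ZooKeeper zk = new ZooKeeper(zkConnect, 10000, event -> { });
        try {
            // Partition ids live under /brokers/topics/<topic>/partitions.
            List<String> partitions =
                    zk.getChildren("/brokers/topics/" + topic + "/partitions", false);

            for (String partition : partitions) {
                // The high-level consumer writes an ephemeral owner znode per partition.
                String ownerPath =
                        "/consumers/" + group + "/owners/" + topic + "/" + partition;
                boolean owned = zk.exists(ownerPath, false) != null;
                System.out.println(topic + "-" + partition + " owner: "
                        + (owned ? new String(zk.getData(ownerPath, false, null)) : "none"));
            }
        } finally {
            zk.close();
        }
    }
}

If a partition keeps showing no owner while its lag grows, the group never completed its rebalance, which matches what was described under #3.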