Todd, the Kafka problems started when one of three ZooKeeper nodes was
restarted.

On Thu, Jul 9, 2015 at 12:10 PM, Todd Palino <tpal...@gmail.com> wrote:

> Did you hit the problems in the Kafka brokers and consumers during the
> Zookeeper problem, or after you had already cleared it?
>
> For us, we decided to skip the leap second problem (even though we're
> supposedly on a version that doesn't have that bug) by shutting down ntpd
> everywhere and then allowing it to slowly adjust the time afterwards
> without sending the leap second.
>
> -Todd
>
>
> On Thu, Jul 9, 2015 at 7:58 AM, Christofer Hedbrandh <christo...@knewton.com>
> wrote:
>
> > Hi Kafka users,
> >
> > ZooKeeper in our staging environment was running on a very old ubuntu
> > version, that was exposed to the "leap second causes spuriously high CPU
> > usage" bug.
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285
> >
> > As a result, when the leap second arrived, the ZooKeeper CPU usage went up
> > to 100% and stayed there. In response to this, we restarted one ZooKeeper
> > process. The ZooKeeper restart unfortunately made the situation much worse
> > as we hit three different (possibly related) Kafka problems. We are using
> > Kafka 0.8.2 brokers, consumers and producers.
> >
> >
> > #1
> > One of our three brokers was kicked out of the ISR for some (most but not
> > all) partitions, and was continuously logging "Cached zkVersion [XX] not
> > equal to that in zookeeper, skip updating ISR" over and over (until I
> > eventually stopped this broker).
> >
> > INFO Partition [topic-x,xx] on broker 1: Shrinking ISR for partition
> > [topic-x,xx] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-x,xx] on broker 1: Cached zkVersion [62] not equal
> > to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > INFO Partition [topic-y,yy] on broker 1: Shrinking ISR for partition
> > [topic-y,yy] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-y,yy] on broker 1: Cached zkVersion [39] not equal
> > to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > INFO Partition [topic-z,zz] on broker 1: Shrinking ISR for partition
> > [topic-z,zz] from 1,2,3 to 1 (kafka.cluster.Partition)
> > INFO Partition [topic-z,zz] on broker 1: Cached zkVersion [45] not equal
> > to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> > etc.
> >
> > Searching the users@kafka.apache.org archive and Googling for this log
> > output gives me similar descriptions, but nothing that exactly describes
> > this.
> > It is very similar to this, but without the "ERROR Conditional update of
> > path ..." part.
> > https://www.mail-archive.com/users@kafka.apache.org/msg07044.html
> >
> >
> > #2
> > The remaining two brokers were logging this every five seconds or so.
> >
> > INFO conflict in /brokers/ids/xxx data:
> > {"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}
> > stored data:
> > {"jmx_port":xxx,"timestamp":"1435711782536","host":"xxx","version":1,"port":9092}
> > (kafka.utils.ZkUtils$)
> > INFO I wrote this conflicted ephemeral node
> > [{"jmx_port":xxx,"timestamp":"1435712198759","host":"xxx","version":1,"port":9092}]
> > at /brokers/ids/xxx a while back in a different session, hence I will
> > backoff for this node to be deleted by Zookeeper and retry
> > (kafka.utils.ZkUtils$)
> >
> > It sounds very much like we hit this bug:
> > https://issues.apache.org/jira/browse/KAFKA-1387
> >
> >
> > #3
> > The most serious issue that resulted was that some consumer groups failed
> > to claim all partitions. When using the ConsumerOffsetChecker, the owner
> > of some partitions was listed as "none", the lag was constantly increasing,
> > and it was clear that no consumers were processing these messages.
> >
> > It is exactly what Dave Hamilton is describing here, but from this email
> > chain no one seems to know what caused it.
> > https://www.mail-archive.com/users%40kafka.apache.org/msg13364.html
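> >
> > For reference, the owner that ConsumerOffsetChecker reports for each
> > partition is read from ZooKeeper under /consumers/<group>/owners/<topic>.
> > A rough sketch of checking that path directly (the connect string, group
> > and topic below are placeholders, and it assumes the 0.8 consumer layout
> > in ZooKeeper):
> >
> > import org.I0Itec.zkclient.ZkClient
> > import scala.collection.JavaConverters._
> >
> > object OwnerCheck {
> >   def main(args: Array[String]): Unit = {
> >     // Placeholders: adjust the connect string, group and topic for your setup.
> >     val zk = new ZkClient("zookeeper-host:2181", 30000, 30000)
> >     val group = "some-consumer-group"
> >     val topic = "topic-x"
> >     // All partitions of the topic, as registered by the brokers.
> >     val partitions = zk.getChildren(s"/brokers/topics/$topic/partitions").asScala.toSet
> >     // Each claimed partition has an ephemeral child znode written by its consumer thread.
> >     val owned = zk.getChildren(s"/consumers/$group/owners/$topic").asScala.toSet
> >     println("Partitions with no owner: " + (partitions -- owned).mkString(", "))
> >     zk.close()
> >   }
> > }
> >
> > Partitions shown with owner "none" by the tool should be exactly the ones
> > missing a child znode here.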
> >
> > It may be reasonable to assume that the consumer rebalance failures we also
> > saw have something to do with this. But why the rebalance failed is still
> > unclear.
> >
> > ERROR k.c.ZookeeperConsumerConnector: error during syncedRebalance
> > kafka.common.ConsumerRebalanceFailedException: xxx can't rebalance after 4
> > retries
> > at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:633)
> > at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:551)
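> >
> > For what it's worth, the "4 retries" above is the high-level consumer's
> > default rebalance.max.retries, so one thing worth trying is giving the
> > rebalance more retries and a longer backoff. A rough sketch (the values
> > and names below are placeholders, not settings we have verified):
> >
> > import java.util.Properties
> > import kafka.consumer.{Consumer, ConsumerConfig}
> >
> > object MoreRebalanceRetries {
> >   def main(args: Array[String]): Unit = {
> >     val props = new Properties()
> >     props.put("zookeeper.connect", "zookeeper-host:2181") // placeholder
> >     props.put("group.id", "some-consumer-group")          // placeholder
> >     // Give the rebalance more room before ConsumerRebalanceFailedException is thrown.
> >     props.put("rebalance.max.retries", "10")              // default is 4
> >     props.put("rebalance.backoff.ms", "4000")
> >     props.put("zookeeper.session.timeout.ms", "10000")
> >     val connector = Consumer.create(new ConsumerConfig(props))
> >     // ... create message streams as usual, and call connector.shutdown() when done.
> >   }
> > }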
> >
> >
> > I am curious to hear whether anyone else has had similar problems.
> >
> > And also whether anyone can say if these are all known bugs that are
> > being tracked under some ticket number?
> >
> >
> > Thanks,
> > Christofer
> >
> > P.S. Eventually, after restarting ZooKeeper and the Kafka brokers and
> > consumers, everything returned to normal.
> >
>
