Hi Vincent,

Thanks for reporting this.
Could you give some details on your setup (topics, partitions, and the
structure of your streams application) so I can attempt to reproduce the
issue? I've also put a couple of command sketches below the quoted
message. Thanks!

On 2017-06-27 14:46 (-0400), Vincent Rischmann <m...@vrischmann.me> wrote:
> Hello. So I had a weird problem this afternoon. I was deploying a
> streams application and wanted to delete the already existing internal
> state data, so I ran kafka-streams-application-reset.sh to do it, as
> recommended. It wasn't the first time I had run it, and it had always
> worked before, in staging and in production.
>
> Anyway, I ran the command, and around two or three minutes later we
> realized that a lot of things using the cluster were basically down,
> unable to fetch or produce. After investigating logs from the producers
> and the brokers, I saw that one broker was not responding, despite the
> process being up. It kept spewing `UnknownTopicOrPartitionException` in
> its logs, while other brokers were regularly spewing
> `NotLeaderForPartitionException`. A ZooKeeper node logged a lot of this:
>
> > 2017-06-27 15:51:32,897 [myid:2] - INFO [ProcessThread(sid:2 cport:-
> > 1)::PrepRequestProcessor@649] - Got user-level KeeperException when
> > processing sessionid:0x159cadf860e0089 type:setData cxid:0x249af08
> > zxid:0xb06b3722e txntype:-1 reqpath:n/a Error
> > Path:/brokers/topics/event-counter-per-day-store-
> > repartition/partitions/4/state Error:KeeperErrorCode = BadVersion for
> > /brokers/topics/event-counter-per-day-store-
> > repartition/partitions/4/state
>
> So from my point of view it looked like that one broker was "down", not
> responding to user requests, yet it was still seen as up by the cluster,
> and nobody could produce or fetch for the partitions for which it was
> previously the leader. Running kafka-topics.sh --describe, I also saw
> the leader listed as -1 for a bunch of partitions.
>
> As soon as I killed the process with `kill -9`, the cluster stabilized
> and everything went back to normal within seconds; producers were
> working again, as well as consumers. After I restarted the broker, it
> rejoined the cluster correctly and proceeded to actually perform the
> topic deletions.
>
> I'm not sure what exactly happened, but that was pretty scary. Has this
> happened to anyone else? My completely uneducated guess is that using
> kafka-streams-application-reset.sh on an application with 5 internal
> topics somehow triggered too many deletions at once and caused a broker
> to end up with an inconsistent ZooKeeper state? I have no idea if that's
> a plausible explanation.
>
> Anyway, right now I think I'm going to stop using
> kafka-streams-application-reset.sh and delete the topics one by one.
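For anyone trying to reproduce this, the reset tool invocation would have
looked roughly like the sketch below (the application id, broker address,
and input topic here are placeholders, and the exact flags depend on the
Kafka version in use):

    bin/kafka-streams-application-reset.sh \
        --application-id my-streams-app \
        --bootstrap-servers broker1:9092 \
        --input-topics my-input-topic

One thing worth noting is that the tool deletes all of the application's
internal topics (the *-repartition and *-changelog ones, whose names are
prefixed with the application id) in a single run, which lines up with
Vincent's guess about several deletions being issued at once.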
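And for the fallback of deleting the internal topics one by one, something
like the following should work (again, the ZooKeeper address and
application id are placeholders):

    # list the application's internal topics
    bin/kafka-topics.sh --zookeeper zk1:2181 --list | grep '^my-streams-app-'

    # delete them one at a time
    bin/kafka-topics.sh --zookeeper zk1:2181 --delete \
        --topic my-streams-app-event-counter-per-day-store-repartition

If the broker wedges again, it would also help to capture the partition
state znode mentioned in the BadVersion error, e.g.:

    # dump the state znode from the error message
    zkCli.sh -server zk1:2181 get \
        /brokers/topics/event-counter-per-day-store-repartition/partitions/4/state

That znode records the leader, leader epoch, and ISR that the controller
last wrote, which might help correlate with what the controller thought
was happening at the time.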