Re: Weird broker lock-up causing an almost global downtime

Bill Bejeck Thu, 29 Jun 2017 07:21:27 -0700

Hi Vincent,

What version of Kafka/Kafka Streams are you running, more specifically when
this error occurred?


Thanks,
Bill

On Wed, Jun 28, 2017 at 12:24 PM, Bill Bejeck <b...@confluent.io> wrote:

> Thanks for the info Vincent.
>
> -Bill
>
> On Wed, Jun 28, 2017 at 12:19 PM, Vincent Rischmann <m...@vrischmann.me>
> wrote:
>
>> I'm not sure what exactly to search for, but here are a couple of
>> things.
>>
>> First, looking at the broker 1, there's this:
>> https://pastebin.com/raw/XMLpjj7J which I find weird, because I'm sure I
>> didn't create those topics manually...which means to me that the streams
>> application was still up (despite me thinking it was shut down) and
>> therefore it recreated the topics ? Or is there another explanation here
>> ?
>>
>> After that, there is a lot of stuff like this
>> https://pastebin.com/raw/7PfRWuZb until 15:51 when there's this
>> https://pastebin.com/raw/Pnhs4JT1. Still after that at 15:53 the topics
>> are again deleted it seems https://pastebin.com/raw/wTL9PcWJ
>>
>> The broker with the problem was the broker 5. Here's a piece of logs
>> just after I ran the reset script. https://pastebin.com/raw/PgCDZFb8
>>
>> After that there are a LOT of UnknownTopicOrPartitionException and then
>> this: https://pastebin.com/raw/b7k7yNdQ
>> This repeats for a while until 15:50 when it seems the broker thinks
>> he's the only one in the cluster ? https://pastebin.com/raw/Qy0TGjnx I
>> didn't include everything but it looks like every topic/partition had
>> it's ISR shrink.
>>
>> This goes on for a long time. At  15:57 there's this
>> https://pastebin.com/raw/VrTGdXBZ
>> After that there a lot of NotLeaderForPartitionException
>> https://pastebin.com/raw/PhDQts66
>>
>> Then at 16:04 according to the log I killed -9 the process. At 16:13 I
>> restarted it. It spent 12 minutes recreating index files and then
>> proceeded to create the topics that I tried to delete:
>> https://pastebin.com/raw/bz9MYRsJ
>> Then just one minute later it deleted the topics
>> https://pastebin.com/raw/EQ2Kuwj6
>>
>> And after that it operated normally.
>>
>> Sorry if it's a bit random and unorganized, I hope it helps anyway. I
>> can search for some specific things if you'd like.
>>
>>
>> On Wed, Jun 28, 2017, at 12:17 AM, Bill Bejeck wrote:
>> > Thanks Vincent.
>> >
>> > That's a good start for now.
>> >
>> > If you get a chance to forward some logs that would be great.
>> >
>> > -Bill
>> >
>> > On Tue, Jun 27, 2017 at 6:10 PM, Vincent Rischmann <m...@vrischmann.me>
>> > wrote:
>> >
>> > > Sure.
>> > >
>> > > The application reads a topic of keyless events and based on some
>> > > criteria of the event, it creates a new key and uses that for
>> > > `selectKey`.
>> > > Then I groupByKey and count with 3 differents windows. Each count is
>> > > then stored in a database.
>> > >
>> > > The three windows are tumbling windows:
>> > >   - 1 minute window, 1 day retention
>> > >   - 1 hour window, 7 day retention
>> > >   - 1 day window, 15 days retention
>> > > That's basically it for the structure.
>> > > The input topic has 64 partitions.
>> > >
>> > > Tomorrow I can get some logs from Kafka/Zookeeper if that would help.
>> > >
>> > > On Tue, Jun 27, 2017, at 11:41 PM, Bill Bejeck wrote:
>> > > > Hi Vincent,
>> > > >
>> > > > Thanks for reporting this issue.  Could you give us some more
>> details
>> > > > (number topics, partitions per topic and the structure of your Kafka
>> > > > Streams application) so we attempt to reproduce and diagnose the
>> issue?
>> > > >
>> > > > Thanks!
>> > > > Bill
>> > > >
>> > > > On 2017-06-27 14:46 (-0400), Vincent Rischmann <m...@vrischmann.me>
>> wrote:
>> > > > > Hello. so I had a weird problem this afternoon. I was deploying a
>> > > > > streams application and wanted to delete already existing internal
>> > > > > states data so I ran kafka-streams-application-reset.sh to do
>> it, as
>> > > > > recommended. it wasn't the first time I ran it and it had always
>> worked
>> > > > > before, in staging or in production.
>> > > > > Anyway, I run the command and around 2/3 minutes later we realize
>> a lot
>> > > > > of stuff using the cluster is basically down, unable to fetch or
>> > > > > produce. After investigating logs from the producers and the
>> brokers I
>> > > > > saw that one broker was not responding, despite the process being
>> up.
>> > > It
>> > > > > kept spewing `UnknownTopicOrPartitionException` in the logs,
>> other
>> > > > > brokers were spewing `NotLeaderForPartitionException` regularly.
>> A
>> > > > > zookeeper node logged a lot of this:
>> > > > > > 2017-06-27 15:51:32,897 [myid:2] - INFO  [ProcessThread(sid:2
>> cport:-
>> > > > > > 1)::PrepRequestProcessor@649] - Got user-level KeeperException
>> when
>> > > > > > processing sessionid:0x159cadf860e0089 type:setData
>> cxid:0x249af08
>> > > > > > zxid:0xb06b3722e txntype:-1 reqpath:n/a Error
>> > > Path:/brokers/topics/event-counter-per-day-store-
>> > > > > > repartition/partitions/4/state Error:KeeperErrorCode =
>> BadVersion for
>> > > > > > /brokers/topics/event-counter-per-day-store-
>> > > > > > repartition/partitions/4/state
>> > > > > So from my point of view it looked like that one broker was
>> "down", not
>> > > > > responding to user requests but yet it was still seen as up by the
>> > > > > cluster and nobody could produce or fetch for the partitions it
>> was
>> > > > > previously a leader. Running kafka-topics.sh --describe I also
>> saw the
>> > > > > leader being -1 for a bunch of partitions.
>> > > > >  As soon as I `kill -9` the process, the cluster stabilized and
>> > > > >  everything went back to normal pretty much in seconds, producers
>> were
>> > > > >  working again as well as consumers. After I restarted the
>> broker, it
>> > > > >  joined the cluster, proceeded to actually do the topic deletions
>> and
>> > > > >  rejoined correctly too.
>> > > > > I'm not sure what exactly happened but that was pretty scary. Has
>> it
>> > > > > happened to anyone else ? My completely uneducated guess is that
>> > > > > somehow, using kafka-streams-application-reset.sh on an
>> application
>> > > with
>> > > > > 5 internal topics caused too many deletions at once and thus
>> caused a
>> > > > > broker to end up with a wrong zookeeper state ? I have no idea if
>> > > that's
>> > > > > a plausible explanation.
>> > > > > Anyway, right now I think I'm going to stop using
>> > > kafka-streams-application-
>> > > > > reset.sh and delete the topics one by one
>> > > > >
>> > >
>>
>
>

Re: Weird broker lock-up causing an almost global downtime

Reply via email to