Hi, we're using Kafka 0.10.1.1 and the streams app is using 0.10.2.1
On Thu, Jun 29, 2017, at 04:21 PM, Bill Bejeck wrote:
> Hi Vincent,
>
> What version of Kafka/Kafka Streams are you running, more specifically when this error occurred?
>
> Thanks,
> Bill
>
> On Wed, Jun 28, 2017 at 12:24 PM, Bill Bejeck <b...@confluent.io> wrote:
> >
> > Thanks for the info Vincent.
> >
> > -Bill
> >
> > On Wed, Jun 28, 2017 at 12:19 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> >
> >> I'm not sure what exactly to search for, but here are a couple of things.
> >>
> >> First, looking at broker 1, there's this: https://pastebin.com/raw/XMLpjj7J which I find weird, because I'm sure I didn't create those topics manually... which suggests to me that the streams application was still up (despite me thinking it was shut down) and therefore it recreated the topics? Or is there another explanation here?
> >>
> >> After that, there is a lot of stuff like this https://pastebin.com/raw/7PfRWuZb until 15:51, when there's this https://pastebin.com/raw/Pnhs4JT1. Then at 15:53 the topics seem to be deleted again: https://pastebin.com/raw/wTL9PcWJ
> >>
> >> The broker with the problem was broker 5. Here's a piece of the logs just after I ran the reset script: https://pastebin.com/raw/PgCDZFb8
> >>
> >> After that there are a LOT of UnknownTopicOrPartitionException and then this: https://pastebin.com/raw/b7k7yNdQ
> >> This repeats for a while until 15:50, when it seems the broker thinks it's the only one in the cluster: https://pastebin.com/raw/Qy0TGjnx I didn't include everything, but it looks like every topic/partition had its ISR shrink.
> >>
> >> This goes on for a long time. At 15:57 there's this: https://pastebin.com/raw/VrTGdXBZ
> >> After that there are a lot of NotLeaderForPartitionException: https://pastebin.com/raw/PhDQts66
> >>
> >> Then at 16:04, according to the log, I killed -9 the process. At 16:13 I restarted it. It spent 12 minutes recreating index files and then proceeded to create the topics that I had tried to delete: https://pastebin.com/raw/bz9MYRsJ
> >> Then just one minute later it deleted the topics: https://pastebin.com/raw/EQ2Kuwj6
> >>
> >> And after that it operated normally.
> >>
> >> Sorry if it's a bit random and unorganized, I hope it helps anyway. I can search for some specific things if you'd like.
> >>
> >>
> >> On Wed, Jun 28, 2017, at 12:17 AM, Bill Bejeck wrote:
> >> > Thanks Vincent.
> >> >
> >> > That's a good start for now.
> >> >
> >> > If you get a chance to forward some logs, that would be great.
> >> >
> >> > -Bill
> >> >
> >> > On Tue, Jun 27, 2017 at 6:10 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> >> >
> >> > > Sure.
> >> > >
> >> > > The application reads a topic of keyless events and, based on some criteria of the event, it creates a new key and uses that for `selectKey`. Then I groupByKey and count with 3 different windows. Each count is then stored in a database.
> >> > >
> >> > > The three windows are tumbling windows:
> >> > > - 1 minute window, 1 day retention
> >> > > - 1 hour window, 7 days retention
> >> > > - 1 day window, 15 days retention
> >> > > That's basically it for the structure.
> >> > > The input topic has 64 partitions.
> >> > >
> >> > > Tomorrow I can get some logs from Kafka/Zookeeper if that would help.
> >> > >
> >> > > On Tue, Jun 27, 2017, at 11:41 PM, Bill Bejeck wrote:
> >> > > > Hi Vincent,
> >> > > >
> >> > > > Thanks for reporting this issue.
> >> > > > Could you give us some more details (number of topics, partitions per topic, and the structure of your Kafka Streams application) so we can attempt to reproduce and diagnose the issue?
> >> > > >
> >> > > > Thanks!
> >> > > > Bill
> >> > > >
> >> > > > On 2017-06-27 14:46 (-0400), Vincent Rischmann <m...@vrischmann.me> wrote:
> >> > > > > Hello, so I had a weird problem this afternoon. I was deploying a streams application and wanted to delete the already existing internal state data, so I ran kafka-streams-application-reset.sh to do it, as recommended. It wasn't the first time I had run it, and it had always worked before, in staging and in production.
> >> > > > > Anyway, I ran the command and around 2-3 minutes later we realized that a lot of stuff using the cluster was basically down, unable to fetch or produce. After investigating logs from the producers and the brokers, I saw that one broker was not responding, despite the process being up. It kept spewing `UnknownTopicOrPartitionException` in the logs, and other brokers were regularly spewing `NotLeaderForPartitionException`. A zookeeper node logged a lot of this:
> >> > > > > > 2017-06-27 15:51:32,897 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159cadf860e0089 type:setData cxid:0x249af08 zxid:0xb06b3722e txntype:-1 reqpath:n/a Error Path:/brokers/topics/event-counter-per-day-store-repartition/partitions/4/state Error:KeeperErrorCode = BadVersion for /brokers/topics/event-counter-per-day-store-repartition/partitions/4/state
> >> > > > > So from my point of view it looked like that one broker was "down", not responding to user requests, yet it was still seen as up by the cluster, and nobody could produce or fetch for the partitions it was previously the leader of. Running kafka-topics.sh --describe I also saw the leader being -1 for a bunch of partitions.
> >> > > > > As soon as I `kill -9`'d the process, the cluster stabilized and everything went back to normal within seconds; producers were working again, as well as consumers. After I restarted the broker, it joined the cluster, proceeded to actually do the topic deletions, and rejoined correctly.
> >> > > > > I'm not sure what exactly happened, but that was pretty scary. Has it happened to anyone else? My completely uneducated guess is that somehow, using kafka-streams-application-reset.sh on an application with 5 internal topics caused too many deletions at once and thus caused a broker to end up with a wrong zookeeper state? I have no idea if that's a plausible explanation.
> >> > > > > Anyway, right now I think I'm going to stop using kafka-streams-application-reset.sh and delete the topics one by one.
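For what it's worth, here is a minimal sketch of the topology described above (selectKey on a derived key, groupByKey, then three windowed counts), written against the 0.10.2.1 Streams API. The input topic name, the key-derivation logic, and the store names are placeholders of mine (the per-day store name is borrowed from the zookeeper log above); the database write is stubbed out. It only illustrates the shape of the application, not the real code.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class EventCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-counter");     // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers

        KStreamBuilder builder = new KStreamBuilder();

        // Keyless input events; "events" and deriveKey() are placeholders.
        KStream<String, String> events =
                builder.stream(Serdes.String(), Serdes.String(), "events");

        KGroupedStream<String, String> grouped = events
                .selectKey((key, value) -> deriveKey(value))   // pick a key from the event's content
                .groupByKey(Serdes.String(), Serdes.String()); // causes a repartition topic

        // Three tumbling windows, each count backed by its own changelog topic.
        KTable<Windowed<String>, Long> perMinute = grouped.count(
                TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).until(TimeUnit.DAYS.toMillis(1)),
                "event-counter-per-minute-store");
        KTable<Windowed<String>, Long> perHour = grouped.count(
                TimeWindows.of(TimeUnit.HOURS.toMillis(1)).until(TimeUnit.DAYS.toMillis(7)),
                "event-counter-per-hour-store");
        KTable<Windowed<String>, Long> perDay = grouped.count(
                TimeWindows.of(TimeUnit.DAYS.toMillis(1)).until(TimeUnit.DAYS.toMillis(15)),
                "event-counter-per-day-store");

        // Each windowed count is then written to a database (stubbed out here).
        perMinute.toStream().foreach(EventCounter::writeToDatabase);
        perHour.toStream().foreach(EventCounter::writeToDatabase);
        perDay.toStream().foreach(EventCounter::writeToDatabase);

        new KafkaStreams(builder, props).start();
    }

    // Placeholder: derive a key from the event payload.
    private static String deriveKey(String value) {
        return value;
    }

    // Placeholder: persist one windowed count to the database.
    private static void writeToDatabase(Windowed<String> window, Long count) {
        // database write not shown
    }
}
```

With this shape, the internal topics the reset script has to delete are the repartition topic created by selectKey/groupByKey plus one changelog topic per windowed count store, which matches the handful of internal topics mentioned above.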