To manually fail over the controller, just delete the /controller znode in ZooKeeper.
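Something like the following should do it with the zookeeper-shell script bundled with Kafka (the connect string zk1:2181 below is just a placeholder; use your own ensemble):

  $ bin/zookeeper-shell.sh zk1:2181
  get /controller
  delete /controller
  get /controller

The first "get" shows which broker id currently holds the controller role. The /controller znode is ephemeral, so deleting it immediately triggers a new controller election; re-running "get" a few seconds later should show the new controller's id. You can also watch ActiveControllerCount on the brokers to confirm exactly one of them reports 1.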
On Wed, Mar 22, 2017 at 11:46 AM, Marcos Juarez <mjua...@gmail.com> wrote:

> We're seeing the same exact pattern of ISR shrinking/resizing, mostly on
> partitions with the largest volume, with thousands of messages per second.
> It happens at least a hundred times a day in our production cluster. We do
> have hundreds of topics in our cluster, most of them with 20 or more
> partitions, but most of them see only a few hundred messages per minute.
>
> We're running Kafka 0.10.0.1, and we thought upgrading to 0.10.1.1 would
> fix the issue, but we've already deployed that version to our staging
> cluster, and we're seeing the same problem. We still haven't tried out the
> latest 0.10.2.0 version, but I don't see any evidence pointing to a fix in
> that release.
>
> This ticket seems to have some similar details, but there doesn't appear
> to have been any follow-up, and there's no target release for a fix:
>
> https://issues.apache.org/jira/browse/KAFKA-4674
>
> Jun Ma, what exactly did you do to fail over the controller to a new
> broker? If that works for you, I'd like to try it in our staging clusters.
>
> Thanks,
>
> Marcos Juarez
>
>
> On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1...@gmail.com> wrote:
>
>> I have a similar issue with our cluster. We don't know the root cause,
>> but we have some interesting observations.
>>
>> 1. We do see a correlation between ISR churn and fetcher connection
>> close/create.
>>
>> 2. We've tried adding a broker that doesn't have any partitions on it,
>> dedicated to the controller (rolling-restarting the existing brokers and
>> failing over the controller to the newly added broker), and that did
>> indeed eliminate the random ISR churn. We have a cluster of 6 brokers
>> (besides the dedicated controller) and each one has about 300 partitions
>> on it. I suspect that a Kafka broker cannot handle running the controller
>> plus 300 partitions.
>>
>> Anyway, that's all I have so far; I'd also like to know how to debug
>> this. We're running Kafka 0.9.0.1 with an 8G heap.
>>
>> Thanks,
>> Jun
>>
>> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.re...@gmail.com> wrote:
>>
>> Any errors related to ZooKeeper session timeouts? We can also check for
>> excessive GC.
>> Sometimes this is due to multiple controllers forming because of soft
>> failures.
>> You can check ActiveControllerCount on the brokers; only one broker in
>> the cluster should have a value of 1.
>> Also check for network issues/partitions.
>>
>> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradu...@gmail.com> wrote:
>>
>> Hello,
>>
>> Does anyone know how I can debug high ISR churn on the Kafka leader on a
>> cluster without traffic?
>> I have 2 topics on a 4-node cluster (one with replication factor 4 and
>> one with replication factor 3) and both show constant changes in the
>> number of in-sync replicas:
>>
>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5 (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5 (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4 (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5 (kafka.cluster.Partition)
>>
>> and
>>
>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1 (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2 (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>>
>> I have tried increasing num.network.threads (now 8) and
>> num.replica.fetchers (now 2) but nothing has changed.
>>
>> The Kafka server config is:
>>
>> default.replication.factor=4
>> log.retention.check.interval.ms=300000
>> log.retention.hours=168
>> log.roll.hours=24
>> log.segment.bytes=104857600
>> min.insync.replicas=2
>> num.io.threads=8
>> num.network.threads=15
>> num.partitions=1
>> num.recovery.threads.per.data.dir=1
>> num.replica.fetchers=2
>> offsets.topic.num.partitions=1
>> offsets.topic.replication.factor=3
>> replica.lag.time.max.ms=500
>> socket.receive.buffer.bytes=102400
>> socket.request.max.bytes=104857600
>> socket.send.buffer.bytes=102400
>> unclean.leader.election.enable=false
>> zookeeper.connection.timeout.ms=3000
>>
>> Best regards,
>> Radu
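On Manikumar's point about ActiveControllerCount further up the thread: if JMX is enabled on the brokers, the JmxTool class that ships with Kafka can read that MBean remotely. A rough sketch (the host broker1 and port 9999 are assumptions; use whatever JMX_PORT your brokers expose):

  bin/kafka-run-class.sh kafka.tools.JmxTool \
    --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
    --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount \
    --reporting-interval 5000

Run it against each broker; exactly one should report 1 and the rest 0. If more than one reports 1, you likely have the soft-failure / multiple-controller situation Manikumar described. The ISR churn itself is also exposed as the kafka.server:type=ReplicaManager,name=IsrShrinksPerSec and IsrExpandsPerSec meters, which can be polled the same way.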