We have a Kafka cluster with 22 nodes, hosting ~3700 topics and ~15000 partitions. It ran fine for a long time, but one day a bunch of brokers (around half the cluster) started falling out of ISRs, with the following messages:

A) In the partition leaders' logs:

[2016-01-12 19:01:19,363] INFO Partition [RADM_3600_7,0] on broker 18: Shrinking ISR for partition [RADM_3600_7,0] from 18,25 to 18 (kafka.cluster.Partition)
[2016-01-12 19:01:19,367] INFO Partition [RADM_3600_7,0] on broker 18: Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

B) Correspondingly, in the ZooKeeper leader's log:

Tue Jan 12 19:01:19 2016: 2016-01-12 19:01:19,364 - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when processing sessionid:0x251a8968b80f32d type:setData cxid:0x882ade zxid:0x501b0a9d1 txntype:-1 reqpath:n/a Error Path:/brokers/topics/RADM_3600_7/partitions/0/state Error:KeeperErrorCode = BadVersion for /brokers/topics/RADM_3600_7/partitions/0/state

C) In the controller, before the ISR shrinking in A):

[2016-01-12 17:50:15,908] INFO [PreferredReplicaPartitionLeaderSelector]: Current leader 25 for partition [RADM_3600_7,0] is not the preferred replica. Trigerring preferred replica leader election (kafka.controller.PreferredReplicaPartitionLeaderSelector)
[2016-01-12 17:50:15,908] WARN [Controller 17]: Partition [RADM_3600_7,0] failed to complete preferred replica leader election. Leader is 25 (kafka.controller.KafkaController)

Around 70-80% of the partitions were operating with only one broker in the ISR. To finally fix this we had to wipe the state entirely: the data, the topics, everything. We also have another deployment, mirroring this one, which has been running fine.

G'day,
Chiru
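
P.S. For anyone who wants to see the mismatch in A) and B) directly: the "Cached zkVersion [5] not equal to that in zookeeper" message means the broker is attempting a conditional update of the partition state znode using a stale version number, which ZooKeeper rejects with exactly the BadVersion error in B). A rough sketch of what that looks like from the outside, assuming the Python kazoo client and a placeholder ensemble address (any ZK client would do):

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts='zk1:2181')  # placeholder, adjust to your ensemble
zk.start()

path = '/brokers/topics/RADM_3600_7/partitions/0/state'
data, stat = zk.get(path)

# stat.version is the ZooKeeper dataVersion that Kafka calls "zkVersion".
# If the broker cached 5 but this prints something higher, the broker's
# conditional setData will keep getting rejected.
print('state:', data)
print('zkVersion in ZooKeeper:', stat.version)

# A conditional write with a stale version fails just like the
# KeeperException logged by the ZooKeeper leader in B):
try:
    zk.set(path, data, version=5)  # 5 = the stale cached version from the log
except BadVersionError:
    print('BadVersion: the znode was bumped behind the broker\'s back')

zk.stop()

Obviously don't run the zk.set() against a live cluster; it's only there to illustrate the failure mode.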
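P.P.S. The controller messages in C) say the current leader (25) is not the preferred replica, i.e. not the first broker in the partition's assigned replica list. That can be cross-checked from the same ZK layout; a sketch along the same lines (again kazoo, and assuming the usual /brokers/topics/<topic> JSON with a "partitions" map of partition -> replica list):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181')  # placeholder, adjust to your ensemble
zk.start()

topic = 'RADM_3600_7'
assignment, _ = zk.get('/brokers/topics/%s' % topic)
replicas = json.loads(assignment)['partitions']  # e.g. {"0": [18, 25]}

for partition, replica_list in sorted(replicas.items()):
    state, _ = zk.get('/brokers/topics/%s/partitions/%s/state' % (topic, partition))
    state = json.loads(state)
    preferred = replica_list[0]  # preferred leader = first assigned replica
    if state['leader'] != preferred:
        print('[%s,%s] leader=%s preferred=%s isr=%s'
              % (topic, partition, state['leader'], preferred, state['isr']))

zk.stop()

This just lists every partition of the topic whose current leader differs from the preferred one, which is the condition the controller was repeatedly (and unsuccessfully) trying to repair in C).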