We have a Kafka cluster with 22 nodes, hosting ~3700 topics and ~15000 partitions. It ran fine for a long time, but one day a bunch of brokers (around half the cluster) started falling out of ISRs, with the following messages:

A) In the partition leaders' logs:

[2016-01-12 19:01:19,363] INFO Partition [RADM_3600_7,0] on broker 18: Shrinking ISR for partition [RADM_3600_7,0] from 18,25 to 18 (kafka.cluster.Partition)
[2016-01-12 19:01:19,367] INFO Partition [RADM_3600_7,0] on broker 18: Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

B) Correspondingly, in the ZooKeeper leader's log:

Tue Jan 12 19:01:19 2016: 2016-01-12 19:01:19,364 - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when processing sessionid:0x251a8968b80f32d type:setData cxid:0x882ade zxid:0x501b0a9d1 txntype:-1 reqpath:n/a Error Path:/brokers/topics/RADM_3600_7/partitions/0/state Error:KeeperErrorCode = BadVersion for /brokers/topics/RADM_3600_7/partitions/0/state

C) In the controller, before the ISR shrinking in A):

[2016-01-12 17:50:15,908] INFO [PreferredReplicaPartitionLeaderSelector]: Current leader 25 for partition [RADM_3600_7,0] is not the preferred replica. Trigerring preferred replica leader election (kafka.controller.PreferredReplicaPartitionLeaderSelector)
[2016-01-12 17:50:15,908] WARN [Controller 17]: Partition [RADM_3600_7,0] failed to complete preferred replica leader election. Leader is 25 (kafka.controller.KafkaController)

Around 70-80% of the partitions were operating with only one broker in the ISR. To finally fix this we had to wipe the state entirely: the data, the topics, everything. We also have another deployment, mirroring this one, which has been running fine.

G'day,
Chiru
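
P.S. For anyone who wants to see the mismatch in A) and B) directly: the "Cached zkVersion [5] not equal to that in zookeeper" message means the broker is attempting a conditional update of the partition state znode using a stale version number, which ZooKeeper rejects with exactly the BadVersion error in B). A rough sketch of what that looks like from the outside, assuming the Python kazoo client and a placeholder ensemble address (any ZK client would do):

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts='zk1:2181')  # placeholder, adjust to your ensemble
zk.start()

path = '/brokers/topics/RADM_3600_7/partitions/0/state'
data, stat = zk.get(path)

# stat.version is the ZooKeeper dataVersion that Kafka calls "zkVersion".
# If the broker cached 5 but this prints something higher, the broker's
# conditional setData will keep getting rejected.
print('state:', data)
print('zkVersion in ZooKeeper:', stat.version)

# A conditional write with a stale version fails just like the
# KeeperException logged by the ZooKeeper leader in B):
try:
    zk.set(path, data, version=5)  # 5 = the stale cached version from the log
except BadVersionError:
    print('BadVersion: the znode was bumped behind the broker\'s back')

zk.stop()

Obviously don't run the zk.set() against a live cluster; it's only there to illustrate the failure mode.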
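P.P.S. The controller messages in C) say the current leader (25) is not the preferred replica, i.e. not the first broker in the partition's assigned replica list. That can be cross-checked from the same ZK layout; a sketch along the same lines (again kazoo, and assuming the usual /brokers/topics/<topic> JSON with a "partitions" map of partition -> replica list):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181')  # placeholder, adjust to your ensemble
zk.start()

topic = 'RADM_3600_7'
assignment, _ = zk.get('/brokers/topics/%s' % topic)
replicas = json.loads(assignment)['partitions']  # e.g. {"0": [18, 25]}

for partition, replica_list in sorted(replicas.items()):
    state, _ = zk.get('/brokers/topics/%s/partitions/%s/state' % (topic, partition))
    state = json.loads(state)
    preferred = replica_list[0]  # preferred leader = first assigned replica
    if state['leader'] != preferred:
        print('[%s,%s] leader=%s preferred=%s isr=%s'
              % (topic, partition, state['leader'], preferred, state['isr']))

zk.stop()

This just lists every partition of the topic whose current leader differs from the preferred one, which is the condition the controller was repeatedly (and unsuccessfully) trying to repair in C).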