Hello Kafka users,

We have been testing failure scenarios and found the following situation in which the Kafka cluster will not recover automatically.
Cluster setup: 3 VMs (n1, n2, n3) running CentOS 7. Each VM runs a ZooKeeper v3.4.13 and a Kafka v2.1.0 instance, configured as systemd services, on OpenJDK 1.8.0_191.

Current situation: n1 is the Kafka controller, n2 and n3 are leaders for some partitions:

~/kafka/bin/kafka-topics.sh --zookeeper n1:2181 --describe --topic __consumer_offsets
Topic:__consumer_offsets    PartitionCount:1    ReplicationFactor:3    Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
    Topic: __consumer_offsets    Partition: 0    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2

If I reboot both n2 and n3 simultaneously, causing ZooKeeper to lose quorum and the topics to lose their leaders, the Kafka cluster never recovers. ZooKeeper regains quorum as soon as n2 and n3 are back up, and n1 remains the Kafka controller (see the P.S. below for a way to check this from ZooKeeper's side), but I get the following error in all three Kafka logs, repeated forever:

[2019-03-20 09:43:02,524] ERROR [KafkaApi-2] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
[2019-03-20 09:43:02,830] ERROR [KafkaApi-2] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
[2019-03-20 09:43:03,486] ERROR [KafkaApi-2] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)

If I restart the Kafka process on n1 (the old controller), the cluster fully recovers. However, the old controller does not shut down gracefully: I see "Retrying controlled shutdown after the previous attempt failed..." in the logs, and the process is eventually killed by systemd.

I could not reproduce the problem when one of the rebooted nodes is the controller. It looks to me like a race condition, as I can only reproduce the problem about half the time.

Best regards,
Radu
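P.S. For anyone trying to reproduce this: the controller and broker registrations can be inspected directly in ZooKeeper, for example with the zookeeper-shell.sh that ships with Kafka (n1:2181 here is just one member of the same ensemble; any member works):

    # shows the id of the broker currently acting as controller
    ~/kafka/bin/zookeeper-shell.sh n1:2181 get /controller

    # lists the ids of the brokers currently registered with ZooKeeper
    ~/kafka/bin/zookeeper-shell.sh n1:2181 ls /brokers/ids

The /controller znode contains the id of the active controller, and /brokers/ids holds one ephemeral node per live broker, so these two checks show whether the rebooted brokers have re-registered and whether the controller has moved.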