Hi - We have a small Kafka 2.0.0 (Zookeeper 3.4.13) cluster with 3 brokers: 0, 1, and 2. Each broker is in a separate rack (Azure zone).
Recently there was an incident in which the Kafka brokers and Zookeeper nodes restarted. Since then, broker 2 has been consistently missing from many ISRs.

The pattern we've observed is that broker 2 is not in the ISR of any partition where broker 0 is leader, but is in the ISR of partitions where broker 1 is leader. Then at some point the controller moves to a different broker, and the situation reverses: 2 is not in any ISR where 1 is leader, but is in ISRs where 0 is leader. Each time the controller changes, this "flip-flopping" of 2 in and out of ISRs switches sides. No matter what, 2 never seems to get into all ISRs.

For topics with replicas=3, min.insync.replicas=2, and producers with acks=all, we only ever have ISR=(0,1), and occasionally 0 or 1 also briefly falls out of the ISR, leading to producer retries and sometimes send failures for producers that use retries=3.

Any ideas what might be happening here, and how we could fix it? Or is there additional data we could collect to help diagnose the problem? We are planning to upgrade this cluster as soon as we get it working correctly.

Thanks,
Zach
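
P.S. In case it helps with collecting data, here is a rough Python sketch of how the "missing from ISR, grouped by leader" pattern could be tabulated from the text output of `kafka-topics.sh --describe`. The function name `missing_isr_by_leader` and the regular expression are illustrative assumptions about the partition-line layout (`Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`); they are not part of Kafka itself and may need adjusting for other versions.

```python
import re
from collections import defaultdict

# Assumed layout of a partition line in `kafka-topics.sh --describe` output:
#   Topic: foo  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def missing_isr_by_leader(describe_output):
    """Return {leader: {broker: count}} for replicas that are not in the ISR."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary line or unrelated output
        leader = int(match.group("leader"))
        replicas = {int(b) for b in match.group("replicas").split(",")}
        isr = {int(b) for b in match.group("isr").split(",") if b}
        # Count every assigned replica that is absent from the ISR.
        for broker in replicas - isr:
            summary[leader][broker] += 1
    return {leader: dict(counts) for leader, counts in summary.items()}
```

The idea would be to capture the describe output periodically, together with the current controller id, and feed each capture to this function. If the pattern described above holds, the result should show broker 2 missing only under one leader at a time, and the affected leader should change whenever the controller moves.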