Hi - We have a small Kafka 2.0.0 (ZooKeeper 3.4.13) cluster with 3 brokers:
0, 1, and 2. Each broker is in a separate rack (Azure zone).

Recently there was an incident in which the Kafka brokers and ZooKeeper
nodes restarted. Since then, broker 2 has been consistently missing from
many ISRs. The pattern we've observed is that broker 2 is not in the ISR
of any partition where broker 0 is leader, but is in the ISRs of
partitions where broker 1 is leader. Then at some point the controller
moves to a different broker, and the pattern inverts: broker 2 is not in
any ISR where 1 is leader, but is in ISRs where 0 is leader. Each time the
controller changes, this "flip-flopping" of broker 2 in and out of ISRs
switches sides. No matter what, 2 never gets into all ISRs.
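
The per-leader pattern described above can be tallied from the output of
`kafka-topics.sh --describe`. The following is a minimal sketch, assuming
the usual partition-line layout (Topic / Partition / Leader / Replicas /
Isr). The function name, the topic name `events`, and the sample lines are
hypothetical illustrations, not data from this cluster.

```python
import re
from collections import defaultdict

# Matches one partition line of `kafka-topics.sh --describe` output.
# The layout is an assumption; adjust the pattern if your output differs.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def isr_membership_by_leader(describe_output, broker_id):
    """For each leader, count partitions where broker_id is an assigned
    replica and is (or is not) currently in the ISR."""
    counts = defaultdict(lambda: {"in_isr": 0, "out_of_isr": 0})
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # topic summary lines and blank lines do not match
        replicas = {int(r) for r in match.group("replicas").split(",") if r}
        isr = {int(r) for r in match.group("isr").split(",") if r}
        if broker_id not in replicas:
            continue  # broker is not assigned to this partition
        key = "in_isr" if broker_id in isr else "out_of_isr"
        counts[int(match.group("leader"))][key] += 1
    return dict(counts)


# Hypothetical sample input, shaped like the pattern described above.
SAMPLE = (
    "Topic:events\tPartitionCount:3\tReplicationFactor:3\tConfigs:min.insync.replicas=2\n"
    "\tTopic: events\tPartition: 0\tLeader: 0\tReplicas: 0,1,2\tIsr: 0,1\n"
    "\tTopic: events\tPartition: 1\tLeader: 1\tReplicas: 1,2,0\tIsr: 1,2,0\n"
    "\tTopic: events\tPartition: 2\tLeader: 0\tReplicas: 2,0,1\tIsr: 0,1\n"
)

if __name__ == "__main__":
    for leader, tally in sorted(isr_membership_by_leader(SAMPLE, 2).items()):
        print(leader, tally)
```

Running this before and after a controller change, and comparing the
tallies per leader, would show whether the affected leader really moves
each time.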

For topics with replicas=3, min.insync.replicas=2, and producers with
acks=all, we only ever have ISR=(0,1), and occasionally 0 or 1 also briefly
falls out of ISR, leading to producer retries and sometimes send failures
for producers that use retries=3.
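
To make the arithmetic behind those failures concrete: with acks=all, a
produce request is rejected whenever the ISR is smaller than
min.insync.replicas, so with ISR=(0,1) and min.insync.replicas=2 a single
further drop leaves no headroom. The sketch below is a deliberately
simplified model, not broker or client code. The function name and the
per-attempt ISR sizes are hypothetical, and it ignores timeouts and
backoff.

```python
def first_successful_attempt(isr_size_per_attempt, min_insync_replicas=2, retries=3):
    """Return the 1-based attempt on which an acks=all send is accepted,
    or None if the initial attempt and every retry are rejected.

    isr_size_per_attempt lists the ISR size seen at each attempt
    (a hypothetical input used only for illustration).
    """
    max_attempts = 1 + retries  # the initial send plus the configured retries
    for attempt, isr_size in enumerate(isr_size_per_attempt[:max_attempts], start=1):
        if isr_size >= min_insync_replicas:
            return attempt  # enough in-sync replicas, the write is accepted
    return None  # retries exhausted while the ISR stayed too small


if __name__ == "__main__":
    print(first_successful_attempt([2]))              # ISR=(0,1) is healthy
    print(first_successful_attempt([1, 1, 2]))        # brief shrink, recovered by retry
    print(first_successful_attempt([1, 1, 1, 1, 2]))  # shrink outlasts retries=3
```

In this model, a shrink that lasts longer than the four attempts allowed by
retries=3 surfaces as a send failure, which matches what we see.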

Any ideas what might be happening here, and how we could fix it? Is there
additional data we could collect to help diagnose the problem? We are
planning to upgrade this cluster as soon as we get it working correctly.

Thanks,
Zach
