[ https://issues.apache.org/jira/browse/KAFKA-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jose Armando Garcia Sancio reassigned KAFKA-9672:
-------------------------------------------------

    Assignee: Jose Armando Garcia Sancio

> Dead brokers in ISR cause isr-expiration to fail with exception
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9672
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9672
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.0, 2.4.1
>            Reporter: Ivan Yurchenko
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> We're running Kafka 2.4 and facing a pretty strange situation.
> Let's say there were three brokers in the cluster: 0, 1, and 2. Then:
> 1. Broker 3 was added.
> 2. Partitions were reassigned from broker 0 to broker 3.
> 3. Broker 0 was shut down (not gracefully) and removed from the cluster.
> 4. We see the following state in ZooKeeper:
> {code:java}
> ls /brokers/ids
> [1, 2, 3]
> get /brokers/topics/foo
> {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}
> get /brokers/topics/foo/partitions/0/state
> {"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
> {code}
> This means the dead broker 0 remains in the partition's ISR. A big share of
> the partitions in the cluster have this issue.
> This is actually causing errors:
> {code:java}
> Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 12 is not available on broker 17
> {code}
> This means that the {{isr-expiration}} task is effectively not working any more.
> I have a suspicion that this was introduced by
> [this commit (line selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]
> Unfortunately, I haven't been able to reproduce this in isolation.
> Any hints about how to reproduce (so I can write a patch) or mitigate the
> issue on a running cluster are welcome.
> Generally, I assume that not throwing {{ReplicaNotAvailableException}} on a
> dead (i.e. non-existent) broker, and instead considering it out-of-sync and
> removing it from the ISR, should fix the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
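
The condition described in the issue (an ISR entry that is no longer registered under /brokers/ids) and the fix the reporter suggests (treat such a broker as out-of-sync and drop it from the ISR) can be illustrated with a small sketch. This is not Kafka's code and not the eventual patch: the function names `dead_isr_members` and `shrink_isr` are hypothetical, and the inputs are simply the ZooKeeper values quoted in the report. It only shows the set logic, assuming the live broker list and the partition state JSON have already been read.

```python
import json

# Values copied from the ZooKeeper output quoted in the report.
LIVE_BROKER_IDS = [1, 2, 3]
PARTITION_STATE = (
    '{"controller_epoch":123,"leader":1,"version":1,'
    '"leader_epoch":42,"isr":[0,2,3,1]}'
)


def dead_isr_members(live_broker_ids, partition_state_json):
    """Return ISR entries that are not among the live broker ids.

    Hypothetical helper for illustration only.
    """
    state = json.loads(partition_state_json)
    live = set(live_broker_ids)
    return [broker for broker in state["isr"] if broker not in live]


def shrink_isr(live_broker_ids, partition_state_json):
    """Return the ISR with non-existent brokers removed, order preserved.

    Mirrors the reporter's suggestion: a dead broker is considered
    out-of-sync instead of raising an error. Hypothetical helper.
    """
    state = json.loads(partition_state_json)
    live = set(live_broker_ids)
    return [broker for broker in state["isr"] if broker in live]


if __name__ == "__main__":
    print("dead ISR members:", dead_isr_members(LIVE_BROKER_IDS, PARTITION_STATE))
    print("ISR after shrink:", shrink_isr(LIVE_BROKER_IDS, PARTITION_STATE))
```

With the values from the report, broker 0 is the only ISR member missing from the live broker list, so the shrunken ISR would be 2, 3, 1. Run against every partition state, a check like this could help estimate how many partitions are affected on a running cluster.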