Martin Dickson created KAFKA-17562:
--------------------------------------

             Summary: Failure detection for degraded brokers
                 Key: KAFKA-17562
                 URL: https://issues.apache.org/jira/browse/KAFKA-17562
             Project: Kafka
          Issue Type: Improvement
          Components: core, replication
            Reporter: Martin Dickson

Follow-on from [this mailing list discussion|https://lists.apache.org/thread/z8xn2dm1zm3clymhh60hf7rzgw286k8q].

When the leader of a partition becomes degraded but does not fully fail, it can remove all follower replicas from the ISR. This can happen solely because of a problem with the leader (slow disk, degraded network, ...), so a single failure can make the partition unavailable for writes (assuming min.insync.replicas=2). If the leader then fully fails, the partition goes offline, which introduces a risk of data loss during recovery.

The recovery options will improve substantially with KIP-966 (again assuming min.insync.replicas=2), but there is still a gap around failure detection. In particular, KIP-966 alone does not help when the broker is degraded for a long period of time without fully failing.

Currently, Kafka failure detection is based on whether the broker can maintain its connection with the metadata quorum. The suggestion here is to consider more comprehensive failure detection, which could be handled by demoting the broker from leadership rather than fully fencing it.
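
To make the failure mode concrete, here is a minimal sketch in Python. It is a toy model, not Kafka code: `Partition`, `shrink_isr`, and `can_accept_write` are hypothetical names chosen for illustration. The only rule it models is that a write with acks=all is rejected once the ISR holds fewer replicas than min.insync.replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition; not Kafka's real data structures."""

    leader: int
    replicas: list[int]
    min_insync_replicas: int = 2
    isr: set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Start healthy: every replica is in sync.
        if not self.isr:
            self.isr = set(self.replicas)

    def shrink_isr(self, follower: int) -> None:
        # The leader drops a follower it considers out of sync.
        # The leader itself always stays in the ISR.
        if follower != self.leader:
            self.isr.discard(follower)

    def can_accept_write(self) -> bool:
        # acks=all writes need at least min.insync.replicas in the ISR.
        return len(self.isr) >= self.min_insync_replicas


partition = Partition(leader=1, replicas=[1, 2, 3])
print(partition.can_accept_write())  # True

# A degraded leader removes both followers from the ISR.
for follower in (2, 3):
    partition.shrink_isr(follower)

print(partition.isr, partition.can_accept_write())  # {1} False
```

Nothing in this model requires the followers to be at fault: the leader alone decides ISR membership, so a problem on the leader is enough to block writes while the broker still counts as alive.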