Martin Dickson created KAFKA-17562:
--------------------------------------

             Summary: Failure detection for degraded brokers
                 Key: KAFKA-17562
                 URL: https://issues.apache.org/jira/browse/KAFKA-17562
             Project: Kafka
          Issue Type: Improvement
          Components: core, replication
            Reporter: Martin Dickson


Follow-on from [this mailing list 
discussion|https://lists.apache.org/thread/z8xn2dm1zm3clymhh60hf7rzgw286k8q].

When a leader for a partition becomes degraded but does not fully fail, it can 
remove all follower replicas from the ISR. This can happen solely due to a 
problem with the leader (slow disk, degraded network, ...), and hence a single 
failure can make the partition unavailable for writes (assuming 
min.insync.replicas=2). If the leader then fully fails, the partition goes 
offline, which introduces data loss risks during recovery.
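
To make the failure mode concrete, here is a toy model (not Kafka code) of how a 
single degraded leader can shrink the ISR to itself. It assumes the default 
replica.lag.time.max.ms of 30000 and min.insync.replicas=2; the function names 
and the timestamps are hypothetical and chosen only for illustration.

```python
# Toy model only: names and numbers are illustrative assumptions, not Kafka internals.
REPLICA_LAG_TIME_MAX_MS = 30_000  # Kafka default for replica.lag.time.max.ms
MIN_INSYNC_REPLICAS = 2           # value assumed throughout this issue


def current_isr(leader, last_caught_up_ms, now_ms):
    """Return the ISR as seen by the leader.

    The leader always counts itself. A follower stays in the ISR only if it
    caught up to the leader's log end within REPLICA_LAG_TIME_MAX_MS.
    """
    isr = {leader}
    for broker, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= REPLICA_LAG_TIME_MAX_MS:
            isr.add(broker)
    return isr


def accepts_acks_all_writes(isr):
    """Writes with acks=all are rejected once the ISR is below min.insync.replicas."""
    return len(isr) >= MIN_INSYNC_REPLICAS


# Broker 1 leads and has a slow disk. Followers 2 and 3 are healthy, but their
# fetches stall on the leader, so neither has caught up since t=0.
followers = {2: 0, 3: 0}
isr = current_isr(leader=1, last_caught_up_ms=followers, now_ms=45_000)
print(sorted(isr), accepts_acks_all_writes(isr))
```

In this model the healthy followers are the ones removed, and the degraded 
leader is the only replica left in the ISR.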

The recovery options will improve substantially with KIP-966 (again assuming 
min.insync.replicas=2), but there is still a gap around failure detection. 
In particular, KIP-966 alone doesn't help when the broker is degraded but does 
not fully fail for a long period of time.

Currently, Kafka failure detection is based on whether the broker can maintain 
its connection with the metadata quorum. The suggestion here is to consider 
more comprehensive failure detection, where a degraded broker could be handled 
by demoting it from leadership rather than fully fencing it.
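
As a rough sketch of the distinction being suggested, the decision could look 
something like the following. This is purely hypothetical: the Action enum, the 
decide function, and the "degraded" signal are assumptions made for 
illustration, and how degradation would actually be measured is the open 
question of this issue.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical controller responses; not an existing Kafka API."""
    NONE = "none"
    DEMOTE_LEADERSHIP = "demote_leadership"
    FENCE = "fence"


def decide(heartbeat_ok: bool, degraded: bool) -> Action:
    """Pick a response for one broker.

    heartbeat_ok: the broker still maintains its connection with the
                  metadata quorum (the signal used today).
    degraded:     an assumed additional health signal, e.g. derived from
                  slow disk or network symptoms.
    """
    if not heartbeat_ok:
        # Existing behaviour: a broker that loses its quorum connection is fenced.
        return Action.FENCE
    if degraded:
        # Proposed middle ground: move leadership away, keep the broker as a follower.
        return Action.DEMOTE_LEADERSHIP
    return Action.NONE


print(decide(heartbeat_ok=True, degraded=True))
```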



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
