Hi Kafka community, I'd like to propose a small change related to OfflinePartitionLeaderElectionStrategy. In our system, we usually have RF = 3, Min_ISR = 2, unclean.leader.election = false, and clients usually set acks=all when publishing. We have observed that occasionally, when a disk goes bad, a partition goes offline and stays offline, which of course causes an availability issue, and we have to manually set unclean.leader.election = true to bring the partition back online. Partitions going offline due to disk failure have become a huge operational pain for us.
Looking into it, the sequence of events is:

1. First, the ISR for that partition drops to 1 (possibly because the bad disk causes the broker to respond to fetches more slowly. Note that a dying disk doesn't cause this every time, only occasionally).
2. Then the disk gives up completely, and the failure takes the leader replica offline.
3. Because the ISR is 1, OfflinePartitionLeaderElectionStrategy won't choose a leader if unclean.leader.election = false.

The observation here is that, in this case, even though the last failed replica is not in the ISR, it should still have the same HW as the failed leader replica. So OfflinePartitionLeaderElectionStrategy should select the last failed replica as the leader, especially if it has the same HW.

So the proposal is:

1. Choose a replica as the leader if it has the same HW as the failed leader (even if it is not in the ISR).
2. Further, when unclean.leader.election = true, choose the replica with the highest HW as the leader.

Let me know if this makes sense, or if you have any suggestions. If so, I will create a JIRA and work on it.

Thanks!
Ming
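
P.S. To make the two proposal points concrete, here is a rough Java sketch of the selection order I have in mind. It is not the actual controller code. The class name, the method signature, and in particular the highWatermarks map and failedLeaderHw value are hypothetical: the sketch simply assumes the controller could somehow learn each replica's last known HW, which would be part of the work. The first loop reflects the existing behaviour only as I understand it (first live replica in assignment order that is in the ISR).

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the proposed selection order; not Kafka's real implementation.
final class ProposedOfflineElection {

    /**
     * @param assignment     replica ids in assignment order
     * @param isr            last known in-sync replica ids
     * @param live           replica ids whose brokers are currently alive
     * @param highWatermarks ASSUMPTION: last known HW per replica id
     * @param failedLeaderHw ASSUMPTION: last known HW of the failed leader
     * @param uncleanEnabled whether unclean leader election is allowed
     */
    static Optional<Integer> selectLeader(List<Integer> assignment,
                                          Set<Integer> isr,
                                          Set<Integer> live,
                                          Map<Integer, Long> highWatermarks,
                                          long failedLeaderHw,
                                          boolean uncleanEnabled) {
        // Existing behaviour (as I understand it): first live replica that is in the ISR.
        for (int replica : assignment) {
            if (live.contains(replica) && isr.contains(replica)) {
                return Optional.of(replica);
            }
        }
        // Proposal 1: a live replica outside the ISR is still eligible
        // if its HW equals the failed leader's HW.
        for (int replica : assignment) {
            Long hw = highWatermarks.get(replica);
            if (live.contains(replica) && hw != null && hw == failedLeaderHw) {
                return Optional.of(replica);
            }
        }
        // Proposal 2: with unclean election enabled, prefer the live replica
        // with the highest known HW (unknown HW is treated as -1).
        if (uncleanEnabled) {
            Integer best = null;
            long bestHw = Long.MIN_VALUE;
            for (int replica : assignment) {
                long hw = highWatermarks.getOrDefault(replica, -1L);
                if (live.contains(replica) && hw > bestHw) {
                    best = replica;
                    bestHw = hw;
                }
            }
            return Optional.ofNullable(best);
        }
        return Optional.empty();
    }
}
```

For example, with assignment [1, 2, 3], ISR {1}, broker 1 dead, and replica 2 at the same HW as the failed leader, this would pick replica 2 even with unclean election disabled. If no live replica matches the leader's HW, the partition stays offline unless unclean election is enabled, in which case the replica with the highest HW wins.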