Hi all,

Community members Jason Gustafson, Colin P. McCabe, and I have been having some offline conversations.

At a high level, KIP-853 solves two problems:

1) How can KRaft detect and recover from disk failures on a minority of the voters?
2) How can KRaft support a changing set of voter nodes?

I think that problem 2) is a superset of problem 1): the mechanism for solving problem 2) can also be used to solve problem 1). This is why I decided to design them together and proposed this KIP.

Problem 2) adds the requirement that observers (Brokers and new Controllers) be able to discover the leader. KIP-853 solves this by returning the endpoint of the leader in all of the KRaft RPCs. There are some concerns with this approach.

To solve problem 1) we don't need to return the leader's endpoint, since it is expressed in the controller.quorum.voters property. To make faster progress on 1), I have decided to create "KIP-856: KRaft Disk Failure Recovery", which addresses only this problem. I will be starting a discussion thread for KIP-856 soon.

We can continue the discussion of KIP-853 here. If KIP-856 gets approved, I will do one of the following:

3) Modify KIP-853 to describe only the improvements needed on top of KIP-856.
4) Create a new KIP and abandon KIP-853. This new KIP will take into account all of the discussion from this thread.

Thanks!
--
-José