Thanks for the feedback, Colin. Comments below.

On Tue, Jan 9, 2024 at 4:58 PM Colin McCabe <cmcc...@apache.org> wrote:
> 1. restarting a controller with an empty storage directory
>
> The controller can contact the quorum to get the cluster ID and current MV. 
> If the MV doesn't support quorum reconfiguration, it can bail out. Otherwise, 
> it can re-format itself with a random directory ID. It can then remove (ID, 
> OLD_DIR_ID) from the quorum, and add (ID, NEW_DIR_ID) to the quorum.
>
> I think this can all be done automatically without user intervention. If the 
> remove / add steps fail (because the quorum is down, for example), then of 
> course we can just log an exception and bail out.
>

Yes. We should be able to implement this. I'll update the KIP and add
another configuration: controller.quorum.auto.join.enabled=true

The high-level algorithm is something like this:
1. The controller will fetch the latest quorum state from the leader.
2. The controller will remove any voter that matches its replica id
but doesn't match its directory id (replica uuid).
3. If the controller (replica id and replica uuid) is not in the voter
set, it sends an AddVoter RPC to the active controller until it sees
itself in the voter set.
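
To make the three steps concrete, here is a rough Python sketch. It is
only an illustration under my own assumptions, not the KRaft
implementation: Voter, FakeLeader, auto_join and max_attempts are
hypothetical names, and the leader is an in-memory stand-in rather than
real RPCs. It makes one voter change per iteration and re-fetches the
voter set before the next one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    replica_id: int
    directory_id: str  # the "replica uuid" from the steps above


class FakeLeader:
    """Hypothetical in-memory stand-in for the quorum leader."""

    def __init__(self, voters):
        self.voters = set(voters)

    def fetch_voters(self):
        return frozenset(self.voters)

    def remove_voter(self, voter):
        self.voters.discard(voter)

    def add_voter(self, voter):
        self.voters.add(voter)


def auto_join(leader, me, max_attempts=10):
    """Return True once `me` is in the voter set with no stale entry left."""
    for _ in range(max_attempts):
        # Step 1: fetch the latest quorum state from the leader.
        voters = leader.fetch_voters()

        # Step 2: remove a voter with our replica id but another directory id.
        stale = next(
            (v for v in voters
             if v.replica_id == me.replica_id
             and v.directory_id != me.directory_id),
            None,
        )
        if stale is not None:
            leader.remove_voter(stale)
            continue  # one change at a time, then re-fetch

        # Step 3: add ourselves until we see ourselves in the voter set.
        if me in voters:
            return True
        leader.add_voter(me)
    return False
```

For example, with voters (1, old-dir), (2, b), (3, c) and a restarted
controller (1, new-dir), the loop first removes (1, old-dir), then adds
(1, new-dir), then observes itself in the voter set and stops.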

> 2. restarting a broker with an empty storage directory
>
> The broker can contact the quorum to get the cluster ID and current MV. If 
> the MV doesn't support directory IDs, we can bail out. Otherwise, it can 
> reformat itself with a random directory ID and start up. Its old replicas 
> will be correctly treated as gone due to the JBOD logic.

This feature seems reasonable to me. I don't think we should make this
part of this KIP. It should be a separate KIP as it is not related to
controller dynamic membership changes.

> 4. Bringing up a totally new cluster
>
> I think we need at least one controller node to be formatted, so that we can 
> decide what metadata version to use. Perhaps we should even require a quorum 
> of controller nodes to be explicitly formatted (aka, in practice, people just 
> format them all).

Yes. When I document this feature my recommended process would be:
1. One of the controllers needs to be formatted with --standalone
(kafka-storage format --cluster-id <cluster-id> --release-version 3.8
--standalone --config controller.properties). This needs to be an
explicit operation as it violates one of the invariants enumerated in
the KIP.
"To make changes to the voter set safe it is required that the
majority of the competing voter sets commit the voter changes. In this
design the competing voter sets are the current voter set and new
voter set. Since this design only allows one voter change at a time
the majority of the new configuration always overlaps (intercepts) the
majority of the old configuration. This is done by the leader
committing the current epoch when it becomes leader and committing
single voter changes with the new voter set before accepting another
voter change."

An easy example that shows the issue with auto formatting is:
1. Voter set is (1) by running --standalone.
2. Voter set becomes (1, 2, 3) after two AddVoter RPCs (either manually
or by auto joining).
3. Voter 1 loses its disk and reformats back to (1). Now it is possible
to have two quorums: one with just replica 1 and one with replicas
2 and 3.
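
The arithmetic behind this example can be checked with a small Python
sketch. The helper names (majority, can_elect) are made up for
illustration and this is not how KRaft counts votes internally; it only
shows that the two voter-set views each have a majority while sharing no
replica.

```python
def majority(voter_set):
    """Smallest number of voters that forms a majority of voter_set."""
    return len(voter_set) // 2 + 1


def can_elect(voter_set, reachable):
    """True if the reachable replicas include a majority of voter_set."""
    return len(voter_set & reachable) >= majority(voter_set)


# After the disk loss, replica 1 believes the voter set is (1), while
# replicas 2 and 3 still believe it is (1, 2, 3).
view_of_replica_1 = {1}
view_of_replicas_2_and_3 = {1, 2, 3}

# Replica 1 alone is a majority of its own view.
quorum_a = {1}
# Replicas 2 and 3 are a majority of their view.
quorum_b = {2, 3}

split_brain = (
    can_elect(view_of_replica_1, quorum_a)
    and can_elect(view_of_replicas_2_and_3, quorum_b)
    and quorum_a.isdisjoint(quorum_b)
)
```

Here split_brain is True: both groups can elect a leader and commit
without ever overlapping, which is the invariant violation quoted above.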


Thanks. I'll update the KIP shortly to reflect my comments above.
--
-José
