Thanks for the discussion so far, everyone!

Mickael

1) In this case, I think the KRaft controller could avoid sending UMR or LISR to this broker and not include it in LiveBrokers that are sent to the rest of the cluster. This would effectively fence it off from the rest of the cluster. The controller can report this as an error.

2) I agree with Colin's suggestion here. At some point, we'll need to block writes on the KRaft controller. We might want a "time that ZK is blocking KRaft" metric here to accompany the "ZooKeeperWriteBehindLag" metric. As we get on with the implementation, I think we'll be able to pick a reasonable size for the allowable backlog of writes to ZK.
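To make 2) a bit more concrete, here is a minimal sketch of the kind of backpressure I have in mind. This is only an illustration, not implementation code for the KIP; the class and method names (ZkWriteBehindTracker, beforeCommit, onZkWriteCompleted) are hypothetical, and the real controller would hook something like this into its existing metrics and event loop:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical tracker for the ZK write-behind backlog. The KRaft controller
    // would call beforeCommit() before accepting a new metadata write and
    // onZkWriteCompleted() once the corresponding write has been applied to ZK.
    public final class ZkWriteBehindTracker {
        private final long maxBacklog;                              // allowable backlog of ZK writes
        private final AtomicLong backlog = new AtomicLong();        // "ZooKeeperWriteBehindLag"
        private final AtomicLong timeBlockedMs = new AtomicLong();  // "time that ZK is blocking KRaft"
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition drained = lock.newCondition();

        public ZkWriteBehindTracker(long maxBacklog) {
            this.maxBacklog = maxBacklog;
        }

        public long writeBehindLag() { return backlog.get(); }
        public long timeZkBlockingKraftMs() { return timeBlockedMs.get(); }

        // Blocks the KRaft side once ZK has fallen too far behind.
        public void beforeCommit() throws InterruptedException {
            lock.lock();
            try {
                long start = System.nanoTime();
                while (backlog.get() >= maxBacklog) {
                    drained.await(100, TimeUnit.MILLISECONDS);
                }
                timeBlockedMs.addAndGet(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
                backlog.incrementAndGet();
            } finally {
                lock.unlock();
            }
        }

        // Called by the asynchronous ZK writer after each successful write.
        public void onZkWriteCompleted() {
            lock.lock();
            try {
                backlog.decrementAndGet();
                drained.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }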
------------

Jun

10) Colin answered this.

11) Yes, that's right -- I'll clarify that.

12) +1

13) Colin answered this.

14) As Colin mentioned, we achieve ZK controller fencing by bumping the controller epoch. I'll clarify this.

15) The text is a bit unclear here. What I meant is that we will return a random broker as the ControllerId in the MetadataResponse sent to clients, but the active KRaft controller would be sent as the ControllerId in the UpdateMetadataRequest. However, Colin raises a good point that we'll need a different code path on the ZK brokers for connecting to a KRaft controller. We should add a check on "controller.quorum.voters" as a readiness check on the ZK brokers (a rough sketch of this check follows after this list).

16) I think we can remove this. It is a remnant of an earlier design without forwarding.
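For 15), here is a rough sketch of the readiness check on a ZK-mode broker. The class name and the exact set of checks are hypothetical (the config names come from the discussion above); it just assumes we validate the config before the broker advertises itself as migration-ready, e.g. in its registration data:

    import java.util.Map;

    // Hypothetical startup check on a ZK-mode broker: before the broker reports
    // itself as ready for the migration, it must have the migration enabled and
    // know how to reach the KRaft quorum via controller.quorum.voters
    // (e.g. "3000@controller1:9093,3001@controller2:9093,3002@controller3:9093").
    public final class MigrationReadinessCheck {

        public static boolean isReadyForMigration(Map<String, String> config) {
            if (!"true".equals(config.get("kafka.metadata.migration.enable"))) {
                return false;   // the migration "arming switch" is off on this broker
            }
            String voters = config.get("controller.quorum.voters");
            if (voters == null || voters.isEmpty()) {
                return false;   // broker cannot route controller-bound RPCs to the KRaft quorum
            }
            for (String voter : voters.split(",")) {
                // Each entry should look like "<nodeId>@<host>:<port>".
                if (!voter.trim().matches("\\d+@[^:]+:\\d+")) {
                    return false;
                }
            }
            return true;
        }
    }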
------------

Colin

1) Yea, I like this. Since we should be able to statically determine our "readiness" on a broker, we can put this in the registration data.

2) It's not strictly required, no. However, it might be useful for the case of explicitly disallowing migrations on a cluster. For example, if someone brought up a KRaft quorum and misconfigured "zk.connect", it could trigger a migration on some existing, unrelated cluster. This would be mostly harmless, but it could cause some trouble for operators. I think of this config as an arming switch for the migration.

3) Sounds good.

4) Sure, sounds good.

5) Yea, agreed. We should fence these brokers -- see my response to Mickael above.

------------

Colin/Jun

As Colin mentioned in his reply, we can react to records in the metadata log, update ZK, and send out RPCs to brokers -- something like an asynchronous listener on the metadata log. Another important detail here is that in the KRaft controller we can work off a consistent in-memory snapshot of the metadata. This is helpful when an operation might take some time (like sending UMR to all the brokers, or making a lot of partition updates in ZK): while we are processing a record, we don't have to worry about the metadata changing out from under us.
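To sketch the shape of this, here is an illustration of the listener-plus-snapshot idea. All of the interface and class names below are hypothetical -- this is not the actual controller API, just how the pieces could fit together:

    import java.util.List;

    // Hypothetical shape of the dual-write path: the KRaft controller hands the
    // listener a batch of committed metadata records together with an immutable,
    // point-in-time image of the metadata, so slow work (ZK writes, UMR/LISR
    // fan-out) always sees a consistent view.
    interface MetadataRecord {}      // a committed record from the metadata log
    interface MetadataImage {}       // immutable snapshot of the metadata at a given offset

    interface MetadataLogListener {
        void handleCommit(long offset, List<MetadataRecord> records, MetadataImage image);
    }

    // Minimal collaborators so the sketch is self-contained.
    interface ZkMigrationWriter {
        void write(List<MetadataRecord> records, MetadataImage image);
    }

    interface ZkBrokerRpcSender {
        void sendUpdateMetadataAndLeaderAndIsr(List<MetadataRecord> records, MetadataImage image);
    }

    final class ZkDualWriteListener implements MetadataLogListener {
        private final ZkMigrationWriter zkWriter;
        private final ZkBrokerRpcSender rpcSender;

        ZkDualWriteListener(ZkMigrationWriter zkWriter, ZkBrokerRpcSender rpcSender) {
            this.zkWriter = zkWriter;
            this.rpcSender = rpcSender;
        }

        @Override
        public void handleCommit(long offset, List<MetadataRecord> records, MetadataImage image) {
            // 1. Mirror the committed records into ZK. Because the image is immutable,
            //    a long-running batch of partition znode updates stays consistent.
            zkWriter.write(records, image);

            // 2. Derive and send UpdateMetadata / LeaderAndIsr requests to the ZK-mode
            //    brokers from that same image.
            rpcSender.sendUpdateMetadataAndLeaderAndIsr(records, image);
        }
    }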
On Thu, Oct 13, 2022 at 2:44 PM Jun Rao <j...@confluent.io.invalid> wrote:
>
> Hi, Colin,
>
> Thanks for the reply.
>
> 10. This is a bit on the implementation side. If you look at the existing ZK-based controller, most of the logic is around maintaining an in-memory state of all the resources (broker, topic, partition, etc.), reading/writing to ZK, sending LeaderAndIsr and UpdateMetadata requests to brokers, and handling the responses. So we need all that logic in the dual write mode. One option is to duplicate all that logic in some new code. This can be a bit error prone and makes the code a bit harder to maintain if we need to fix some critical issues in ZK-based controllers. Another option is to try reusing the existing code in the ZK-based controller. For example, we could start the EventManager in the ZK-based controller, but let the KRaft controller ingest new events. This has its own challenges: (1) the existing logic only logs ZK failures and doesn't expose them to the caller; (2) the existing logic may add new events to the queue itself and we probably need to think through how this is coordinated with the KRaft controller; (3) it registers some ZK listeners unnecessarily (may not be a big concern). So we need to get around those issues somehow. I am wondering if we have considered both options and which approach we are leaning towards for the implementation.
>
> 14. Good point and makes sense.
>
> Thanks,
>
> Jun
>
> On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > Hi Jun,
> >
> > Thanks for taking a look. I can answer some questions here because I collaborated on this a bit, and David is on vacation for a few days.
> >
> > On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> > > Hi, David,
> > >
> > > Thanks for the KIP. A few comments below.
> > >
> > > 10. It's still not very clear to me how the KRaft controller works in the dual-write mode to the KRaft log and ZK when the brokers still run in ZK mode. Does the KRaft controller run a ZK-based controller in parallel, or do we derive what needs to be written to ZK based on KRaft controller logic?
> >
> > We derive what needs to be written to ZK based on KRaft controller logic.
> >
> > > I am also not sure how the KRaft controller handles broker registration/deregistration, since brokers are still running in ZK mode and are not heartbeating to the KRaft controller.
> >
> > The new controller will listen for broker registrations under /brokers. This is the only znode watch that the new controller will do.
> >
> > We did consider changing how ZK-based broker registration worked, but it just ended up being too much work for not enough gain.
> >
> > > 12. "A new set of nodes will be provisioned to host the controller quorum." I guess we don't support starting the KRaft controller quorum on existing brokers. It would be useful to make that clear.
> >
> > Agreed
> >
> > > 13. "Once the quorum is established and a leader is elected, the controller will check the state of the cluster using the MigrationCheck RPC." How does the quorum controller detect other brokers? Does the controller node need to be configured with the ZK connection string? If so, it would be useful to document the additional configs that the quorum controller needs to set.
> >
> > Yes, the controllers monitor ZK for broker registrations, as I mentioned above. So they need zk.connect and the other ZK connection configurations.
> >
> > > 14. "In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller." The ZK controller processes changes to /controller asynchronously. How does the KRaft controller know when the ZK controller has resigned before it can safely copy the ZK data?
> >
> > This should be done through expectedControllerEpochZkVersion, just like in ZK mode, right? We should bump this epoch value so that any writes from the old controller will not go through. I agree we should spell this out in the KIP.
> >
> > > 15. We have the following sentences. One says the ControllerId is a random KRaft broker and the other says it's the active controller. Which one is correct?
> > > "UpdateMetadata: for certain metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. For the "ControllerId" field in this request, the controller should specify a random KRaft broker."
> > > "In the UpdateMetadataRequest sent by the KRaft controller to the ZK brokers, the ControllerId will point to the active controller which will be used for the inter-broker requests."
> >
> > Yeah, this seems like an error to me as well. A random value is not really useful. Plus the text here is self-contradictory, as you pointed out.
> >
> > I suspect what we should do here is add a new field, KRaftControllerId, and populate it with the real controller ID, and leave the old ControllerId field as -1. A ZK-based broker that sees this can then consult its controller.quorum.voters configuration to see where it should send controller-bound RPCs. That (static) configuration lets us map between controller ID and host:port.
> >
> > We should still keep our existing epoch logic for deciding when UpdateMetadataRequest / LeaderAndIsrRequests are stale, with the caveat that any KRaft-based epoch should be treated as greater than any ZK-based epoch. After all, the KRaft epoch is coming from the epoch of __cluster_metadata, whereas the ZK epoch comes from ZK.
> >
> > > 16. "Additionally, the controller must specify if a broker in "LiveBrokers" is KRaft or ZK." Does that require any protocol changes to UpdateMetadata?
> >
> > Yeah, I am also curious why we need to care whether brokers are ZK or KRaft in the UpdateMetadataRequest. We don't reveal this to clients, so can we just leave this out?
> >
> > best,
> > Colin
> >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison <mickael.mai...@gmail.com> wrote:
> > >
> > >> Hi David,
> > >>
> > >> Thanks for starting this important KIP.
> > >>
> > >> I've just taken a quick look so far but I've got a couple of initial questions:
> > >>
> > >> 1) What happens if a non-KRaft-compatible broker (or one with kafka.metadata.migration.enable set to false) joins the cluster after the migration is triggered?
> > >>
> > >> 2) In the Failure Modes section you mention a scenario where a write to ZK fails. What happens when the divergence limit is reached? Is this a fatal condition? How much divergence should we allow?
> > >>
> > >> Thanks,
> > >> Mickael
> > >>
> > >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <mum...@gmail.com> wrote:
> > >> >
> > >> > Hey folks, I wanted to get the ball rolling on the discussion for the ZooKeeper migration KIP. This KIP details how we plan to do an online migration of metadata from ZooKeeper to KRaft as well as a rolling upgrade of brokers to KRaft mode.
> > >> >
> > >> > The general idea is to keep KRaft and ZooKeeper in sync during the migration, so both types of brokers can exist simultaneously. Then, once everything is migrated and updated, we can turn off ZooKeeper writes.
> > >> >
> > >> > This is a pretty complex KIP, so please take a look :)
> > >> >
> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> > >> >
> > >> > Thanks!
> > >> > David

-- 
David Arthur