I've updated the KIP with the following changes (the Confluence diff is not helpful here since I rearranged some things):
* Added ZooKeeperBlockingKRaftMillis metric
* Added section on the new broker registration JSON
* Removed the section on the MigrationCheck RPC
* Added a change to UpdateMetadataRequest
* Added section "Additional ZK Broker Configs" (includes the configs needed to connect to the KRaft quorum)
* Added section on "Incompatible Brokers" under Failure Modes
* Clarified many things per this discussion thread

I realized we need the KRaft controller to pick the correct "metadata.version" when initializing the migration. I included the IBP of a broker in its registration data so the KRaft controller can verify the IBP and pick the correct "metadata.version" when starting the migration. Otherwise, we could inadvertently downgrade the IBP/metadata.version as part of the migration. I also added "clusterId" to the broker registration data so the KRaft controller can verify it.

-David
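(For illustration only: the exact field names below are assumptions, not the KIP's schema, but the registration data a ZK-mode broker writes under /brokers/ids/<id> would carry something along these lines once the clusterId and IBP described above are added.)

    {
      "version": 6,
      "host": "broker0.example.com",
      "port": 9092,
      "endpoints": ["PLAINTEXT://broker0.example.com:9092"],
      "clusterId": "WZEKwK-bS62rsACSz6dvww",
      "inter_broker_protocol_version": "3.4"
    }

This gives the KRaft controller enough information to reject a broker with the wrong clusterId and to pick a metadata.version that matches the brokers' IBP instead of accidentally initializing the migration at a lower one.
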
On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
> > Hi, Colin,
> >
> > 10. "That all goes away in the new mode, and we just have some code which analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2) messages sent out to brokers."
> > Hmm, I am not sure it's that simple. Some of the complexity of the ZK-based controller comes from (1) using the pipelining approach to write to ZK for better throughput and using conditional writes for correctness; (2) sending the proper LeaderAndIsr and UpdateMetadata requests. For example, during controller failover the full metadata needs to be sent, while during an individual broker failure only some of the metadata needs to be updated. The controlled shutdown handling sometimes uses StopReplicaRequest and some other times uses LeaderAndIsrRequest. (3) triggering new events based on the responses to LeaderAndIsr (e.g. for topic deletion). Some of that complexity could be re-implemented in a more efficient way, but we need to be really careful not to introduce regressions. Some of the other complexity just won't go away. Reimplementing all that logic for the 30 or so events in the ZK-based controller is possible, but seems a bit daunting and risky.
>
> Hi Jun,
>
> Thanks for the response. I agree that there is some work here but I don't think it's as bad as it might seem. Most of the code for writing to ZK can be reused from the old controller. We should at least evaluate using the old ControllerChannelManager as well. I don't know if we'll end up doing it or not, but it's a possibility. (The main reason not to do it is that the response handling will be a bit different.)
>
> The question of what to do during a controller failover is an interesting one. Technically, we should be able to continue sending incremental updates to the legacy nodes, for the same reason we can in KRaft mode. However, we should probably just copy what ZK mode does and send them a full metadata update. This will allow us to have an easy way to handle divergences caused by lost RPCs by changing the controller (just as we do in ZK mode, unfortunately). I suppose we should document this in the KIP...
>
> I agree controlled shutdown is tricky to get just right. I suppose this is a case where the RPCs we send out are not purely "fire and forget"; we have to listen for the response. But that can be done in an event-based way. Controlled shutdown is also probably the last thing we'll implement once we have the basic lift and shift.
>
> Topic deletion is another annoying thing. I wonder if we could just delete immediately here and skip the complexity of implementing "deleting state." Topic IDs will exist in IBP 3.4, in both ZK and KRaft mode, so it should be OK, right?
>
> best,
> Colin
>
> > Thanks,
> >
> > Jun
> >
> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
> >> > Hi, Colin,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 10. This is a bit on the implementation side. If you look at the existing ZK-based controller, most of the logic is around maintaining an in-memory state of all the resources (broker, topic, partition, etc.), reading/writing to ZK, sending LeaderAndIsr and UpdateMetadata requests, and handling the responses from brokers. So we need all that logic in the dual write mode. One option is to duplicate all that logic in some new code. This can be a bit error prone and makes the code a bit harder to maintain if we need to fix some critical issues in ZK-based controllers. Another option is to try reusing the existing code in the ZK-based controller. For example, we could start the EventManager in the ZK-based controller, but let the KRaft controller ingest new events. This has its own challenges: (1) the existing logic only logs ZK failures and doesn't expose them to the caller; (2) the existing logic may add new events to the queue itself and we probably need to think through how this is coordinated with the KRaft controller; (3) it registers some ZK listeners unnecessarily (may not be a big concern). So we need to get around those issues somehow. I am wondering if we have considered both options and which approach we are leaning towards for the implementation.
> >> >
> >>
> >> Yes, this is a good question. My take is that a big part of the complexity in the old controller code results from the fact that we use ZK as a multi-writer database for propagating information between different components. So in the old controller, every write to ZK needs to be structured as a compare and swap to be fully correct. Every time we get notified about something, it's usually in the form of "this znode changed", which prompts a full reload of part of the data in ZK (which itself has multiple parts: loading, deserializing, reconciling, etc.). That all goes away in the new mode, and we just have some code which analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2) messages sent out to brokers.
> >>
> >> This is pretty decoupled from the other logic in QuorumController and should be easy to unit test, since the same inputs from the log always produce the same output in ZK. Basically, ZK is write-only for us; we do not read it (with the big exception of broker registration znodes) and I think that will greatly simplify things.
> >>
> >> So I think dual-write mode as described here will be substantially simpler than trying to run part or all of the old controller in parallel. I do think we will reuse a bunch of the serialization / deserialization code for znodes and possibly the code for communicating with ZK.
> >>
> >> best,
> >> Colin
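(An illustrative aside, not from the KIP or the thread: the "conditional writes" and "compare and swap" mentioned above are the ZooKeeper versioned-write pattern, roughly the following when sketched against the plain ZooKeeper client. The path and error handling are placeholders, not Kafka's actual controller code.)

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalZkWrite {
        // Sketch: update a znode only if it is still at the version we last read.
        static boolean compareAndSwap(ZooKeeper zk, String path, byte[] newData)
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(path, false, stat); // read current data, capture its version
            try {
                // setData with an expected version fails if anyone wrote in between,
                // which is how a deposed controller's stale writes get rejected.
                zk.setData(path, newData, stat.getVersion());
                return true;
            } catch (KeeperException.BadVersionException e) {
                return false; // lost the race; caller must re-read and decide what to do
            }
        }
    }

The expectedControllerEpochZkVersion check that comes up later in the thread is the same kind of guarded write, applied against the controller epoch znode's version.
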
> >> > 14. Good point and make sense.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org> wrote:
> >> >
> >> >> Hi Jun,
> >> >>
> >> >> Thanks for taking a look. I can answer some questions here because I collaborated on this a bit, and David is on vacation for a few days.
> >> >>
> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> >> >> > Hi, David,
> >> >> >
> >> >> > Thanks for the KIP. A few comments below.
> >> >> >
> >> >> > 10. It's still not very clear to me how the KRaft controller works in the dual-writes mode to the KRaft log and ZK when the brokers still run in ZK mode. Does the KRaft controller run a ZK-based controller in parallel, or do we derive what needs to be written to ZK based on KRaft controller logic?
> >> >>
> >> >> We derive what needs to be written to ZK based on KRaft controller logic.
> >> >>
> >> >> > I am also not sure how the KRaft controller handles broker registration/deregistration, since brokers are still running in ZK mode and are not heartbeating to the KRaft controller.
> >> >>
> >> >> The new controller will listen for broker registrations under /brokers. This is the only znode watch that the new controller will do.
> >> >>
> >> >> We did consider changing how ZK-based broker registration worked, but it just ended up being too much work for not enough gain.
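(Purely illustrative, not the KIP's implementation: watching ZK-mode broker registrations from the KRaft controller could look roughly like this with the raw ZooKeeper client. A real controller would hand the notification to its event queue instead of re-reading inline, and would also read each child znode for the registration payload.)

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class BrokerRegistrationWatch {
        // Sketch: the only watch the migrating controller needs is on /brokers/ids.
        static void watchBrokerIds(ZooKeeper zk) throws KeeperException, InterruptedException {
            Watcher watcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    try {
                        // ZooKeeper watches fire once, so re-register and re-read each time.
                        watchBrokerIds(zk);
                    } catch (Exception e) {
                        // A real controller would surface this as an event, not swallow it.
                    }
                }
            };
            List<String> brokerIds = zk.getChildren("/brokers/ids", watcher);
            System.out.println("ZK-mode brokers currently registered: " + brokerIds);
        }
    }
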
> >> >> > 12. "A new set of nodes will be provisioned to host the controller quorum." I guess we don't support starting the KRaft controller quorum on existing brokers. It would be useful to make that clear.
> >> >>
> >> >> Agreed
> >> >>
> >> >> > 13. "Once the quorum is established and a leader is elected, the controller will check the state of the cluster using the MigrationCheck RPC." How does the quorum controller detect other brokers? Does the controller node need to be configured with the ZK connection string? If so, it would be useful to document the additional configs that the quorum controller needs to set.
> >> >>
> >> >> Yes, the controllers monitor ZK for broker registrations, as I mentioned above. So they need zk.connect and the other ZK connection configurations.
> >> >>
> >> >> > 14. "In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller." The ZK controller processes changes to /controller asynchronously. How does the KRaft controller know when the ZK controller has resigned before it can safely copy the ZK data?
> >> >>
> >> >> This should be done through expectedControllerEpochZkVersion, just like in ZK mode, right? We should bump this epoch value so that any writes from the old controller will not go through. I agree we should spell this out in the KIP.
> >> >>
> >> >> > 15. We have the following sentences. One says the ControllerId is a random KRaft broker and the other says it's the active controller. Which one is correct?
> >> >> > "UpdateMetadata: for certain metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. For the "ControllerId" field in this request, the controller should specify a random KRaft broker."
> >> >> > "In the UpdateMetadataRequest sent by the KRaft controller to the ZK brokers, the ControllerId will point to the active controller which will be used for the inter-broker requests."
> >> >>
> >> >> Yeah, this seems like an error to me as well. A random value is not really useful. Plus the text here is self-contradictory, as you pointed out.
> >> >>
> >> >> I suspect what we should do here is add a new field, KRaftControllerId, and populate it with the real controller ID, and leave the old controllerId field as -1. A ZK-based broker that sees this can then consult its controller.quorum.voters configuration to see where it should send controller-bound RPCs. That (static) configuration lets us map between controller ID and host:port.
> >> >>
> >> >> We should still keep our existing epoch logic for deciding when UpdateMetadataRequests / LeaderAndIsrRequests are stale, with the caveat that any KRaft-based epoch should be treated as greater than any ZK-based epoch. After all, the KRaft epoch is coming from the epoch of __cluster_metadata, whereas the ZK epoch comes from ZK.
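(To make the ID-to-address mapping concrete, an aside with made-up node IDs and hostnames: a ZK-mode broker participating in the migration would carry the same static quorum config that KRaft nodes use, e.g.)

    controller.quorum.voters=3000@controller0.example.com:9093,3001@controller1.example.com:9093,3002@controller2.example.com:9093

So an UpdateMetadataRequest carrying a hypothetical KRaftControllerId of 3001 would tell the broker to send its controller-bound RPCs to controller1.example.com:9093, and, per the epoch rule above, any UpdateMetadataRequest or LeaderAndIsrRequest still arriving from the old ZK controller would be treated as stale.
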
> >> >> > 16. "Additionally, the controller must specify if a broker in "LiveBrokers" is KRaft or ZK." Does that require any protocol changes to UpdateMetadata?
> >> >>
> >> >> Yeah, I am also curious why we need to care whether brokers are ZK or KRaft in UpdateMetadataRequest. We don't reveal this to clients, so can we just leave this out?
> >> >>
> >> >> best,
> >> >> Colin
> >> >>
> >> >> > Thanks,
> >> >> >
> >> >> > Jun
> >> >> >
> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison <mickael.mai...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi David,
> >> >> >>
> >> >> >> Thanks for starting this important KIP.
> >> >> >>
> >> >> >> I've just taken a quick look so far, but I've got a couple of initial questions:
> >> >> >>
> >> >> >> 1) What happens if a non-KRaft-compatible broker (or one with kafka.metadata.migration.enable set to false) joins the cluster after the migration is triggered?
> >> >> >>
> >> >> >> 2) In the Failure Modes section you mention a scenario where a write to ZK fails. What happens when the divergence limit is reached? Is this a fatal condition? How much divergence should we allow?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Mickael
> >> >> >>
> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <mum...@gmail.com> wrote:
> >> >> >> >
> >> >> >> > Hey folks, I wanted to get the ball rolling on the discussion for the ZooKeeper migration KIP. This KIP details how we plan to do an online migration of metadata from ZooKeeper to KRaft as well as a rolling upgrade of brokers to KRaft mode.
> >> >> >> >
> >> >> >> > The general idea is to keep KRaft and ZooKeeper in sync during the migration, so both types of brokers can exist simultaneously. Then, once everything is migrated and updated, we can turn off ZooKeeper writes.
> >> >> >> >
> >> >> >> > This is a pretty complex KIP, so please take a look :)
> >> >> >> >
> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> >> >> >> >
> >> >> >> > Thanks!
> >> >> >> > David

--
-David