Hi, David, Thanks for the reply.
20/21. When KRaft runs in the combined mode, does a controller know whether an ApiRequest is from a client or another broker? 40. Adding a "None" state sounds reasonable. Thanks, Jun On Thu, Nov 3, 2022 at 8:39 AM David Arthur <david.art...@confluent.io.invalid> wrote: > Jun, > > 20/21 If we use a tagged field, then I don't think clients need to be > concerned with it, right?. In ApiVersionsResponse sent by brokers to > clients, this field would be omitted. Clients can't connect directly to the > KRaft controller nodes. Also, we have a precedent of sending controller > node state between controllers through ApiVersions ("metadata.version" > feature), so I think it fits well. > > 40. For new KRaft clusters, we could add another state to indicate it was > not migrated from ZK, but created as a KRaft cluster. Maybe > "kafka.controller:type=KafkaController,name=MigrationState" => "None" ? We > could also omit that metric for unmigrated clusters, but I'm not a fan of > using the absence of a value to signal something. > > ----- > > Akhilesh, thanks for reviewing the KIP! > > 1. MigrationState and MetadataType are mostly the same on the controller, > but we do have the "MigratingZkData" state that seems useful to report as a > metric. Aside from looking at logs, observing the controller in this state > is the only way to see how long its taking to copy data from ZK. > > "KRaftMode" instead of "MigrationFinalized" is similar to Jun's question > about non-migrated clusters. I think it's useful to have a distinct > MigrationState for clusters that have been migrated and those that were > never migrated. This does mean we'll report the MigrationState long after > the migration is complete, but we can drop these metrics in 4.0 once ZK is > removed. > > 2. The "ZkMigrationReady" will indicate that the controller has > "kafka.metadata.migration.enable" _and_ the ZK configs set. We need some > way to indicate that the whole quorum is correctly configured to handle the > migration so we don't failover to a controller that's not configured for > ZK. Did I understand your question correctly? > > 3. Yea, good idea. While the KRaft controller has > MigrationState=MigrationIneligible, we could also report > "kafka.controller:type=KafkaController,name=MigrationInelgibleBrokerCount". > It might be useful to report ineligible controllers as well since that can > prevent the migration from starting. > > 4. I think I covered this in "Incompatible Brokers". We effectively fence > these brokers by not sending them metadata RPCs. > > Thanks! > David > > > On Tue, Nov 1, 2022 at 3:58 PM Akhilesh Chaganti <akhilesh....@gmail.com> > wrote: > > > Hi David, > > > > > > Thanks for the KIP. I have some questions/suggestions. > > > > > > 1) I see two new metrics: > > kafka.controller:type=KafkaController,name=MetadataType and > > kafka.controller:type=KafkaController,name=MigrationState. Won't the > second > > metric already cover the cases of the first metric? Also, instead of > > MigrationFinalized, we could directly say the state is KRaftMode. So we > can > > use the same value for default KRaft clusters. > > > > > > 2) ZkMigrationReady in ApiVersionsResponse from KRaft Controller. By > > default, we plan to start the Controller quorum in " > > *kafka.metadata.migration.enable*" config set to true. Then do we need > this > > additional information again to make sure The controllers are ready for > > migration? What would happen if the Controller assumes it is ready for > > migration from 3.4 by default if it doesn't see both MigrationMetadata > > records? > > > > > > 3) I see that we do not impose order on rolling the brokers with > migration > > flags and provisioning the controller quorum. Along with the KRaft > > controller emitting migration state metrics, it may be better to emit the > > broker count for the brokers not ready for migration yet. This will give > us > > more insight into any roll issues. > > > > > > 4) Once the KRaft controller is in migration mode, we should also > > prevent/handle ZkBrokerRegistrations that don't enable migration mode. > > > > > > Thanks > > Akhilesh > > > > > > On Tue, Nov 1, 2022 at 10:49 AM Jun Rao <j...@confluent.io.invalid> > wrote: > > > > > Hi, David, > > > > > > Thanks for the reply. > > > > > > 20/21. Regarding the new ZkMigrationReady field in ApiVersionsResponse, > > it > > > seems that this is a bit intrusive since it exposes unneeded info to > the > > > clients. Another option is to add that field as part of the Fetch > > request. > > > We can choose to only set that field in the very first Fetch request > > from a > > > Quorum follower. > > > > > > 40. For kafka.controller:type=KafkaController,name=MigrationState, what > > is > > > the value for a brand new KRaft cluster? > > > > > > Jun > > > > > > On Mon, Oct 31, 2022 at 2:35 PM David Arthur > > > <david.art...@confluent.io.invalid> wrote: > > > > > > > 30. I think we can keep the single ControllerId field in those > requests > > > > since they are only used for fencing (as far as I know). Internally, > > the > > > > broker components that handle those requests will compare the > > > ControllerId > > > > with that of MetadataCache (which is updated via UMR). > > > > > > > > The reason we need the separate KRaftControllerId in the > UpdateMetadata > > > > code path so that we can have different connection behavior for a > KRaft > > > > controller vs ZK controller. > > > > > > > > 31. It seems reasonable to keep the MigrationRecord in the snapshot. > I > > > was > > > > thinking the same thing in terms of understanding the loss for a > > > > migration-after-finalization. However, once a snapshot has been taken > > > that > > > > includes the final MigrationRecord, we can't easily see which records > > > came > > > > after it. > > > > > > > > 32. You're correct, we can just use the modify time from the Stat. > The > > > > other two fields are primarily informational and are there for > > operators > > > > who want to inspect the state of the migration. They aren't required > > for > > > > correctness > > > > > > > > 33. Yes that's right. I detail that in "Controller Leadership" > section > > > > > > > > 34. Right, I'll fix that. > > > > > > > > Thanks, > > > > David > > > > > > > > On Mon, Oct 31, 2022 at 2:55 PM Jun Rao <j...@confluent.io.invalid> > > > wrote: > > > > > > > > > Hi, David, > > > > > > > > > > Thanks for the updated KIP. A few more comments. > > > > > > > > > > 30. LeaderAndIsrRequest/StopReplicaRequest both have a controllerId > > > > field. > > > > > Should we add a KRaftControllerId field like UpdateMetadata? > > > > > > > > > > 31. "If a migration has been finalized, but the KRaft quroum comes > up > > > > with > > > > > kafka.metadata.migration.enable, we must not re-enter the migration > > > mode. > > > > > In this case, while replaying the log, the controller can see the > > > second > > > > > MigrationRecord and know that the migration is finalized and should > > not > > > > be > > > > > resumed. " Hmm, do we want to keep the MigrationRecord in the > > snapshot > > > > and > > > > > the metadata log forever after migration is finalized? If not, we > > can't > > > > > know for sure whether a migration has happened or not. Also, it > might > > > be > > > > > useful to support switching back to ZK mode after the migration is > > > > > finalized, with the understanding of potential metadata loss. In > that > > > > case, > > > > > we could just trim all metadata log and recopy the ZK metadata > back. > > > > > > > > > > 32. The /migration node in ZK: Do we need last_update_time_ms since > > ZK > > > > > Stats already has an MTime? Also, how do we plan to use the > > > > > kraft_controller_id and kraft_controller_epoch fields? > > > > > > > > > > 33. Controller migration: We will force a write to the > "/controller" > > > and > > > > > "/controller_epoch" ZNodes before copying ZK data, right? > > > > > > > > > > 34. "Operator can remove the persistent "/controller" and > > > > > "/controller_epoch" nodes allowing for ZK controller election to > take > > > > > place". I guess the operator only needs to remove the /controller > > path? > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Mon, Oct 31, 2022 at 7:17 AM David Arthur > > > > > <david.art...@confluent.io.invalid> wrote: > > > > > > > > > > > Happy Monday, everyone! I've updated the KIP with the following > > > > changes: > > > > > > > > > > > > * Clarified MetadataType metric usages (broker vs controller) > > > > > > * Added ZkMigrationReady tagged field to ApiVersionsResponse (for > > use > > > > by > > > > > > KRaft controller quorum) > > > > > > * Added MigrationRecord with two states: Started and Finished > > > > > > * Documented ZK configs for KRaft controller > > > > > > * Simplified state machine description (internally, more states > > will > > > > > exist, > > > > > > but only the four documented are interesting to operators) > > > > > > * Clarified some things in Controller Migration section > > > > > > * Removed KRaft -> ZK parts of Broker Registration > > > > > > * Added Misconfigurations section to Failure Modes > > > > > > > > > > > > Let me know if I've missed anything from the past two weeks of > > > > > discussion. > > > > > > > > > > > > Thanks again to everyone who has reviewed this KIP so far! > > > > > > > > > > > > -David > > > > > > > > > > > > On Fri, Oct 28, 2022 at 2:26 PM Jun Rao <j...@confluent.io.invalid > > > > > > > wrote: > > > > > > > > > > > > > Hi, David, > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > 20/21. Sounds good. > > > > > > > > > > > > > > Could you update the doc with all the changes being discussed? > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > On Fri, Oct 28, 2022 at 10:11 AM David Arthur > > > > > > > <david.art...@confluent.io.invalid> wrote: > > > > > > > > > > > > > > > Jun, > > > > > > > > > > > > > > > > 20/21. I was also wondering about a "migration" record. In > > > addition > > > > > to > > > > > > > the > > > > > > > > scenario you mentioned, we also need a way to prevent the > > cluster > > > > > from > > > > > > > > re-entering the dual write mode after the migration has been > > > > > > finalized. I > > > > > > > > could see this happening inadvertently via a change in some > > > > > > configuration > > > > > > > > management system. How about we add a record that marks the > > > > beginning > > > > > > and > > > > > > > > end of the dual-write mode. The first occurrence of the > record > > > > could > > > > > be > > > > > > > > included in the metadata transaction when we migrate data > from > > > ZK. > > > > > > > > > > > > > > > > With this, the active controller would decide whether to > enter > > > dual > > > > > > write > > > > > > > > mode, finalize the migration based, or fail based on: > > > > > > > > > > > > > > > > * Metadata log state > > > > > > > > * It's own configuration ("kafka.metadata.migration.enable", > > > > > > > > "zookeeper.connect", etc) > > > > > > > > * The other controllers configuration (via > ApiVersionsResponse) > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > 22. Since we will need the fencing anyways as a safe-guard, > > then > > > I > > > > > > agree > > > > > > > > would could skip the registration of KRaft brokers in ZK to > > > simply > > > > > > > things a > > > > > > > > bit. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 27, 2022 at 5:11 PM Jun Rao > > <j...@confluent.io.invalid > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, David, > > > > > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > > > > > 20/21. Relying upon the presence of ZK configs to determine > > > > whether > > > > > > the > > > > > > > > > KRaft controller is in a dual write mode seems a bit error > > > prone. > > > > > If > > > > > > > > > someone accidentally adds a ZK configuration to a brand new > > > KRaft > > > > > > > > cluster, > > > > > > > > > ideally it shouldn't cause the controller to get into a > weird > > > > > state. > > > > > > > Have > > > > > > > > > we considered storing the migration state in a metadata > > record? > > > > > > > > > > > > > > > > > > 22. If we have the broker fencing logic, do we need to > write > > > the > > > > > > broker > > > > > > > > > registration path in ZK for KRaft brokers at all? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 27, 2022 at 1:02 PM David Arthur > > > > > > > > > <david.art...@confluent.io.invalid> wrote: > > > > > > > > > > > > > > > > > > > Jun, > > > > > > > > > > > > > > > > > > > > 20/21. A KRaft controller will recover the migration > state > > by > > > > > > reading > > > > > > > > the > > > > > > > > > > "/migration" ZNode. If the migration enable config is > set, > > > and > > > > > the > > > > > > ZK > > > > > > > > > > migration is complete, it will enter the dual-write mode. > > > > Before > > > > > an > > > > > > > > > > operator can decommission ZK, they will need to finalize > > the > > > > > > > migration > > > > > > > > > > which involves removing the migration config and the ZK > > > config. > > > > > > I'll > > > > > > > > > > clarify this in the KIP. > > > > > > > > > > > > > > > > > > > > 22. Yea, we could see an incorrect broker ID during that > > > > window. > > > > > > If > > > > > > > we > > > > > > > > > > ended up with a state where we saw a ZK broker ID that > > > > conflicted > > > > > > > with > > > > > > > > a > > > > > > > > > > KRaft broker ID, we would need to fence one of them. I > > would > > > > > > probably > > > > > > > > opt > > > > > > > > > > to fence the KRaft broker in that case since broker > > > > registration > > > > > > and > > > > > > > > > > fencing is more robust in KRaft. Hopefully this is a rare > > > case. > > > > > > > > > > > > > > > > > > > > 26. Sounds good. > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 27, 2022 at 1:34 PM Jun Rao > > > > <j...@confluent.io.invalid > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi, David, > > > > > > > > > > > > > > > > > > > > > > Thanks for the reply. A few more comments. > > > > > > > > > > > > > > > > > > > > > > 20/21. Using a tagged field in ApiVersionRequest could > > > work. > > > > > > > Related > > > > > > > > to > > > > > > > > > > > this, how does a KRaft controller know that it's in the > > > dual > > > > > > write > > > > > > > > > mode? > > > > > > > > > > > Does it need to read the /controller path from ZK? > After > > > the > > > > > > > > migration, > > > > > > > > > > > people may have the ZK cluster decommissioned, but > still > > > have > > > > > the > > > > > > > ZK > > > > > > > > > > > configs left in the KRaft controller. Will this cause > the > > > > KRaft > > > > > > > > > > controller > > > > > > > > > > > to be stuck because it doesn't know which mode it is > in? > > > > > > > > > > > > > > > > > > > > > > 22. Using the ephemeral node matches the current > ZK-based > > > > > broker > > > > > > > > > behavior > > > > > > > > > > > better. However, it leaves a window for incorrect > broker > > > > > > > registration > > > > > > > > > to > > > > > > > > > > > sneak in during KRaft controller failover. > > > > > > > > > > > > > > > > > > > > > > 26. Then, we could just remove Broker Registration in > > that > > > > > > section. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 26, 2022 at 2:21 PM David Arthur > > > > > > > > > > > <david.art...@confluent.io.invalid> wrote: > > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > > > 20/21 It could definitely cause problems if we > failover > > > to > > > > a > > > > > > > > > controller > > > > > > > > > > > > without "kafka.metadata.migration.enable". The only > > > > > mechanism I > > > > > > > > know > > > > > > > > > of > > > > > > > > > > > for > > > > > > > > > > > > controllers to learn things about one another is > > > > ApiVersions. > > > > > > We > > > > > > > > > > > currently > > > > > > > > > > > > use this for checking support for "metadata.version" > > (in > > > > > KRaft > > > > > > > > mode). > > > > > > > > > > We > > > > > > > > > > > > could add a "zk.migration" feature flag that's > enabled > > > on a > > > > > > > > > controller > > > > > > > > > > > only > > > > > > > > > > > > if the config is set. Another possibility would be a > > > tagged > > > > > > field > > > > > > > > on > > > > > > > > > > > > ApiVersionResponse that indicated if the config was > > set. > > > > Both > > > > > > > seem > > > > > > > > > > > somewhat > > > > > > > > > > > > inelegant. I think a tagged field would be a bit > > simpler > > > > (and > > > > > > > > > arguably > > > > > > > > > > > less > > > > > > > > > > > > hacky). > > > > > > > > > > > > > > > > > > > > > > > > For 20, we could avoid entering the migration state > > > unless > > > > > the > > > > > > > > whole > > > > > > > > > > > quorum > > > > > > > > > > > > had the field present in their NodeApiVersions. For > 21, > > > we > > > > > > could > > > > > > > > > avoid > > > > > > > > > > > > leaving the migration state unless the whole quorum > did > > > not > > > > > > have > > > > > > > > the > > > > > > > > > > > field > > > > > > > > > > > > in their NodeApiVersions. Do you think this would be > > > > > > sufficient? > > > > > > > > > > > > > > > > > > > > > > > > 22. Right, we need to write the broker info back to > ZK > > > just > > > > > as > > > > > > a > > > > > > > > > > > safeguard > > > > > > > > > > > > against incorrect broker IDs getting registered into > > ZK. > > > I > > > > > was > > > > > > > > > thinking > > > > > > > > > > > > these would be persistent nodes, but it's probably > fine > > > to > > > > > make > > > > > > > > them > > > > > > > > > > > > ephemeral and have the active KRaft controller keep > > them > > > up > > > > > to > > > > > > > > date. > > > > > > > > > > > > > > > > > > > > > > > > 23. Right. When the broker comes up as a KRaft > broker, > > > it's > > > > > old > > > > > > > > > > > > /brokers/ids ZNode will be gone and it will register > > > itself > > > > > > with > > > > > > > > the > > > > > > > > > > > KRaft > > > > > > > > > > > > controller. The controller will know that it is now > in > > > > KRaft > > > > > > mode > > > > > > > > and > > > > > > > > > > > will > > > > > > > > > > > > stop sending it the old RPCs. > > > > > > > > > > > > > > > > > > > > > > > > 24. Ok, I'll add these > > > > > > > > > > > > > > > > > > > > > > > > 25. I realize now the "/controller_epoch" node is > > already > > > > > > > > persistent. > > > > > > > > > > It > > > > > > > > > > > > should be sufficient to remove the "/controller" node > > to > > > > > > trigger > > > > > > > an > > > > > > > > > > > > election. > > > > > > > > > > > > > > > > > > > > > > > > 26. Hmm, not sure, but I don't think it uses watches. > > > > > > KafkaServer > > > > > > > > > > > registers > > > > > > > > > > > > the broker info and later loads all brokers in the > > > cluster > > > > to > > > > > > > check > > > > > > > > > > that > > > > > > > > > > > > the endpoints don't conflict. Things that do use > > watches > > > > are > > > > > > > > dynamic > > > > > > > > > > > > configs, ACLs, and some others (listed in the KIP). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 26, 2022 at 4:26 PM David Arthur < > > > > > > > > > > david.art...@confluent.io> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Luke and Andrew, thanks for taking a look! > > > > > > > > > > > > > > > > > > > > > > > > > > I think the names of the state machine in the KIP > > > aren't > > > > > the > > > > > > > > best. > > > > > > > > > > I'll > > > > > > > > > > > > > try to improve that section in general. > > > > > > > > > > > > > > > > > > > > > > > > > > 1. "MigrationReady" is probably better named > > > > > > "MigratingFromZk" > > > > > > > or > > > > > > > > > > > > > something. It's meant to be the state when the > KRaft > > > > > > controller > > > > > > > > is > > > > > > > > > > > > actively > > > > > > > > > > > > > migrating the data out of ZK. This happens after we > > > have > > > > > > > detected > > > > > > > > > > that > > > > > > > > > > > > the > > > > > > > > > > > > > cluster is eligible and before we enter the dual > > write > > > > > mode. > > > > > > > > > > > > > > > > > > > > > > > > > > 2. "MigrationActive" should probably be called > > > > > > > "BrokerMigration". > > > > > > > > > > > > > Transitioning to "MigrationFinished" can happen > > > > > automatically > > > > > > > > when > > > > > > > > > > all > > > > > > > > > > > > the > > > > > > > > > > > > > known ZK brokers have been migrated. Since we don't > > > have > > > > > > > > permanent > > > > > > > > > > > > > registrations of ZK brokers, we can use the > partition > > > > > > > assignments > > > > > > > > > as > > > > > > > > > > a > > > > > > > > > > > > > proxy for this. > > > > > > > > > > > > > > > > > > > > > > > > > > 3. A metric for unready brokers makes sense. We can > > > also > > > > > log > > > > > > a > > > > > > > > > > message > > > > > > > > > > > on > > > > > > > > > > > > > the controller when it tries to start the migration > > > > > > > > > > > > > > > > > > > > > > > > > > 4. The MetadataType metric is meant to help see > this. > > > > E.g., > > > > > > > some > > > > > > > > > > > brokers > > > > > > > > > > > > > would have MetadataType=ZK, some would have > > > > > MetadataType=Dual > > > > > > > > > > > > > > > > > > > > > > > > > > 5. On ZK brokers, not having this config set would > > > > prevent > > > > > > the > > > > > > > > new > > > > > > > > > > > broker > > > > > > > > > > > > > registration data from being written to ZK. This > > means > > > > the > > > > > > > KRaft > > > > > > > > > > > > controller > > > > > > > > > > > > > won't send RPCs to it. If a broker has been > migrated > > to > > > > > KRaft > > > > > > > > > > already, > > > > > > > > > > > > I'm > > > > > > > > > > > > > not sure there is any harm in removing this config. > > If > > > we > > > > > > > decide > > > > > > > > > that > > > > > > > > > > > we > > > > > > > > > > > > > need to guarantee that KRaft brokers have this > config > > > set > > > > > > > during > > > > > > > > > the > > > > > > > > > > > > > migration, we can include it in the broker > > registration > > > > > > that's > > > > > > > > sent > > > > > > > > > > to > > > > > > > > > > > > > KRaft. This would let the controller keep that > broker > > > > > fenced. > > > > > > > > > > > > > > > > > > > > > > > > > > 6. Once the KRaft controller takes leadership, the > ZK > > > > > > > controller > > > > > > > > > > won't > > > > > > > > > > > be > > > > > > > > > > > > > active any more and will stop reporting the > > > > MigrationState > > > > > > > > metric. > > > > > > > > > > > > > > > > > > > > > > > > > > 7. The "Dual" MetadataType is reported by brokers > > > running > > > > > in > > > > > > > > KRaft > > > > > > > > > > mode > > > > > > > > > > > > > when the migration enable config is set. I think > the > > > > > > controller > > > > > > > > > > should > > > > > > > > > > > > also > > > > > > > > > > > > > report this, not just the brokers. I'll clarify in > > the > > > > KIP. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Andrew: > > > > > > > > > > > > > > > > > > > > > > > > > > > How will the code get all the ZooKeeper config? > > Will > > > it > > > > > do > > > > > > > some > > > > > > > > > > sort > > > > > > > > > > > of > > > > > > > > > > > > > scan of the ZooKeeper data store? ... Do we need to > > do > > > > > > anything > > > > > > > > > > special > > > > > > > > > > > > to > > > > > > > > > > > > > make sure we get all data from ZooKeeper? > > > > > > > > > > > > > > > > > > > > > > > > > > With a few exceptions, all data written to ZK by > > Kafka > > > > > > happens > > > > > > > on > > > > > > > > > the > > > > > > > > > > > > > controller (single writer principle). In our > > migration > > > > > code, > > > > > > we > > > > > > > > can > > > > > > > > > > > > > unconditionally update the "/controller" and > > > > > > > "/controller_epoch" > > > > > > > > > > ZNodes > > > > > > > > > > > > and > > > > > > > > > > > > > effectively force the migration component to gain > > > > > leadership > > > > > > of > > > > > > > > the > > > > > > > > > > ZK > > > > > > > > > > > > > controller. Once this happens, we don't expect any > > data > > > > to > > > > > be > > > > > > > > > written > > > > > > > > > > > to > > > > > > > > > > > > > ZK, so we can read it iteratively without worrying > > > about > > > > > any > > > > > > > > > > concurrent > > > > > > > > > > > > > writes. > > > > > > > > > > > > > > > > > > > > > > > > > > As for the linearizable thing, I believe this just > > > means > > > > > that > > > > > > > > reads > > > > > > > > > > may > > > > > > > > > > > > be > > > > > > > > > > > > > served by a quorum follower which has stale data. > > We'll > > > > be > > > > > > > > reading > > > > > > > > > > from > > > > > > > > > > > > ZK > > > > > > > > > > > > > the same way the ZK controller does, so I think we > > will > > > > be > > > > > > fine > > > > > > > > > > > regarding > > > > > > > > > > > > > consistency. If we wanted to be extra careful, we > > could > > > > > add a > > > > > > > > delay > > > > > > > > > > > prior > > > > > > > > > > > > > to iterating through the znodes to give partitioned > > ZK > > > > > > > followers > > > > > > > > a > > > > > > > > > > > chance > > > > > > > > > > > > > to get kicked out of the quorum. I don't think > we'll > > > need > > > > > > that > > > > > > > > > though > > > > > > > > > > > > > > > > > > > > > > > > > > > What happens if we commit the transaction then > fail > > > > right > > > > > > > after > > > > > > > > > > that > > > > > > > > > > > > and > > > > > > > > > > > > > restart? > > > > > > > > > > > > > > > > > > > > > > > > > > If we commit the "migration" transaction to the > > > metadata > > > > > log, > > > > > > > it > > > > > > > > > > won't > > > > > > > > > > > be > > > > > > > > > > > > > a problem if we failover or restart. We can recover > > our > > > > > state > > > > > > > > based > > > > > > > > > > on > > > > > > > > > > > > the > > > > > > > > > > > > > metadata log and what exists in ZK. If we see a > > > migration > > > > > > > > > transaction > > > > > > > > > > > in > > > > > > > > > > > > > the kraft log, but no "/migration" ZNode, we'll > know > > we > > > > > > failed > > > > > > > > > before > > > > > > > > > > > > > writing to ZK. If we see a partial transaction in > the > > > > log, > > > > > > then > > > > > > > > we > > > > > > > > > > can > > > > > > > > > > > > > abort it and restart the migration. > > > > > > > > > > > > > > > > > > > > > > > > > > > This sort of leads to me wondering if we > > will/should > > > > > check > > > > > > > the > > > > > > > > > > > metadata > > > > > > > > > > > > > log state before doing the migration? That could > also > > > > catch > > > > > > > > > mistakes > > > > > > > > > > > such > > > > > > > > > > > > > as if a KRaft quorum is used that already had some > > > > metadata > > > > > > > > which I > > > > > > > > > > > > assume > > > > > > > > > > > > > we want to prevent. > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, we should check that the metadata log is empty > > > > before > > > > > > > > > > attempting a > > > > > > > > > > > > > migration (perhaps excepting the bootstrap metadata > > -- > > > > need > > > > > > to > > > > > > > > > think > > > > > > > > > > on > > > > > > > > > > > > > that one still). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 24, 2022 at 8:57 AM Andrew Grant > > > > > > > > > > > <agr...@confluent.io.invalid > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > >> Hey David, > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> Thanks for the KIP. I had a few small questions. > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> “The ZK data migration will copy the existing ZK > > data > > > > into > > > > > > the > > > > > > > > > KRaft > > > > > > > > > > > > >> metadata log and establish the new KRaft active > > > > controller > > > > > > as > > > > > > > > the > > > > > > > > > > > active > > > > > > > > > > > > >> controller from a ZK perspective.” > > > > > > > > > > > > >> > > > > > > > > > > > > >> How will the code get all the ZooKeeper config? > Will > > > it > > > > do > > > > > > > some > > > > > > > > > sort > > > > > > > > > > > of > > > > > > > > > > > > >> scan of the ZooKeeper data store? Also I’m not a > > > > ZooKeeper > > > > > > > > expert > > > > > > > > > > but > > > > > > > > > > > > per > > > > > > > > > > > > >> > > > > > > > > > > https://zookeeper.apache.org/doc/current/zookeeperInternals.html > > > > > > > > > > > “Read > > > > > > > > > > > > >> operations in ZooKeeper are *not linearizable* > since > > > > they > > > > > > can > > > > > > > > > return > > > > > > > > > > > > >> potentially stale data.” Do we need to do anything > > > > special > > > > > > to > > > > > > > > make > > > > > > > > > > > sure > > > > > > > > > > > > we > > > > > > > > > > > > >> get all data from ZooKeeper? > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> “For the initial migration, the controller will > > > utilize > > > > > > > KIP-868 > > > > > > > > > > > Metadata > > > > > > > > > > > > >> Transactions > > > > > > > > > > > > >> < > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-868+Metadata+Transactions > > > > > > > > > > > > >> > > > > > > > > > > > > > >> to > > > > > > > > > > > > >> write all of the ZK metadata in a single > > transaction. > > > If > > > > > the > > > > > > > > > > > controller > > > > > > > > > > > > >> fails before this transaction is finalized, the > next > > > > > active > > > > > > > > > > controller > > > > > > > > > > > > >> will > > > > > > > > > > > > >> abort the transaction and restart the migration > > > > process.” > > > > > > What > > > > > > > > > > happens > > > > > > > > > > > > if > > > > > > > > > > > > >> we commit the transaction then fail right after > that > > > and > > > > > > > > restart? > > > > > > > > > > This > > > > > > > > > > > > >> sort > > > > > > > > > > > > >> of leads to me wondering if we will/should check > the > > > > > > metadata > > > > > > > > log > > > > > > > > > > > state > > > > > > > > > > > > >> before doing the migration? That could also catch > > > > mistakes > > > > > > > such > > > > > > > > as > > > > > > > > > > if > > > > > > > > > > > a > > > > > > > > > > > > >> KRaft quorum is used that already had some > metadata > > > > which > > > > > I > > > > > > > > assume > > > > > > > > > > we > > > > > > > > > > > > want > > > > > > > > > > > > >> to prevent. > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> For the Test Plan section do you think it’s worth > > > > calling > > > > > > out > > > > > > > > > > > > performance > > > > > > > > > > > > >> testing migrations of ZooKeeper deployments that > > have > > > > > > “large” > > > > > > > > > > amounts > > > > > > > > > > > of > > > > > > > > > > > > >> metadata? > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> Thanks, > > > > > > > > > > > > >> > > > > > > > > > > > > >> Andrew > > > > > > > > > > > > >> > > > > > > > > > > > > >> On Mon, Oct 24, 2022 at 3:20 AM Luke Chen < > > > > > > show...@gmail.com> > > > > > > > > > > wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > >> > Hi David, > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > Thanks for the good and complicated proposal! :) > > > > > > > > > > > > >> > Some questions: > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 1. "MigrationReady" state: The KIP said: > > > > > > > > > > > > >> > The KRaft quorum has been started > > > > > > > > > > > > >> > I don't think we'll automatically enter this > state > > > > after > > > > > > > KRaft > > > > > > > > > > > quorum > > > > > > > > > > > > >> > started, do we? > > > > > > > > > > > > >> > I think KRaft active controller should take over > > > > > > leadership > > > > > > > by > > > > > > > > > > > > writing a > > > > > > > > > > > > >> > znode in /controller path before we entering > this > > > > sate, > > > > > is > > > > > > > > that > > > > > > > > > > > > correct? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 2. "MigrationActive" state: the KIP said: > > > > > > > > > > > > >> > ZK state has been migrated, controller is in > > > > dual-write > > > > > > > mode, > > > > > > > > > > > brokers > > > > > > > > > > > > >> are > > > > > > > > > > > > >> > being restarted in KRaft mode > > > > > > > > > > > > >> > What confuses me is the last part: "brokers are > > > being > > > > > > > > restarted > > > > > > > > > in > > > > > > > > > > > > KRaft > > > > > > > > > > > > >> > mode". > > > > > > > > > > > > >> > How could we detect brokers are being restarted > in > > > > KRaft > > > > > > > mode? > > > > > > > > > Old > > > > > > > > > > > ZK > > > > > > > > > > > > >> > broker is removed and new KRaft broker is up > > within > > > N > > > > > > > minutes? > > > > > > > > > > > > >> > I think we don't have to rely on the condition > > > > "brokers > > > > > > are > > > > > > > > > being > > > > > > > > > > > > >> restarted > > > > > > > > > > > > >> > in KRaft mode" to enter this state. > > > > > > > > > > > > >> > "brokers are being restarted in KRaft mode" > should > > > be > > > > a > > > > > > > > process > > > > > > > > > > > > happened > > > > > > > > > > > > >> > between "MigrationActive" and > "MigrationFinished". > > > > Does > > > > > > that > > > > > > > > > make > > > > > > > > > > > > sense? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 3. When "Zookeeper" mode trying to enter > > > > > > > "MigrationEligible", > > > > > > > > if > > > > > > > > > > > there > > > > > > > > > > > > >> is a > > > > > > > > > > > > >> > cluster with tens of brokers, how could users > know > > > why > > > > > > this > > > > > > > > > > cluster > > > > > > > > > > > > >> cannot > > > > > > > > > > > > >> > be in "MigrationEligible" state, yet? Check that > > > znode > > > > > > > > manually > > > > > > > > > > one > > > > > > > > > > > by > > > > > > > > > > > > >> one? > > > > > > > > > > > > >> > Or do we plan to create a tool to help them? Or > > > maybe > > > > > > expose > > > > > > > > the > > > > > > > > > > > > >> "unready > > > > > > > > > > > > >> > ZK brokers" metrics? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 4. Same for "MigrationActive" entering > > > > > "MigrationFinished" > > > > > > > > > state. > > > > > > > > > > > > Since > > > > > > > > > > > > >> > that will be some process for restarting ZK > broker > > > > into > > > > > > > KRaft > > > > > > > > > > broker > > > > > > > > > > > > >> one by > > > > > > > > > > > > >> > one, could we expose the "remaining ZK brokers" > as > > > > > > metrics? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 5. When users are in > > > > > > > > > > > > >> > > "MigrationReady"/"MigrationActive"/"MigrationFinished" > > > > > > > > > > > > >> > states, and they accidentally change the KRaft > > > > > controller > > > > > > > > > config: > > > > > > > > > > > > >> > "kafka.metadata.migration.enable" > > > > > > > > > > > > >> > to false. What will happen to this cluster? No > > > > > dual-write > > > > > > > for > > > > > > > > > it? > > > > > > > > > > > Will > > > > > > > > > > > > >> we > > > > > > > > > > > > >> > have any protection for it? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 6. About the "MigrationState" metric: > > > > > > > > > > > > >> > The "ZooKeeper" and "MigrationEligible" is > > reported > > > by > > > > > ZK > > > > > > > > > > > controller, > > > > > > > > > > > > >> and > > > > > > > > > > > > >> > the rest of states are reported by KRaft > > controller > > > > --> > > > > > > > makes > > > > > > > > > > sense > > > > > > > > > > > > to > > > > > > > > > > > > >> me. > > > > > > > > > > > > >> > One question from it is, when KRaft controller > > takes > > > > > over > > > > > > > the > > > > > > > > > > > > leadership > > > > > > > > > > > > >> > from ZK controller, what will the > "MigrationState" > > > > value > > > > > > in > > > > > > > > old > > > > > > > > > ZK > > > > > > > > > > > > >> > controller? keep in "MigrationEligible" doesn't > > make > > > > > > sense. > > > > > > > > Will > > > > > > > > > > > there > > > > > > > > > > > > >> be a > > > > > > > > > > > > >> > empty or null state? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > 7. About the "MetadataType" metric: > > > > > > > > > > > > >> > An enumeration of: ZooKeeper (1), Dual (2), > KRaft > > > (3). > > > > > > Each > > > > > > > > > broker > > > > > > > > > > > > >> reports > > > > > > > > > > > > >> > this. > > > > > > > > > > > > >> > I don't know how we could map the migration > state > > to > > > > > > these 3 > > > > > > > > > > types. > > > > > > > > > > > > >> > What is the metadataType when cluster in > > > > > "MigrationReady" > > > > > > > > state? > > > > > > > > > > > Still > > > > > > > > > > > > >> > Zookeeper? > > > > > > > > > > > > >> > When will brokers enter Dual type? > > > > > > > > > > > > >> > This is unclear in the KIP. > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > Thank you. > > > > > > > > > > > > >> > Luke > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > On Thu, Oct 20, 2022 at 11:33 PM David Arthur > > > > > > > > > > > > >> > <david.art...@confluent.io.invalid> wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > Igor, thanks for taking a look! Since JBOD in > > > KRaft > > > > is > > > > > > > still > > > > > > > > > > under > > > > > > > > > > > > >> > > discussion and not likely to land before the > ZK > > > > > > > migration, I > > > > > > > > > > think > > > > > > > > > > > > >> we'll > > > > > > > > > > > > >> > > need to defer it. For migrating JBOD clusters > > from > > > > ZK > > > > > to > > > > > > > > > KRaft, > > > > > > > > > > > > we'll > > > > > > > > > > > > >> > also > > > > > > > > > > > > >> > > need to address the log dir failure mechanism > > > which > > > > > > > > currently > > > > > > > > > > > uses a > > > > > > > > > > > > >> > > special ZNode written to by the brokers. There > > is > > > an > > > > > old > > > > > > > KIP > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller > > > > > > > > > > > > >> > > which proposes a new RPC to move that ZK write > > to > > > > the > > > > > > > > > > controller, > > > > > > > > > > > > but > > > > > > > > > > > > >> I'm > > > > > > > > > > > > >> > > not sure if that's the approach we'll want to > > > take. > > > > I > > > > > > read > > > > > > > > > > through > > > > > > > > > > > > the > > > > > > > > > > > > >> > > discussion over in KIP-858 and it sounds like > > > there > > > > > are > > > > > > > some > > > > > > > > > > good > > > > > > > > > > > > >> ideas > > > > > > > > > > > > >> > > there. > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > To answer your question more directly, > migrating > > > ZK > > > > > > > clusters > > > > > > > > > > using > > > > > > > > > > > > >> JBOD > > > > > > > > > > > > >> > is > > > > > > > > > > > > >> > > out of scope for this KIP. It might be > possible > > to > > > > > stop > > > > > > > > using > > > > > > > > > > JBOD > > > > > > > > > > > > >> using > > > > > > > > > > > > >> > > reassignments, but I'm not sure about that. > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > -David > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > On Tue, Oct 18, 2022 at 12:17 PM Igor Soarez < > > > > > > i...@soarez.me > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > Hi David, > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > Thanks for the KIP, this is very exciting! > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > How does JBOD relate to this work? KRaft > mode > > > > > doesn't > > > > > > > yet > > > > > > > > > > > support > > > > > > > > > > > > >> > > > configuring brokers with multiple log > > > directories. > > > > > If > > > > > > > the > > > > > > > > > > > brokers > > > > > > > > > > > > in > > > > > > > > > > > > >> > the > > > > > > > > > > > > >> > > > existing cluster are configured with > multiple > > > log > > > > > > dirs, > > > > > > > > does > > > > > > > > > > the > > > > > > > > > > > > >> > > migration > > > > > > > > > > > > >> > > > imply that the existing brokers need to drop > > use > > > > of > > > > > > that > > > > > > > > > > > feature? > > > > > > > > > > > > >> Or is > > > > > > > > > > > > >> > > > there some way to upgrade them later? > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > Thanks, > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > -- > > > > > > > > > > > > >> > > > Igor > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > On Mon, Oct 17, 2022, at 10:07 PM, David > > Arthur > > > > > wrote: > > > > > > > > > > > > >> > > > > I've updated the KIP with the following > > > changes > > > > > (the > > > > > > > > > > > confluence > > > > > > > > > > > > >> diff > > > > > > > > > > > > >> > is > > > > > > > > > > > > >> > > > not > > > > > > > > > > > > >> > > > > helpful here since i rearranged some > things) > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > * Added ZooKeeperBlockingKRaftMillis > metric > > > > > > > > > > > > >> > > > > * Added section on new broker registration > > > JSON > > > > > > > > > > > > >> > > > > * Removed section on MigrationCheck RPC > > > > > > > > > > > > >> > > > > * Added change to UpdateMetadataRequest > > > > > > > > > > > > >> > > > > * Added section "Additional ZK Broker > > Configs" > > > > > > > (includes > > > > > > > > > > > configs > > > > > > > > > > > > >> to > > > > > > > > > > > > >> > > > connect > > > > > > > > > > > > >> > > > > to KRaft quorum) > > > > > > > > > > > > >> > > > > * Added section on "Incompatible Brokers" > > > under > > > > > > > Failure > > > > > > > > > > Modes > > > > > > > > > > > > >> > > > > * Clarified many things per this > discussion > > > > thread > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > I realized we need the KRaft controller to > > > pick > > > > > the > > > > > > > > > correct > > > > > > > > > > > > >> > > > > "metadata.version" when initializing the > > > > > migration. > > > > > > I > > > > > > > > > > included > > > > > > > > > > > > the > > > > > > > > > > > > >> > IBP > > > > > > > > > > > > >> > > > of a > > > > > > > > > > > > >> > > > > broker in its registration data so the > KRaft > > > > > > > controller > > > > > > > > > can > > > > > > > > > > > > verify > > > > > > > > > > > > >> > the > > > > > > > > > > > > >> > > > IBP > > > > > > > > > > > > >> > > > > and pick the correct "metadata.version" > when > > > > > > starting > > > > > > > > the > > > > > > > > > > > > >> migration. > > > > > > > > > > > > >> > > > > Otherwise, we could inadvertently > downgrade > > > the > > > > > > > > > > > > >> IBP/metadata.version > > > > > > > > > > > > >> > as > > > > > > > > > > > > >> > > > > part of the migration. > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > I also added "clusterId" to the broker > > > > > registration > > > > > > > data > > > > > > > > > so > > > > > > > > > > > the > > > > > > > > > > > > >> KRaft > > > > > > > > > > > > >> > > > could > > > > > > > > > > > > >> > > > > verify it. > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > -David > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > On Mon, Oct 17, 2022 at 12:14 PM Colin > > McCabe > > > < > > > > > > > > > > > > cmcc...@apache.org > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > >> On Fri, Oct 14, 2022, at 11:39, Jun Rao > > > wrote: > > > > > > > > > > > > >> > > > >> > Hi, Colin, > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > > >> > > > >> > 10. "That all goes away in the new > mode, > > > and > > > > we > > > > > > > just > > > > > > > > > have > > > > > > > > > > > > some > > > > > > > > > > > > >> > code > > > > > > > > > > > > >> > > > which > > > > > > > > > > > > >> > > > >> > analyzes __cluster_metadata and > reflects > > it > > > > in > > > > > 1) > > > > > > > > > updates > > > > > > > > > > > to > > > > > > > > > > > > ZK > > > > > > > > > > > > >> > and > > > > > > > > > > > > >> > > 2) > > > > > > > > > > > > >> > > > >> > messages sent out to brokers." > > > > > > > > > > > > >> > > > >> > Hmm, I am not sure it's that simple. > Some > > > of > > > > > the > > > > > > > > > > complexity > > > > > > > > > > > > of > > > > > > > > > > > > >> the > > > > > > > > > > > > >> > > > >> ZK-based > > > > > > > > > > > > >> > > > >> > controller are (1) using the pipelining > > > > > approach > > > > > > to > > > > > > > > > write > > > > > > > > > > > to > > > > > > > > > > > > ZK > > > > > > > > > > > > >> > for > > > > > > > > > > > > >> > > > >> better > > > > > > > > > > > > >> > > > >> > throughput and using conditional writes > > for > > > > > > > > > correctness; > > > > > > > > > > > (2) > > > > > > > > > > > > >> > sending > > > > > > > > > > > > >> > > > the > > > > > > > > > > > > >> > > > >> > proper LeaderAndIsr and UpdateMetadata > > > > > requests. > > > > > > > For > > > > > > > > > > > example, > > > > > > > > > > > > >> > during > > > > > > > > > > > > >> > > > >> > controller failover, the full metadata > > > needs > > > > to > > > > > > be > > > > > > > > sent > > > > > > > > > > > while > > > > > > > > > > > > >> > during > > > > > > > > > > > > >> > > > >> > individual broker failure, only some of > > the > > > > > > > metadata > > > > > > > > > > needs > > > > > > > > > > > to > > > > > > > > > > > > >> be > > > > > > > > > > > > >> > > > updated. > > > > > > > > > > > > >> > > > >> > The controlled shutdown handling > > sometimes > > > > uses > > > > > > > > > > > > >> StopReplicaRequest > > > > > > > > > > > > >> > > > and > > > > > > > > > > > > >> > > > >> > some other times uses > > LeaderAndIsrRequest. > > > > (3) > > > > > > > > > triggering > > > > > > > > > > > new > > > > > > > > > > > > >> > events > > > > > > > > > > > > >> > > > >> based > > > > > > > > > > > > >> > > > >> > on the responses of LeaderAndIsr (e.g. > > for > > > > > topic > > > > > > > > > > deletion). > > > > > > > > > > > > >> Some > > > > > > > > > > > > >> > of > > > > > > > > > > > > >> > > > those > > > > > > > > > > > > >> > > > >> > complexity could be re-implemented in a > > > more > > > > > > > > efficient > > > > > > > > > > way, > > > > > > > > > > > > >> but we > > > > > > > > > > > > >> > > > need > > > > > > > > > > > > >> > > > >> to > > > > > > > > > > > > >> > > > >> > be really careful not to generate > > > regression. > > > > > > Some > > > > > > > of > > > > > > > > > the > > > > > > > > > > > > other > > > > > > > > > > > > >> > > > >> complexity > > > > > > > > > > > > >> > > > >> > just won't go away. Reimplementing all > > > those > > > > > > logic > > > > > > > > for > > > > > > > > > > the > > > > > > > > > > > 30 > > > > > > > > > > > > >> or > > > > > > > > > > > > >> > so > > > > > > > > > > > > >> > > > >> events > > > > > > > > > > > > >> > > > >> > in the ZK-based controller is possible, > > but > > > > > > seems a > > > > > > > > bit > > > > > > > > > > > > >> daunting > > > > > > > > > > > > >> > and > > > > > > > > > > > > >> > > > >> risky. > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> Hi Jun, > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> Thanks for the response. I agree that > > there > > > is > > > > > > some > > > > > > > > work > > > > > > > > > > > here > > > > > > > > > > > > >> but I > > > > > > > > > > > > >> > > > don't > > > > > > > > > > > > >> > > > >> think it's as bad as it might seem. Most > of > > > the > > > > > > code > > > > > > > > for > > > > > > > > > > > > writing > > > > > > > > > > > > >> to > > > > > > > > > > > > >> > ZK > > > > > > > > > > > > >> > > > can > > > > > > > > > > > > >> > > > >> be reused from the old controller. We > > should > > > at > > > > > > least > > > > > > > > > > > evaluate > > > > > > > > > > > > >> using > > > > > > > > > > > > >> > > the > > > > > > > > > > > > >> > > > >> old ControllerChannelManager as well. I > > don't > > > > > know > > > > > > if > > > > > > > > > we'll > > > > > > > > > > > end > > > > > > > > > > > > >> up > > > > > > > > > > > > >> > > > doing it > > > > > > > > > > > > >> > > > >> or not but it's a possibility. (The main > > > reason > > > > > to > > > > > > > not > > > > > > > > do > > > > > > > > > > it > > > > > > > > > > > is > > > > > > > > > > > > >> that > > > > > > > > > > > > >> > > the > > > > > > > > > > > > >> > > > >> response handling will be a bit > different) > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> The question of what to do during a > > > controller > > > > > > > failover > > > > > > > > > is > > > > > > > > > > an > > > > > > > > > > > > >> > > > interesting > > > > > > > > > > > > >> > > > >> one. Technically, we should be able to > > > continue > > > > > > > sending > > > > > > > > > > > > >> incremental > > > > > > > > > > > > >> > > > updates > > > > > > > > > > > > >> > > > >> to the legacy nodes, for the same reason > we > > > can > > > > > in > > > > > > > > KRaft > > > > > > > > > > > mode. > > > > > > > > > > > > >> > > However, > > > > > > > > > > > > >> > > > >> probably we should just copy what ZK mode > > > does > > > > > and > > > > > > > send > > > > > > > > > > them > > > > > > > > > > > a > > > > > > > > > > > > >> full > > > > > > > > > > > > >> > > > >> metadata update. This will allow us to > have > > > an > > > > > easy > > > > > > > way > > > > > > > > > to > > > > > > > > > > > > handle > > > > > > > > > > > > >> > > > >> divergences caused by lost RPCs by > changing > > > the > > > > > > > > > controller > > > > > > > > > > > > (just > > > > > > > > > > > > >> as > > > > > > > > > > > > >> > we > > > > > > > > > > > > >> > > > do > > > > > > > > > > > > >> > > > >> in ZK mode, unfortunately). I suppose we > > > should > > > > > > > > document > > > > > > > > > > this > > > > > > > > > > > > in > > > > > > > > > > > > >> the > > > > > > > > > > > > >> > > > KIP... > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> I agree controlled shutdown is tricky to > > get > > > > just > > > > > > > > right. > > > > > > > > > I > > > > > > > > > > > > >> suppose > > > > > > > > > > > > >> > > this > > > > > > > > > > > > >> > > > is > > > > > > > > > > > > >> > > > >> a case where the RPCs we send out are not > > > > purely > > > > > > > "fire > > > > > > > > > and > > > > > > > > > > > > >> forget"; > > > > > > > > > > > > >> > we > > > > > > > > > > > > >> > > > have > > > > > > > > > > > > >> > > > >> to listen for the response. But that can > be > > > > done > > > > > in > > > > > > > an > > > > > > > > > > > > >> event-based > > > > > > > > > > > > >> > > way. > > > > > > > > > > > > >> > > > >> Controlled shutdown is also probably the > > last > > > > > thing > > > > > > > > we'll > > > > > > > > > > > > >> implement > > > > > > > > > > > > >> > > > once we > > > > > > > > > > > > >> > > > >> have the basic lift and shift. > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> Topic deletion is another annoying > thing. I > > > > > wonder > > > > > > if > > > > > > > > we > > > > > > > > > > > could > > > > > > > > > > > > >> just > > > > > > > > > > > > >> > > > delete > > > > > > > > > > > > >> > > > >> immediately here and skip the complexity > of > > > > > > > > implementing > > > > > > > > > > > > >> "deleting > > > > > > > > > > > > >> > > > state." > > > > > > > > > > > > >> > > > >> Topic IDs will exist in IBP 3.4, in both > ZK > > > and > > > > > > KRaft > > > > > > > > > mode, > > > > > > > > > > > so > > > > > > > > > > > > it > > > > > > > > > > > > >> > > > should be > > > > > > > > > > > > >> > > > >> OK, right? > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> best, > > > > > > > > > > > > >> > > > >> Colin > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > > >> > > > >> > Thanks, > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > > >> > > > >> > Jun > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > > >> > > > >> > On Fri, Oct 14, 2022 at 9:29 AM Colin > > > McCabe > > > > < > > > > > > > > > > > > >> cmcc...@apache.org> > > > > > > > > > > > > >> > > > wrote: > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > > >> > > > >> >> On Thu, Oct 13, 2022, at 11:44, Jun > Rao > > > > wrote: > > > > > > > > > > > > >> > > > >> >> > Hi, Colin, > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > Thanks for the reply. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > 10. This is a bit on the > > implementation > > > > > side. > > > > > > If > > > > > > > > you > > > > > > > > > > > look > > > > > > > > > > > > at > > > > > > > > > > > > >> > the > > > > > > > > > > > > >> > > > >> existing > > > > > > > > > > > > >> > > > >> >> > ZK-based controller, most of the > logic > > > is > > > > > > around > > > > > > > > > > > > >> maintaining an > > > > > > > > > > > > >> > > > >> in-memory > > > > > > > > > > > > >> > > > >> >> > state of all the resources (broker, > > > topic, > > > > > > > > > partition, > > > > > > > > > > > > etc), > > > > > > > > > > > > >> > > > >> >> reading/writing > > > > > > > > > > > > >> > > > >> >> > to ZK, sending LeaderAndIsr and > > > > > UpdateMetadata > > > > > > > > > > requests > > > > > > > > > > > > and > > > > > > > > > > > > >> > > > handling > > > > > > > > > > > > >> > > > >> the > > > > > > > > > > > > >> > > > >> >> > responses to brokers. So we need all > > > that > > > > > > logic > > > > > > > in > > > > > > > > > the > > > > > > > > > > > > dual > > > > > > > > > > > > >> > write > > > > > > > > > > > > >> > > > >> mode. > > > > > > > > > > > > >> > > > >> >> One > > > > > > > > > > > > >> > > > >> >> > option is to duplicate all that > logic > > in > > > > > some > > > > > > > new > > > > > > > > > > code. > > > > > > > > > > > > This > > > > > > > > > > > > >> > can > > > > > > > > > > > > >> > > > be a > > > > > > > > > > > > >> > > > >> bit > > > > > > > > > > > > >> > > > >> >> > error prone and makes the code a bit > > > > harder > > > > > to > > > > > > > > > > maintain > > > > > > > > > > > if > > > > > > > > > > > > >> we > > > > > > > > > > > > >> > > need > > > > > > > > > > > > >> > > > to > > > > > > > > > > > > >> > > > >> fix > > > > > > > > > > > > >> > > > >> >> > some critical issues in ZK-based > > > > > controllers. > > > > > > > > > Another > > > > > > > > > > > > >> option is > > > > > > > > > > > > >> > > to > > > > > > > > > > > > >> > > > try > > > > > > > > > > > > >> > > > >> >> > reusing the existing code in the > > > ZK-based > > > > > > > > > controller. > > > > > > > > > > > For > > > > > > > > > > > > >> > > example, > > > > > > > > > > > > >> > > > we > > > > > > > > > > > > >> > > > >> >> could > > > > > > > > > > > > >> > > > >> >> > start the EventManager in the > ZK-based > > > > > > > controller, > > > > > > > > > but > > > > > > > > > > > let > > > > > > > > > > > > >> the > > > > > > > > > > > > >> > > > KRaft > > > > > > > > > > > > >> > > > >> >> > controller ingest new events. This > has > > > its > > > > > own > > > > > > > > > > > challenges: > > > > > > > > > > > > >> (1) > > > > > > > > > > > > >> > > the > > > > > > > > > > > > >> > > > >> >> existing > > > > > > > > > > > > >> > > > >> >> > logic only logs ZK failures and > > doesn't > > > > > expose > > > > > > > > them > > > > > > > > > to > > > > > > > > > > > the > > > > > > > > > > > > >> > > caller; > > > > > > > > > > > > >> > > > (2) > > > > > > > > > > > > >> > > > >> >> the > > > > > > > > > > > > >> > > > >> >> > existing logic may add new events to > > the > > > > > queue > > > > > > > > > itself > > > > > > > > > > > and > > > > > > > > > > > > we > > > > > > > > > > > > >> > > > probably > > > > > > > > > > > > >> > > > >> >> need > > > > > > > > > > > > >> > > > >> >> > to think through how this is > > coordinated > > > > > with > > > > > > > the > > > > > > > > > > KRaft > > > > > > > > > > > > >> > > controller; > > > > > > > > > > > > >> > > > >> (3) > > > > > > > > > > > > >> > > > >> >> it > > > > > > > > > > > > >> > > > >> >> > registers some ZK listeners > > > unnecessarily > > > > > (may > > > > > > > not > > > > > > > > > be > > > > > > > > > > a > > > > > > > > > > > > big > > > > > > > > > > > > >> > > > concern). > > > > > > > > > > > > >> > > > >> So > > > > > > > > > > > > >> > > > >> >> we > > > > > > > > > > > > >> > > > >> >> > need to get around those issues > > > somehow. I > > > > > am > > > > > > > > > > wondering > > > > > > > > > > > if > > > > > > > > > > > > >> we > > > > > > > > > > > > >> > > have > > > > > > > > > > > > >> > > > >> >> > considered both options and which > > > approach > > > > > we > > > > > > > are > > > > > > > > > > > leaning > > > > > > > > > > > > >> > towards > > > > > > > > > > > > >> > > > for > > > > > > > > > > > > >> > > > >> the > > > > > > > > > > > > >> > > > >> >> > implementation. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> Yes, this is a good question. My take > is > > > > that > > > > > a > > > > > > > big > > > > > > > > > part > > > > > > > > > > > of > > > > > > > > > > > > >> the > > > > > > > > > > > > >> > > > >> complexity > > > > > > > > > > > > >> > > > >> >> in the old controller code results > from > > > the > > > > > fact > > > > > > > > that > > > > > > > > > we > > > > > > > > > > > use > > > > > > > > > > > > >> ZK > > > > > > > > > > > > >> > as > > > > > > > > > > > > >> > > a > > > > > > > > > > > > >> > > > >> >> multi-writer database for propagating > > > > > > information > > > > > > > > > > between > > > > > > > > > > > > >> > different > > > > > > > > > > > > >> > > > >> >> components. So in the old controller, > > > every > > > > > > write > > > > > > > to > > > > > > > > > ZK > > > > > > > > > > > > needs > > > > > > > > > > > > >> to > > > > > > > > > > > > >> > be > > > > > > > > > > > > >> > > > >> >> structured as a compare and swap to be > > > fully > > > > > > > > correct. > > > > > > > > > > > Every > > > > > > > > > > > > >> time > > > > > > > > > > > > >> > we > > > > > > > > > > > > >> > > > get > > > > > > > > > > > > >> > > > >> >> notified about something, it's usually > > in > > > > the > > > > > > form > > > > > > > > of > > > > > > > > > > > "this > > > > > > > > > > > > >> znode > > > > > > > > > > > > >> > > > >> changed" > > > > > > > > > > > > >> > > > >> >> which prompts a full reload of part of > > the > > > > > data > > > > > > in > > > > > > > > ZK > > > > > > > > > > > (which > > > > > > > > > > > > >> > itself > > > > > > > > > > > > >> > > > has > > > > > > > > > > > > >> > > > >> >> multiple parts, loading, > deserializing, > > > > > > > reconciling, > > > > > > > > > > etc.) > > > > > > > > > > > > >> That > > > > > > > > > > > > >> > all > > > > > > > > > > > > >> > > > goes > > > > > > > > > > > > >> > > > >> >> away in the new mode, and we just have > > > some > > > > > code > > > > > > > > which > > > > > > > > > > > > >> analyzes > > > > > > > > > > > > >> > > > >> >> __cluster_metadata and reflects it in > 1) > > > > > updates > > > > > > > to > > > > > > > > ZK > > > > > > > > > > and > > > > > > > > > > > > 2) > > > > > > > > > > > > >> > > > messages > > > > > > > > > > > > >> > > > >> sent > > > > > > > > > > > > >> > > > >> >> out to brokers. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> This is pretty decoupled from the > other > > > > logic > > > > > in > > > > > > > > > > > > >> QuorumController > > > > > > > > > > > > >> > > and > > > > > > > > > > > > >> > > > >> >> should be easy to unit test, since the > > > same > > > > > > inputs > > > > > > > > > from > > > > > > > > > > > the > > > > > > > > > > > > >> log > > > > > > > > > > > > >> > > > always > > > > > > > > > > > > >> > > > >> >> produce the same output in ZK. > > Basically, > > > ZK > > > > > is > > > > > > > > > > write-only > > > > > > > > > > > > for > > > > > > > > > > > > >> > us, > > > > > > > > > > > > >> > > > we do > > > > > > > > > > > > >> > > > >> >> not read it (with the big exception of > > > > broker > > > > > > > > > > registration > > > > > > > > > > > > >> > znodes) > > > > > > > > > > > > >> > > > and I > > > > > > > > > > > > >> > > > >> >> think that will greatly simplify > things. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> So I think dual-write mode as > described > > > here > > > > > > will > > > > > > > be > > > > > > > > > > > > >> > substantially > > > > > > > > > > > > >> > > > >> simpler > > > > > > > > > > > > >> > > > >> >> than trying to run part or all of the > > old > > > > > > > controller > > > > > > > > > in > > > > > > > > > > > > >> > parallel. I > > > > > > > > > > > > >> > > > do > > > > > > > > > > > > >> > > > >> >> think we will reuse a bunch of the > > > > > > serialization / > > > > > > > > > > > > >> > deserialization > > > > > > > > > > > > >> > > > code > > > > > > > > > > > > >> > > > >> for > > > > > > > > > > > > >> > > > >> >> znodes and possibly the code for > > > > communicating > > > > > > > with > > > > > > > > > ZK. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> best, > > > > > > > > > > > > >> > > > >> >> Colin > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > 14. Good point and make sense. > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > Thanks, > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > Jun > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> > On Wed, Oct 12, 2022 at 3:27 PM > Colin > > > > > McCabe < > > > > > > > > > > > > >> > cmcc...@apache.org > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > >> wrote: > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > > >> > > > >> >> >> Hi Jun, > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> Thanks for taking a look. I can > > answer > > > > some > > > > > > > > > questions > > > > > > > > > > > > here > > > > > > > > > > > > >> > > > because I > > > > > > > > > > > > >> > > > >> >> >> collaborated on this a bit, and > David > > > is > > > > on > > > > > > > > > vacation > > > > > > > > > > > for > > > > > > > > > > > > a > > > > > > > > > > > > >> few > > > > > > > > > > > > >> > > > days. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> On Wed, Oct 12, 2022, at 14:41, Jun > > Rao > > > > > > wrote: > > > > > > > > > > > > >> > > > >> >> >> > Hi, David, > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > Thanks for the KIP. A few > comments > > > > below. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > 10. It's still not very clear to > me > > > how > > > > > the > > > > > > > > KRaft > > > > > > > > > > > > >> controller > > > > > > > > > > > > >> > > > works > > > > > > > > > > > > >> > > > >> in > > > > > > > > > > > > >> > > > >> >> the > > > > > > > > > > > > >> > > > >> >> >> > dual writes mode to KRaft log and > > ZK > > > > when > > > > > > the > > > > > > > > > > brokers > > > > > > > > > > > > >> still > > > > > > > > > > > > >> > > run > > > > > > > > > > > > >> > > > in > > > > > > > > > > > > >> > > > >> ZK > > > > > > > > > > > > >> > > > >> >> >> mode. > > > > > > > > > > > > >> > > > >> >> >> > Does the KRaft controller run a > ZK > > > > based > > > > > > > > > controller > > > > > > > > > > > in > > > > > > > > > > > > >> > > parallel > > > > > > > > > > > > >> > > > or > > > > > > > > > > > > >> > > > >> do > > > > > > > > > > > > >> > > > >> >> we > > > > > > > > > > > > >> > > > >> >> >> > derive what needs to be written > to > > ZK > > > > > based > > > > > > > on > > > > > > > > > > KRaft > > > > > > > > > > > > >> > > controller > > > > > > > > > > > > >> > > > >> logic? > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> We derive what needs to be written > to > > > ZK > > > > > > based > > > > > > > on > > > > > > > > > > KRaft > > > > > > > > > > > > >> > > controller > > > > > > > > > > > > >> > > > >> >> logic. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > I am also not sure how the KRaft > > > > > controller > > > > > > > > > handles > > > > > > > > > > > > >> broker > > > > > > > > > > > > >> > > > >> >> >> > registration/deregistration, > since > > > > > brokers > > > > > > > are > > > > > > > > > > still > > > > > > > > > > > > >> running > > > > > > > > > > > > >> > > in > > > > > > > > > > > > >> > > > ZK > > > > > > > > > > > > >> > > > >> >> mode > > > > > > > > > > > > >> > > > >> >> >> and > > > > > > > > > > > > >> > > > >> >> >> > are not heartbeating to the KRaft > > > > > > controller. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> The new controller will listen for > > > broker > > > > > > > > > > registrations > > > > > > > > > > > > >> under > > > > > > > > > > > > >> > > > >> /brokers. > > > > > > > > > > > > >> > > > >> >> >> This is the only znode watch that > the > > > new > > > > > > > > > controller > > > > > > > > > > > will > > > > > > > > > > > > >> do. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> We did consider changing how > ZK-based > > > > > broker > > > > > > > > > > > registration > > > > > > > > > > > > >> > > worked, > > > > > > > > > > > > >> > > > >> but it > > > > > > > > > > > > >> > > > >> >> >> just ended up being too much work > for > > > not > > > > > > > enough > > > > > > > > > > gain. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > 12. "A new set of nodes will be > > > > > provisioned > > > > > > > to > > > > > > > > > host > > > > > > > > > > > the > > > > > > > > > > > > >> > > > controller > > > > > > > > > > > > >> > > > >> >> >> quorum." > > > > > > > > > > > > >> > > > >> >> >> > I guess we don't support starting > > the > > > > > KRaft > > > > > > > > > > > controller > > > > > > > > > > > > >> > quorum > > > > > > > > > > > > >> > > on > > > > > > > > > > > > >> > > > >> >> existing > > > > > > > > > > > > >> > > > >> >> >> > brokers. It would be useful to > make > > > > that > > > > > > > clear. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> Agreed > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > 13. "Once the quorum is > established > > > > and a > > > > > > > > leader > > > > > > > > > is > > > > > > > > > > > > >> elected, > > > > > > > > > > > > >> > > the > > > > > > > > > > > > >> > > > >> >> >> controller > > > > > > > > > > > > >> > > > >> >> >> > will check the state of the > cluster > > > > using > > > > > > the > > > > > > > > > > > > >> MigrationCheck > > > > > > > > > > > > >> > > > RPC." > > > > > > > > > > > > >> > > > >> How > > > > > > > > > > > > >> > > > >> >> >> does > > > > > > > > > > > > >> > > > >> >> >> > the quorum controller detect > other > > > > > brokers? > > > > > > > > Does > > > > > > > > > > the > > > > > > > > > > > > >> > > controller > > > > > > > > > > > > >> > > > >> node > > > > > > > > > > > > >> > > > >> >> need > > > > > > > > > > > > >> > > > >> >> >> > to be configured with ZK > connection > > > > > string? > > > > > > > If > > > > > > > > > so, > > > > > > > > > > it > > > > > > > > > > > > >> would > > > > > > > > > > > > >> > be > > > > > > > > > > > > >> > > > >> useful > > > > > > > > > > > > >> > > > >> >> to > > > > > > > > > > > > >> > > > >> >> >> > document the additional configs > > that > > > > the > > > > > > > quorum > > > > > > > > > > > > >> controller > > > > > > > > > > > > >> > > > needs to > > > > > > > > > > > > >> > > > >> >> set. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> Yes, the controllers monitor ZK for > > > > broker > > > > > > > > > > > registrations, > > > > > > > > > > > > >> as I > > > > > > > > > > > > >> > > > >> mentioned > > > > > > > > > > > > >> > > > >> >> >> above. So they need zk.connect and > > the > > > > > other > > > > > > ZK > > > > > > > > > > > > connection > > > > > > > > > > > > >> > > > >> >> configurations. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > 14. "In order to prevent further > > > writes > > > > > to > > > > > > > ZK, > > > > > > > > > the > > > > > > > > > > > > first > > > > > > > > > > > > >> > thing > > > > > > > > > > > > >> > > > the > > > > > > > > > > > > >> > > > >> new > > > > > > > > > > > > >> > > > >> >> >> > KRaft quorum must do is take over > > > > > > leadership > > > > > > > of > > > > > > > > > the > > > > > > > > > > > ZK > > > > > > > > > > > > >> > > > controller. > > > > > > > > > > > > >> > > > >> " > > > > > > > > > > > > >> > > > >> >> The > > > > > > > > > > > > >> > > > >> >> >> ZK > > > > > > > > > > > > >> > > > >> >> >> > controller processing changes to > > > > > > /controller > > > > > > > > > update > > > > > > > > > > > > >> > > > asynchronously. > > > > > > > > > > > > >> > > > >> >> How > > > > > > > > > > > > >> > > > >> >> >> > does the KRaft controller know > when > > > the > > > > > ZK > > > > > > > > > > controller > > > > > > > > > > > > has > > > > > > > > > > > > >> > > > resigned > > > > > > > > > > > > >> > > > >> >> before > > > > > > > > > > > > >> > > > >> >> >> > it can safely copy the ZK data? > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> This should be done through > > > > > > > > > > > > >> expectedControllerEpochZkVersion, > > > > > > > > > > > > >> > > just > > > > > > > > > > > > >> > > > >> like > > > > > > > > > > > > >> > > > >> >> in > > > > > > > > > > > > >> > > > >> >> >> ZK mode, right? We should bump this > > > epoch > > > > > > value > > > > > > > > so > > > > > > > > > > that > > > > > > > > > > > > any > > > > > > > > > > > > >> > > writes > > > > > > > > > > > > >> > > > >> from > > > > > > > > > > > > >> > > > >> >> the > > > > > > > > > > > > >> > > > >> >> >> old controller will not go > through. I > > > > agree > > > > > > we > > > > > > > > > should > > > > > > > > > > > > spell > > > > > > > > > > > > >> > this > > > > > > > > > > > > >> > > > out > > > > > > > > > > > > >> > > > >> in > > > > > > > > > > > > >> > > > >> >> the > > > > > > > > > > > > >> > > > >> >> >> KIP. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > 15. We have the following > > sentences. > > > > One > > > > > > says > > > > > > > > > > > > >> ControllerId > > > > > > > > > > > > >> > is > > > > > > > > > > > > >> > > a > > > > > > > > > > > > >> > > > >> random > > > > > > > > > > > > >> > > > >> >> >> > KRaft broker and the other says > > it's > > > > the > > > > > > > active > > > > > > > > > > > > >> controller. > > > > > > > > > > > > >> > > > Which > > > > > > > > > > > > >> > > > >> one > > > > > > > > > > > > >> > > > >> >> is > > > > > > > > > > > > >> > > > >> >> >> > correct? > > > > > > > > > > > > >> > > > >> >> >> > "UpdateMetadata: for certain > > metadata > > > > > > > changes, > > > > > > > > > the > > > > > > > > > > > > KRaft > > > > > > > > > > > > >> > > > controller > > > > > > > > > > > > >> > > > >> >> will > > > > > > > > > > > > >> > > > >> >> >> > need to send > UpdateMetadataRequests > > > to > > > > > the > > > > > > ZK > > > > > > > > > > > brokers. > > > > > > > > > > > > >> For > > > > > > > > > > > > >> > the > > > > > > > > > > > > >> > > > >> >> >> > “ControllerId” field in this > > request, > > > > the > > > > > > > > > > controller > > > > > > > > > > > > >> should > > > > > > > > > > > > >> > > > >> specify a > > > > > > > > > > > > >> > > > >> >> >> > random KRaft broker." > > > > > > > > > > > > >> > > > >> >> >> > "In the UpdateMetadataRequest > sent > > by > > > > the > > > > > > > KRaft > > > > > > > > > > > > >> controller > > > > > > > > > > > > >> > to > > > > > > > > > > > > >> > > > the > > > > > > > > > > > > >> > > > >> ZK > > > > > > > > > > > > >> > > > >> >> >> > brokers, the ControllerId will > > point > > > to > > > > > the > > > > > > > > > active > > > > > > > > > > > > >> > controller > > > > > > > > > > > > >> > > > which > > > > > > > > > > > > >> > > > >> >> will > > > > > > > > > > > > >> > > > >> >> >> be > > > > > > > > > > > > >> > > > >> >> >> > used for the inter-broker > > requests." > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> Yeah, this seems like an error to > me > > as > > > > > > well. A > > > > > > > > > > random > > > > > > > > > > > > >> value > > > > > > > > > > > > >> > is > > > > > > > > > > > > >> > > > not > > > > > > > > > > > > >> > > > >> >> really > > > > > > > > > > > > >> > > > >> >> >> useful. Plus the text here is > > > > > > > self-contradictory, > > > > > > > > > as > > > > > > > > > > > you > > > > > > > > > > > > >> > pointed > > > > > > > > > > > > >> > > > out. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> I suspect what we should do here is > > > add a > > > > > new > > > > > > > > > field, > > > > > > > > > > > > >> > > > >> KRaftControllerId, > > > > > > > > > > > > >> > > > >> >> >> and populate it with the real > > > controller > > > > > ID, > > > > > > > and > > > > > > > > > > leave > > > > > > > > > > > > the > > > > > > > > > > > > >> old > > > > > > > > > > > > >> > > > >> >> controllerId > > > > > > > > > > > > >> > > > >> >> >> field as -1. A ZK-based broker that > > > sees > > > > > this > > > > > > > can > > > > > > > > > > then > > > > > > > > > > > > >> consult > > > > > > > > > > > > >> > > its > > > > > > > > > > > > >> > > > >> >> >> controller.quorum.voters > > configuration > > > to > > > > > see > > > > > > > > where > > > > > > > > > > it > > > > > > > > > > > > >> should > > > > > > > > > > > > >> > > send > > > > > > > > > > > > >> > > > >> >> >> controller-bound RPCs. That > (static) > > > > > > > > configuration > > > > > > > > > > lets > > > > > > > > > > > > us > > > > > > > > > > > > >> map > > > > > > > > > > > > >> > > > >> between > > > > > > > > > > > > >> > > > >> >> >> controller ID and host:port. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> We should still keep our existing > > epoch > > > > > logic > > > > > > > for > > > > > > > > > > > > deciding > > > > > > > > > > > > >> > when > > > > > > > > > > > > >> > > > >> >> >> UpdateMetadataRequest / > > > > > LeaderAndIsrRequests > > > > > > > are > > > > > > > > > > stale, > > > > > > > > > > > > >> with > > > > > > > > > > > > >> > the > > > > > > > > > > > > >> > > > >> caveat > > > > > > > > > > > > >> > > > >> >> >> that any kraft-based epoch should > be > > > > > treated > > > > > > as > > > > > > > > > > greater > > > > > > > > > > > > >> than > > > > > > > > > > > > >> > any > > > > > > > > > > > > >> > > > >> >> ZK-based > > > > > > > > > > > > >> > > > >> >> >> epoch. After all, the kraft epoch > is > > > > coming > > > > > > > from > > > > > > > > > the > > > > > > > > > > > > epoch > > > > > > > > > > > > >> of > > > > > > > > > > > > >> > > > >> >> >> __cluster_metadata, whereas the ZK > > > epoch > > > > > > comes > > > > > > > > from > > > > > > > > > > ZK. > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > 16. "Additionally, the controller > > > must > > > > > > > specify > > > > > > > > > if a > > > > > > > > > > > > >> broker > > > > > > > > > > > > >> > in > > > > > > > > > > > > >> > > > >> >> >> “LiveBrokers” > > > > > > > > > > > > >> > > > >> >> >> > is KRaft or ZK." Does that > require > > > any > > > > > > > protocol > > > > > > > > > > > changes > > > > > > > > > > > > >> to > > > > > > > > > > > > >> > > > >> >> >> UpdateMetadata? > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> Yeah, I am also curious why the we > > need > > > > to > > > > > > care > > > > > > > > > > whether > > > > > > > > > > > > >> > brokers > > > > > > > > > > > > >> > > > are > > > > > > > > > > > > >> > > > >> ZK > > > > > > > > > > > > >> > > > >> >> or > > > > > > > > > > > > >> > > > >> >> >> KRaft in UpdateMetadataRequest. We > > > don't > > > > > > reveal > > > > > > > > > this > > > > > > > > > > to > > > > > > > > > > > > >> > clients, > > > > > > > > > > > > >> > > > so > > > > > > > > > > > > >> > > > >> can > > > > > > > > > > > > >> > > > >> >> we > > > > > > > > > > > > >> > > > >> >> >> just leave this out? > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> best, > > > > > > > > > > > > >> > > > >> >> >> Colin > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > Thanks, > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > Jun > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > On Wed, Oct 5, 2022 at 10:07 AM > > > Mickael > > > > > > > Maison > > > > > > > > < > > > > > > > > > > > > >> > > > >> >> mickael.mai...@gmail.com > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> > wrote: > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> Hi David, > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> Thanks for starting this > important > > > > KIP. > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> I've just taken a quick look so > > far > > > > but > > > > > > I've > > > > > > > > > got a > > > > > > > > > > > > >> couple > > > > > > > > > > > > >> > of > > > > > > > > > > > > >> > > > >> initial > > > > > > > > > > > > >> > > > >> >> >> >> questions: > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> 1) What happens if a non KRaft > > > > > compatible > > > > > > > > broker > > > > > > > > > > (or > > > > > > > > > > > > >> with > > > > > > > > > > > > >> > > > >> >> >> >> kafka.metadata.migration.enable > > set > > > to > > > > > > > false) > > > > > > > > > > joins > > > > > > > > > > > > the > > > > > > > > > > > > >> > > cluster > > > > > > > > > > > > >> > > > >> after > > > > > > > > > > > > >> > > > >> >> >> >> the migration is triggered? > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> 2) In the Failure Modes section > > you > > > > > > mention > > > > > > > a > > > > > > > > > > > scenario > > > > > > > > > > > > >> > where > > > > > > > > > > > > >> > > a > > > > > > > > > > > > >> > > > >> write > > > > > > > > > > > > >> > > > >> >> >> >> to ZK fails. What happens when > the > > > > > > > divergence > > > > > > > > > > limit > > > > > > > > > > > is > > > > > > > > > > > > >> > > > reached? Is > > > > > > > > > > > > >> > > > >> >> >> >> this a fatal condition? How much > > > > > > divergence > > > > > > > > > should > > > > > > > > > > > we > > > > > > > > > > > > >> > allow? > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> Thanks, > > > > > > > > > > > > >> > > > >> >> >> >> Mickael > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM > > > David > > > > > > > Arthur < > > > > > > > > > > > > >> > > mum...@gmail.com > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > >> >> wrote: > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > Hey folks, I wanted to get the > > > ball > > > > > > > rolling > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > > > >> > > discussion > > > > > > > > > > > > >> > > > >> for > > > > > > > > > > > > >> > > > >> >> the > > > > > > > > > > > > >> > > > >> >> >> >> > ZooKeeper migration KIP. This > > KIP > > > > > > details > > > > > > > > how > > > > > > > > > we > > > > > > > > > > > > plan > > > > > > > > > > > > >> to > > > > > > > > > > > > >> > do > > > > > > > > > > > > >> > > > an > > > > > > > > > > > > >> > > > >> >> online > > > > > > > > > > > > >> > > > >> >> >> >> > migration of metadata from > > > ZooKeeper > > > > > to > > > > > > > > KRaft > > > > > > > > > as > > > > > > > > > > > > well > > > > > > > > > > > > >> as > > > > > > > > > > > > >> > a > > > > > > > > > > > > >> > > > >> rolling > > > > > > > > > > > > >> > > > >> >> >> >> > upgrade of brokers to KRaft > > mode. > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > The general idea is to keep > > KRaft > > > > and > > > > > > > > > ZooKeeper > > > > > > > > > > in > > > > > > > > > > > > >> sync > > > > > > > > > > > > >> > > > during > > > > > > > > > > > > >> > > > >> the > > > > > > > > > > > > >> > > > >> >> >> >> > migration, so both types of > > > brokers > > > > > can > > > > > > > > exist > > > > > > > > > > > > >> > > simultaneously. > > > > > > > > > > > > >> > > > >> Then, > > > > > > > > > > > > >> > > > >> >> >> >> > once everything is migrated > and > > > > > updated, > > > > > > > we > > > > > > > > > can > > > > > > > > > > > turn > > > > > > > > > > > > >> off > > > > > > > > > > > > >> > > > >> ZooKeeper > > > > > > > > > > > > >> > > > >> >> >> >> > writes. > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > This is a pretty complex KIP, > so > > > > > please > > > > > > > > take a > > > > > > > > > > > look > > > > > > > > > > > > :) > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > > >> > > > >> >> >> >> > Thanks! > > > > > > > > > > > > >> > > > >> >> >> >> > David > > > > > > > > > > > > >> > > > >> >> >> >> > > > > > > > > > > > > >> > > > >> >> >> > > > > > > > > > > > > >> > > > >> >> > > > > > > > > > > > > >> > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > -- > > > > > > > > > > > > >> > > > > -David > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > -- > > > > > > > > > > > > >> > > -David > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > -David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > -David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > -David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > -David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -David > > > > > > > > > > > > > > > > > > > > > > > -- > > > > -David > > > > > > > > > > > > -- > -David >