On Fri, Nov 11, 2022, at 08:59, David Arthur wrote:
> Thanks, Colin.
>
>> never start an upgrade without first verifying the quorum configuration on
>> the ZK-based brokers
>
> I agree that this is a pretty big benefit. I could imagine debugging
> and fixing connection problems mid-migration would be a big pain.
> Especially if you had some brokers correctly configured, and others
> not.
>
> Adding a heartbeat raises some questions about what to do if a broker
> goes into a bad state, or stops heartbeating, during a migration.
> However, I think the same is true for a registration based approach,
> so maybe it's not an increase in net complexity.
>
Hi David,

Yeah. I think the goal should be for the set of heartbeaters to match the
set of broker registrations under /brokers. Obviously, people could add or
remove brokers after the upgrade has begun, but that's unavoidable, I think.
We can at least ensure that at the time we enter the upgrade, all the
brokers are ready.

> I've replaced the ZK registration section with a new RPC and brief
> description. Please take a look.
>

Thanks, David. With these changes it looks good to me. +1 (binding)

Colin

> Thanks!
> David
>
> On Wed, Nov 9, 2022 at 5:46 PM Colin McCabe <cmcc...@apache.org> wrote:
>>
>> Hi David,
>>
>> Thanks for the response. Replies inline.
>>
>> On Wed, Nov 9, 2022, at 08:17, David Arthur wrote:
>> > Colin
>> >
>> >> Maybe zk.metadata.migration.enable ?
>> >
>> > Done. I went with "zookeeper.metadata.migration.enable" since our
>> > other ZK configs start with "zookeeper.*"
>> >
>> >> Similarly, for MigrationRecord: can we rename this to
>> >> ZkMigrationStateRecord? Then change MigrationState -> ZkMigrationState.
>> >
>> > Sure
>> >
>> >> With ZkMigrationStateRecord, one thing to keep in mind here is that we
>> >> will eventually compact all the metadata logs into a snapshot. That
>> >> snapshot will then have to keep alive the memory of the old migration. So
>> >> it is not really a matter of replaying the old metadata logs (probably)
>> >> but a matter of checking to see what the ZkMigrationState is, which I
>> >> suppose could be Optional<ZkMigrationState>. If it's not Optional.empty,
>> >> we already migrated / are migrating.
>> >
>> > Yea, makes sense.
>> >
>> >> For the /migration ZNode, is "last_update_time_ms" necessary? I thought
>> >> ZK already tracked this information in the mzxid of the znode?
>> >
>> > Yes, Jun pointed this out previously, I missed this update in the KIP.
>> > Fixed now.
>> >
>> >> It is true that technically it is only needed in UMR, but I would still
>> >> suggest including KRaftControllerId in LeaderAndIsrRequest because it
>> >> will make debugging much easier.
>> >>
>> >> I would suggest not implementing the topic deletion state machine, but
>> >> just deleting topics eagerly when in migration mode. We can implement
>> >> this behavior change by keying off of whether KRaftControllerId is
>> >> present in LeaderAndIsrRequest. On broker startup, we'll be sent a full
>> >> LeaderAndIsrRequest and can delete stray partitions whose IDs are not as
>> >> expected (again, this behavior change would only be for migration mode).
>> >
>> > Sounds good to me. Since this is somewhat of an implementation detail,
>> > do you think we need this included in the KIP?
>>
>> Yeah, maybe we don't need to go into the delete behavior here. But I think
>> the KIP should specify that we have KRaftControllerId in both
>> UpdateMetadataRequest and LeaderAndIsrRequest. That will allow us to
>> implement this behavior conditionally on ZK-based brokers when in dual
>> write mode.
>>
>> >> For existing KRaft controllers, will
>> >> kafka.controller:type=KafkaController,name=MigrationState show up as 4
>> >> (MigrationFinalized)? I assume this is true, but it would be good to
>> >> spell it out. Sorry if this is answered somewhere else.
>> >
>> > We discussed using 0 (None) as the value to report for original,
>> > un-migrated KRaft clusters. 4 (MigrationFinalized) would be for
>> > clusters which underwent a migration. I have some description of this
>> > in the table under "Migration Overview".
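As a quick illustration of those two reported values, here is a minimal
sketch of how they might be modeled as an enum. Only None = 0 and
MigrationFinalized = 4 come from this thread; the in-progress states live in
the KIP's "Migration Overview" table and are left out here, so treat this as
a hypothetical rendering rather than the KIP's actual definition.

    // Hypothetical illustration only; not the KIP's actual definition.
    public enum ZkMigrationState {
        NONE((byte) 0),                  // KRaft cluster that never went through a ZK migration
        MIGRATION_FINALIZED((byte) 4);   // KRaft cluster that completed a ZK-to-KRaft migration

        private final byte value;

        ZkMigrationState(byte value) {
            this.value = value;
        }

        public byte value() {
            return value;
        }
    }

A gauge reporting value() would then let an aggregate across many clusters
distinguish never-migrated KRaft clusters from migrated ones, which is the
distinction discussed next.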
>> >
>>
>> I don't feel that strongly about this, but wouldn't it be a good idea for
>> MigrationState to have a different value for ZK-based clusters and
>> KRaft-based clusters? If you have a bunch of clusters and you take an
>> aggregate of this metric, it would be good to get a report of three numbers:
>> 1. unupgraded ZK
>> 2. in progress upgrades
>> 3. kraft
>>
>> I guess we could get that from examining some other metrics too, though. Not
>> sure, what do you think?
>>
>> >> As you point out, the ZK brokers being upgraded will need to contact the
>> >> KRaft quorum in order to forward requests there, once we are in
>> >> migration mode. This raises a question: rather than changing the broker
>> >> registration, can we have those brokers send an RPC to the KRaft
>> >> controller quorum instead? This would serve to confirm that they can
>> >> reach the quorum. Then the quorum could wait for all of the brokers to
>> >> check in this way before starting the migration (it would know all the
>> >> brokers by looking at /brokers).
>> >
>> > One of the motivations I had for putting the migration details in the
>> > broker registration is that it removes ordering constraints between
>> > the brokers and controllers when starting the migration. If we set the
>> > brokers in migration mode before the KRaft quorum is available, they
>> > will just carry on until the KRaft controller takes over the
>> > controller leadership and starts sending UMR/LISR.
>> >
>> I agree that any scheme that requires complicated startup ordering is not a
>> good idea. I was more thinking about having the ZK-based brokers
>> periodically send some kind of "I'm here" upgrade heartbeat to the
>> controller quorum.
>>
>> > I wonder if we could infer reachability of the new KRaft controller by
>> > the brokers through ApiVersions? Once it takes leadership, the active
>> > KRaft controller could monitor its NodeApiVersions to see that each of
>> > the ZK brokers had reached it. Would this be enough to verify
>> > reachability?
>> >
>> If the new KRaft controller has already taken leadership, it's too late to
>> do anything about brokers that can't reach it. Unless we want to implement
>> automatic rollback to the pre-upgrade state, which sounds quite complex and
>> error-prone.
>>
>> So, to sum up: my proposal was that the ZK-based brokers periodically (like
>> maybe every 10 seconds) try to contact the quorum with an RPC containing
>> their IBP and readiness state. I guess the main negative thing is that it
>> would be more network traffic (but probably not significantly more). The
>> main positive thing is that we'd never start an upgrade without first
>> verifying the quorum configuration on the ZK-based brokers. A neutral thing
>> (neither good nor bad, I think) is that we'd have a new RPC rather than a
>> znode change.
>>
>> best,
>> Colin
>>
>> > -David
>> >
>> > On Tue, Nov 8, 2022 at 6:15 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >>
>> >> Hi David,
>> >>
>> >> Looks great. Some questions:
>> >>
>> >> I agree with Jun that it would be good to rename
>> >> metadata.migration.enable to something more zk-specific. Maybe
>> >> zk.metadata.migration.enable ?
>> >>
>> >> Similarly, for MigrationRecord: can we rename this to
>> >> ZkMigrationStateRecord? Then change MigrationState -> ZkMigrationState.
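For context on the renamed flag, here is a rough sketch of the extra
settings a ZK-mode broker might carry during the migration. Only
zookeeper.metadata.migration.enable comes from this thread; the
quorum-related properties are ordinary KRaft settings, and every host,
port, and node id below is a placeholder, so check the KIP for the
authoritative list.

    import java.util.Properties;

    // Hypothetical sketch; property names other than
    // zookeeper.metadata.migration.enable should be checked against the KIP.
    public class ZkBrokerMigrationConfig {
        public static Properties sketch() {
            Properties props = new Properties();
            // The renamed flag discussed above, following the "zookeeper.*" convention.
            props.setProperty("zookeeper.metadata.migration.enable", "true");
            props.setProperty("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
            // The ZK brokers also need to reach the KRaft quorum once the
            // migration begins, so they carry the usual quorum settings.
            props.setProperty("controller.quorum.voters",
                "3000@controller1:9093,3001@controller2:9093,3002@controller3:9093");
            props.setProperty("controller.listener.names", "CONTROLLER");
            return props;
        }
    }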
>> >>
>> >> With ZkMigrationStateRecord, one thing to keep in mind here is that we
>> >> will eventually compact all the metadata logs into a snapshot. That
>> >> snapshot will then have to keep alive the memory of the old migration. So
>> >> it is not really a matter of replaying the old metadata logs (probably)
>> >> but a matter of checking to see what the ZkMigrationState is, which I
>> >> suppose could be Optional<ZkMigrationState>. If it's not Optional.empty,
>> >> we already migrated / are migrating.
>> >>
>> >> For the /migration ZNode, is "last_update_time_ms" necessary? I thought
>> >> ZK already tracked this information in the mzxid of the znode?
>> >>
>> >> It is true that technically it is only needed in UMR, but I would still
>> >> suggest including KRaftControllerId in LeaderAndIsrRequest because it
>> >> will make debugging much easier.
>> >>
>> >> I would suggest not implementing the topic deletion state machine, but
>> >> just deleting topics eagerly when in migration mode. We can implement
>> >> this behavior change by keying off of whether KRaftControllerId is
>> >> present in LeaderAndIsrRequest. On broker startup, we'll be sent a full
>> >> LeaderAndIsrRequest and can delete stray partitions whose IDs are not as
>> >> expected (again, this behavior change would only be for migration mode).
>> >>
>> >> For existing KRaft controllers, will
>> >> kafka.controller:type=KafkaController,name=MigrationState show up as 4
>> >> (MigrationFinalized)? I assume this is true, but it would be good to
>> >> spell it out. Sorry if this is answered somewhere else.
>> >>
>> >> As you point out, the ZK brokers being upgraded will need to contact the
>> >> KRaft quorum in order to forward requests there, once we are in
>> >> migration mode. This raises a question: rather than changing the broker
>> >> registration, can we have those brokers send an RPC to the KRaft
>> >> controller quorum instead? This would serve to confirm that they can
>> >> reach the quorum. Then the quorum could wait for all of the brokers to
>> >> check in this way before starting the migration (it would know all the
>> >> brokers by looking at /brokers).
>> >>
>> >> best,
>> >> Colin
>> >>
>> >> On Tue, Nov 8, 2022, at 07:23, David Arthur wrote:
>> >> > Hello everyone, I'd like to start the vote on KIP-866.
>> >> >
>> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
>> >> >
>> >> > Thanks!
>> >> > David Arthur
>> >
>> > --
>> > David Arthur
>
> --
> David Arthur
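To make the check that was agreed on in this thread concrete, here is a
minimal sketch of the controller-side gate it implies: before entering
migration mode, the active KRaft controller proceeds only once every broker
registered under /brokers in ZooKeeper has checked in via the new RPC. The
class and method names are illustrative, not the KIP's actual API, and the
heartbeat payload (IBP, readiness state) is reduced to a simple set
membership test.

    import java.util.Set;

    // Illustrative sketch only; names and structure are not taken from the KIP.
    public final class MigrationReadinessCheck {
        /**
         * Returns true once every broker registered under /brokers in ZooKeeper
         * has sent the new migration heartbeat/registration RPC to the KRaft
         * quorum, i.e. the set of heartbeaters covers the set of registrations.
         */
        public static boolean readyToBeginMigration(Set<Integer> zkRegisteredBrokerIds,
                                                    Set<Integer> heartbeatingBrokerIds) {
            return heartbeatingBrokerIds.containsAll(zkRegisteredBrokerIds);
        }
    }

Brokers added or removed after the upgrade has begun would change the
registration set, which is why this is only a gate for entering the
migration rather than an ongoing invariant.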