Jun,

20/21. A KRaft controller will recover the migration state by reading the
"/migration" ZNode. If the migration enable config is set and the ZK
migration is complete, it will enter dual-write mode. Before an operator
can decommission ZK, they will need to finalize the migration, which
involves removing the migration config and the ZK config. I'll clarify
this in the KIP.
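
To make that recovery logic concrete, here is a rough sketch (not the
actual implementation; the class, enum, and method names are made up, and
only the "/migration" ZNode path and the migration enable config come from
the KIP) of how a controller could decide its mode on startup:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class MigrationRecoverySketch {
        enum Mode { MIGRATING, DUAL_WRITE, KRAFT_ONLY }

        // Decide the controller's mode on startup. "migrationEnabled" stands in
        // for kafka.metadata.migration.enable; the "/migration" ZNode is written
        // once the initial copy of ZK metadata has completed.
        static Mode recoverMode(ZooKeeper zk, boolean migrationEnabled)
                throws KeeperException, InterruptedException {
            if (!migrationEnabled) {
                // Migration not enabled: run as a normal KRaft controller
                // (no dual writes back to ZK).
                return Mode.KRAFT_ONLY;
            }
            boolean zkCopyComplete = zk.exists("/migration", false) != null;
            // Simplification: the real controller would keep richer state in
            // the ZNode payload rather than just checking existence.
            return zkCopyComplete ? Mode.DUAL_WRITE : Mode.MIGRATING;
        }
    }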

22. Yeah, we could see an incorrect broker ID during that window. If we
ended up in a state where a ZK broker ID conflicted with a KRaft broker ID,
we would need to fence one of them. I would probably opt to fence the KRaft
broker in that case, since broker registration and fencing are more robust
in KRaft. Hopefully this is a rare case.
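
As a rough illustration of that rule (all names here are hypothetical, not
taken from the controller code), the conflict handling could look something
like this:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class BrokerIdConflictSketch {
        // Given the broker IDs registered in ZK and the fencing state of the
        // KRaft registrations, return the KRaft brokers to fence because a ZK
        // broker claimed the same ID during the controller failover window.
        static Set<Integer> kraftBrokersToFence(Set<Integer> zkBrokerIds,
                                                Map<Integer, Boolean> kraftBrokerIsFenced) {
            Set<Integer> toFence = new HashSet<>();
            for (Integer id : zkBrokerIds) {
                Boolean fenced = kraftBrokerIsFenced.get(id);
                if (fenced != null && !fenced) {
                    // Prefer fencing the KRaft side, since registration and
                    // fencing are more robust in KRaft.
                    toFence.add(id);
                }
            }
            return toFence;
        }
    }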

26. Sounds good.

Thanks!
David


On Thu, Oct 27, 2022 at 1:34 PM Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, David,
>
> Thanks for the reply. A few more comments.
>
> 20/21. Using a tagged field in ApiVersionRequest could work. Related to
> this, how does a KRaft controller know that it's in the dual write mode?
> Does it need to read the /controller path from ZK? After the migration,
> people may have the ZK cluster decommissioned, but still have the ZK
> configs left in the KRaft controller. Will this cause the KRaft controller
> to be stuck because it doesn't know which mode it is in?
>
> 22. Using the ephemeral node matches the current ZK-based broker behavior
> better. However, it leaves a window for incorrect broker registration to
> sneak in during KRaft controller failover.
>
> 26. Then, we could just remove Broker Registration in that section.
>
> Thanks,
>
> Jun
>
> On Wed, Oct 26, 2022 at 2:21 PM David Arthur
> <david.art...@confluent.io.invalid> wrote:
>
> > Jun
> >
> > 20/21 It could definitely cause problems if we fail over to a controller
> > without "kafka.metadata.migration.enable". The only mechanism I know of
> for
> > controllers to learn things about one another is ApiVersions. We
> currently
> > use this for checking support for "metadata.version" (in KRaft mode). We
> > could add a "zk.migration" feature flag that's enabled on a controller
> only
> > if the config is set. Another possibility would be a tagged field on
> > ApiVersionResponse that indicated if the config was set. Both seem
> somewhat
> > inelegant. I think a tagged field would be a bit simpler (and arguably
> less
> > hacky).
> >
> > For 20, we could avoid entering the migration state unless the whole
> quorum
> > had the field present in their NodeApiVersions. For 21, we could avoid
> > leaving the migration state unless the whole quorum did not have the
> field
> > in their NodeApiVersions. Do you think this would be sufficient?
> >
> > 22. Right, we need to write the broker info back to ZK just as a
> safeguard
> > against incorrect broker IDs getting registered into ZK. I was thinking
> > these would be persistent nodes, but it's probably fine to make them
> > ephemeral and have the active KRaft controller keep them up to date.
> >
> > 23. Right. When the broker comes up as a KRaft broker, its old
> > /brokers/ids ZNode will be gone and it will register itself with the
> KRaft
> > controller. The controller will know that it is now in KRaft mode and
> will
> > stop sending it the old RPCs.
> >
> > 24. Ok, I'll add these
> >
> > 25. I realize now the "/controller_epoch" node is already persistent. It
> > should be sufficient to remove the "/controller" node to trigger an
> > election.
> >
> > 26. Hmm, not sure, but I don't think it uses watches. KafkaServer
> registers
> > the broker info and later loads all brokers in the cluster to check that
> > the endpoints don't conflict. Things that do use watches are dynamic
> > configs, ACLs, and some others (listed in the KIP).
> >
> >
> > On Wed, Oct 26, 2022 at 4:26 PM David Arthur <david.art...@confluent.io>
> > wrote:
> >
> > > Luke and Andrew, thanks for taking a look!
> > >
> > > I think the names of the state machine in the KIP aren't the best. I'll
> > > try to improve that section in general.
> > >
> > > 1. "MigrationReady" is probably better named "MigratingFromZk" or
> > > something. It's meant to be the state when the KRaft controller is
> > actively
> > > migrating the data out of ZK. This happens after we have detected that
> > the
> > > cluster is eligible and before we enter the dual write mode.
> > >
> > > 2. "MigrationActive" should probably be called "BrokerMigration".
> > > Transitioning to "MigrationFinished" can happen automatically when all
> > the
> > > known ZK brokers have been migrated. Since we don't have permanent
> > > registrations of ZK brokers, we can use the partition assignments as a
> > > proxy for this.
> > >
> > > 3. A metric for unready brokers makes sense. We can also log a message
> on
> > > the controller when it tries to start the migration
> > >
> > > 4. The MetadataType metric is meant to help see this. E.g., some
> brokers
> > > would have MetadataType=ZK, some would have MetadataType=Dual
> > >
> > > 5. On ZK brokers, not having this config set would prevent the new
> broker
> > > registration data from being written to ZK. This means the KRaft
> > controller
> > > won't send RPCs to it. If a broker has been migrated to KRaft already,
> > I'm
> > > not sure there is any harm in removing this config. If we decide that
> we
> > > need to guarantee that KRaft brokers have this config set during the
> > > migration, we can include it in the broker registration that's sent to
> > > KRaft. This would let the controller keep that broker fenced.
> > >
> > > 6. Once the KRaft controller takes leadership, the ZK controller won't
> be
> > > active any more and will stop reporting the MigrationState metric.
> > >
> > > 7. The "Dual" MetadataType is reported by brokers running in KRaft mode
> > > when the migration enable config is set. I think the controller should
> > also
> > > report this, not just the brokers. I'll clarify in the KIP.
> > >
> > >
> > >
> > > Andrew:
> > >
> > > > How will the code get all the ZooKeeper config? Will it do some sort
> of
> > > scan of the ZooKeeper data store? ... Do we need to do anything special
> > to
> > > make sure we get all data from ZooKeeper?
> > >
> > > With a few exceptions, all data written to ZK by Kafka happens on the
> > > controller (single writer principle). In our migration code, we can
> > > unconditionally update the "/controller" and "/controller_epoch" ZNodes
> > and
> > > effectively force the migration component to gain leadership of the ZK
> > > controller. Once this happens, we don't expect any data to be written
> to
> > > ZK, so we can read it iteratively without worrying about any concurrent
> > > writes.
> > >
> > > As for the linearizable thing, I believe this just means that reads may
> > be
> > > served by a quorum follower which has stale data. We'll be reading from
> > ZK
> > > the same way the ZK controller does, so I think we will be fine
> regarding
> > > consistency. If we wanted to be extra careful, we could add a delay
> prior
> > > to iterating through the znodes to give partitioned ZK followers a
> chance
> > > to get kicked out of the quorum. I don't think we'll need that though
> > >
> > > > What happens if we commit the transaction then fail right after that
> > and
> > > restart?
> > >
> > > If we commit the "migration" transaction to the metadata log, it won't
> be
> > > a problem if we fail over or restart. We can recover our state based on
> > the
> > > metadata log and what exists in ZK. If we see a migration transaction
> in
> > > the kraft log, but no "/migration" ZNode, we'll know we failed before
> > > writing to ZK. If we see a partial transaction in the log, then we can
> > > abort it and restart the migration.
> > >
> > > > This sort of leads to me wondering if we will/should check the
> metadata
> > > log state before doing the migration? That could also catch mistakes
> such
> > > as if a KRaft quorum is used that already had some metadata which I
> > assume
> > > we want to prevent.
> > >
> > > Yes, we should check that the metadata log is empty before attempting a
> > > migration (perhaps excepting the bootstrap metadata -- need to think on
> > > that one still).
> > >
> > >
> > >
> > >
> > > On Mon, Oct 24, 2022 at 8:57 AM Andrew Grant
> <agr...@confluent.io.invalid
> > >
> > > wrote:
> > >
> > >> Hey David,
> > >>
> > >>
> > >> Thanks for the KIP. I had a few small questions.
> > >>
> > >>
> > >> “The ZK data migration will copy the existing ZK data into the KRaft
> > >> metadata log and establish the new KRaft active controller as the
> active
> > >> controller from a ZK perspective.”
> > >>
> > >> How will the code get all the ZooKeeper config? Will it do some sort
> of
> > >> scan of the ZooKeeper data store? Also I’m not a ZooKeeper expert but
> > per
> > >> https://zookeeper.apache.org/doc/current/zookeeperInternals.html
> “Read
> > >> operations in ZooKeeper are *not linearizable* since they can return
> > >> potentially stale data.” Do we need to do anything special to make
> sure
> > we
> > >> get all data from ZooKeeper?
> > >>
> > >>
> > >> “For the initial migration, the controller will utilize KIP-868
> Metadata
> > >> Transactions
> > >> <
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-868+Metadata+Transactions
> > >> >
> > >> to
> > >> write all of the ZK metadata in a single transaction. If the
> controller
> > >> fails before this transaction is finalized, the next active controller
> > >> will
> > >> abort the transaction and restart the migration process.” What happens
> > if
> > >> we commit the transaction then fail right after that and restart? This
> > >> sort
> > >> of leads to me wondering if we will/should check the metadata log
> state
> > >> before doing the migration? That could also catch mistakes such as if
> a
> > >> KRaft quorum is used that already had some metadata which I assume we
> > want
> > >> to prevent.
> > >>
> > >>
> > >> For the Test Plan section do you think it’s worth calling out
> > performance
> > >> testing migrations of ZooKeeper deployments that have “large” amounts
> of
> > >> metadata?
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> Andrew
> > >>
> > >> On Mon, Oct 24, 2022 at 3:20 AM Luke Chen <show...@gmail.com> wrote:
> > >>
> > >> > Hi David,
> > >> >
> > >> > Thanks for the good and complicated proposal! :)
> > >> > Some questions:
> > >> >
> > >> > 1. "MigrationReady" state: The KIP said:
> > >> > The KRaft quorum has been started
> > >> > I don't think we'll automatically enter this state after KRaft
> quorum
> > >> > started, do we?
> > >> > I think KRaft active controller should take over leadership by
> > writing a
> > >> > znode in /controller path before we enter this state, is that
> > correct?
> > >> >
> > >> > 2. "MigrationActive" state: the KIP said:
> > >> > ZK state has been migrated, controller is in dual-write mode,
> brokers
> > >> are
> > >> > being restarted in KRaft mode
> > >> > What confuses me is the last part: "brokers are being restarted in
> > KRaft
> > >> > mode".
> > >> > How could we detect brokers are being restarted in KRaft mode? Old
> ZK
> > >> > broker is removed and new KRaft broker is up within N minutes?
> > >> > I think we don't have to rely on the condition "brokers are being
> > >> restarted
> > >> > in KRaft mode" to enter this state.
> > >> > "brokers are being restarted in KRaft mode" should be a process
> > happened
> > >> > between "MigrationActive" and "MigrationFinished". Does that make
> > sense?
> > >> >
> > >> > 3. When "Zookeeper" mode trying to enter "MigrationEligible", if
> there
> > >> is a
> > >> > cluster with tens of brokers, how could users know why this cluster
> > >> cannot
> > >> > be in "MigrationEligible" state, yet? Check that znode manually one
> by
> > >> one?
> > >> > Or do we plan to create a tool to help them? Or maybe expose the
> > >> "unready
> > >> > ZK brokers" metrics?
> > >> >
> > >> > 4. Same for "MigrationActive" entering "MigrationFinished" state.
> > Since
> > >> > that will be some process for restarting ZK broker into KRaft broker
> > >> one by
> > >> > one, could we expose the "remaining ZK brokers" as metrics?
> > >> >
> > >> > 5. When users are in
> > >> "MigrationReady"/"MigrationActive"/"MigrationFinished"
> > >> > states, and they accidentally change the KRaft controller config:
> > >> > "kafka.metadata.migration.enable"
> > >> > to false. What will happen to this cluster? No dual-write for it?
> Will
> > >> we
> > >> > have any protection for it?
> > >> >
> > >> > 6. About the "MigrationState" metric:
> > >> > The "ZooKeeper" and "MigrationEligible" is reported by ZK
> controller,
> > >> and
> > >> > the rest of states are reported by KRaft controller -->  makes sense
> > to
> > >> me.
> > >> > One question from it is, when KRaft controller takes over the
> > leadership
> > >> > from ZK controller, what will the "MigrationState" value in old ZK
> > >> > controller? keep in "MigrationEligible" doesn't make sense. Will
> there
> > >> be a
> > >> > empty or null state?
> > >> >
> > >> > 7. About the "MetadataType" metric:
> > >> > An enumeration of: ZooKeeper (1), Dual (2), KRaft (3). Each broker
> > >> reports
> > >> > this.
> > >> > I don't know how we could map the migration state to these 3 types.
> > >> > What is the metadataType when cluster in "MigrationReady" state?
> Still
> > >> > Zookeeper?
> > >> > When will brokers enter Dual type?
> > >> > This is unclear in the KIP.
> > >> >
> > >> > Thank you.
> > >> > Luke
> > >> >
> > >> > On Thu, Oct 20, 2022 at 11:33 PM David Arthur
> > >> > <david.art...@confluent.io.invalid> wrote:
> > >> >
> > >> > > Igor, thanks for taking a look! Since JBOD in KRaft is still under
> > >> > > discussion and not likely to land before the ZK migration, I think
> > >> we'll
> > >> > > need to defer it. For migrating JBOD clusters from ZK to KRaft,
> > we'll
> > >> > also
> > >> > > need to address the log dir failure mechanism which currently
> uses a
> > >> > > special ZNode written to by the brokers. There is an old KIP
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > >> > > which proposes a new RPC to move that ZK write to the controller,
> > but
> > >> I'm
> > >> > > not sure if that's the approach we'll want to take. I read through
> > the
> > >> > > discussion over in KIP-858 and it sounds like there are some good
> > >> ideas
> > >> > > there.
> > >> > >
> > >> > > To answer your question more directly, migrating ZK clusters using
> > >> JBOD
> > >> > is
> > >> > > out of scope for this KIP. It might be possible to stop using JBOD
> > >> using
> > >> > > reassignments, but I'm not sure about that.
> > >> > >
> > >> > > -David
> > >> > >
> > >> > > On Tue, Oct 18, 2022 at 12:17 PM Igor Soarez <i...@soarez.me> wrote:
> > >> > >
> > >> > > > Hi David,
> > >> > > >
> > >> > > > Thanks for the KIP, this is very exciting!
> > >> > > >
> > >> > > > How does JBOD relate to this work? KRaft mode doesn't yet
> support
> > >> > > > configuring brokers with multiple log directories. If the
> brokers
> > in
> > >> > the
> > >> > > > existing cluster are configured with multiple log dirs, does the
> > >> > > migration
> > >> > > > imply that the existing brokers need to drop use of that
> feature?
> > >> Or is
> > >> > > > there some way to upgrade them later?
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > --
> > >> > > > Igor
> > >> > > >
> > >> > > > On Mon, Oct 17, 2022, at 10:07 PM, David Arthur wrote:
> > >> > > > > I've updated the KIP with the following changes (the
> confluence
> > >> diff
> > >> > is
> > >> > > > not
> > >> > > > > helpful here since I rearranged some things)
> > >> > > > >
> > >> > > > > * Added ZooKeeperBlockingKRaftMillis metric
> > >> > > > > * Added section on new broker registration JSON
> > >> > > > > * Removed section on MigrationCheck RPC
> > >> > > > > * Added change to UpdateMetadataRequest
> > >> > > > > * Added section "Additional ZK Broker Configs" (includes
> configs
> > >> to
> > >> > > > connect
> > >> > > > > to KRaft quorum)
> > >> > > > > * Added section on "Incompatible Brokers" under Failure Modes
> > >> > > > > * Clarified many things per this discussion thread
> > >> > > > >
> > >> > > > > I realized we need the KRaft controller to pick the correct
> > >> > > > > "metadata.version" when initializing the migration. I included
> > the
> > >> > IBP
> > >> > > > of a
> > >> > > > > broker in its registration data so the KRaft controller can
> > verify
> > >> > the
> > >> > > > IBP
> > >> > > > > and pick the correct "metadata.version" when starting the
> > >> migration.
> > >> > > > > Otherwise, we could inadvertently downgrade the
> > >> IBP/metadata.version
> > >> > as
> > >> > > > > part of the migration.
> > >> > > > >
> > >> > > > > I also added "clusterId" to the broker registration data so
> the
> > >> KRaft
> > >> > > > controller could
> > >> > > > > verify it.
> > >> > > > >
> > >> > > > > -David
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <
> > cmcc...@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > >> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
> > >> > > > >> > Hi, Colin,
> > >> > > > >> >
> > >> > > > >> > 10. "That all goes away in the new mode, and we just have
> > some
> > >> > code
> > >> > > > which
> > >> > > > >> > analyzes __cluster_metadata and reflects it in 1) updates
> to
> > ZK
> > >> > and
> > >> > > 2)
> > >> > > > >> > messages sent out to brokers."
> > >> > > > >> > Hmm, I am not sure it's that simple. Some of the complexity
> > of
> > >> the
> > >> > > > >> ZK-based
> > >> > > > >> > controller are (1) using the pipelining approach to write
> to
> > ZK
> > >> > for
> > >> > > > >> better
> > >> > > > >> > throughput and using conditional writes for correctness;
> (2)
> > >> > sending
> > >> > > > the
> > >> > > > >> > proper LeaderAndIsr and UpdateMetadata requests. For
> example,
> > >> > during
> > >> > > > >> > controller failover, the full metadata needs to be sent
> while
> > >> > during
> > >> > > > >> > individual broker failure, only some of the metadata needs
> to
> > >> be
> > >> > > > updated.
> > >> > > > >> > The controlled shutdown handling sometimes uses
> > >> StopReplicaRequest
> > >> > > > and
> > >> > > > >> > some other times uses LeaderAndIsrRequest. (3) triggering
> new
> > >> > events
> > >> > > > >> based
> > >> > > > >> > on the responses of LeaderAndIsr (e.g. for topic deletion).
> > >> Some
> > >> > of
> > >> > > > those
> > >> > > > >> > complexity could be re-implemented in a more efficient way,
> > >> but we
> > >> > > > need
> > >> > > > >> to
> > >> > > > >> > be really careful not to generate regression. Some of the
> > other
> > >> > > > >> complexity
> > >> > > > >> > just won't go away. Reimplementing all those logic for the
> 30
> > >> or
> > >> > so
> > >> > > > >> events
> > >> > > > >> > in the ZK-based controller is possible, but seems a bit
> > >> daunting
> > >> > and
> > >> > > > >> risky.
> > >> > > > >>
> > >> > > > >> Hi Jun,
> > >> > > > >>
> > >> > > > >> Thanks for the response.  I agree that there is some work
> here
> > >> but I
> > >> > > > don't
> > >> > > > >> think it's as bad as it might seem. Most of the code for
> > writing
> > >> to
> > >> > ZK
> > >> > > > can
> > >> > > > >> be reused from the old controller. We should at least
> evaluate
> > >> using
> > >> > > the
> > >> > > > >> old ControllerChannelManager as well. I don't know if we'll
> end
> > >> up
> > >> > > > doing it
> > >> > > > >> or not but it's a possibility. (The main reason to not do it
> is
> > >> that
> > >> > > the
> > >> > > > >> response handling will be a bit different)
> > >> > > > >>
> > >> > > > >> The question of what to do during a controller failover is an
> > >> > > > interesting
> > >> > > > >> one. Technically, we should be able to continue sending
> > >> incremental
> > >> > > > updates
> > >> > > > >> to the legacy nodes, for the same reason we can in KRaft
> mode.
> > >> > > However,
> > >> > > > >> probably we should just copy what ZK mode does and send them
> a
> > >> full
> > >> > > > >> metadata update. This will allow us to have an easy way to
> > handle
> > >> > > > >> divergences caused by lost RPCs by changing the controller
> > (just
> > >> as
> > >> > we
> > >> > > > do
> > >> > > > >> in ZK mode, unfortunately). I suppose we should document this
> > in
> > >> the
> > >> > > > KIP...
> > >> > > > >>
> > >> > > > >> I agree controlled shutdown is tricky to get just right. I
> > >> suppose
> > >> > > this
> > >> > > > is
> > >> > > > >> a case where the RPCs we send out are not purely "fire and
> > >> forget";
> > >> > we
> > >> > > > have
> > >> > > > >> to listen for the response. But that can be done in an
> > >> event-based
> > >> > > way.
> > >> > > > >> Controlled shutdown is also probably the last thing we'll
> > >> implement
> > >> > > > once we
> > >> > > > >> have the basic lift and shift.
> > >> > > > >>
> > >> > > > >> Topic deletion is another annoying thing. I wonder if we
> could
> > >> just
> > >> > > > delete
> > >> > > > >> immediately here and skip the complexity of implementing
> > >> "deleting
> > >> > > > state."
> > >> > > > >> Topic IDs will exist in IBP 3.4, in both ZK and KRaft mode,
> so
> > it
> > >> > > > should be
> > >> > > > >> OK, right?
> > >> > > > >>
> > >> > > > >> best,
> > >> > > > >> Colin
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> >
> > >> > > > >> > Thanks,
> > >> > > > >> >
> > >> > > > >> > Jun
> > >> > > > >> >
> > >> > > > >> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <
> > >> cmcc...@apache.org>
> > >> > > > wrote:
> > >> > > > >> >
> > >> > > > >> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
> > >> > > > >> >> > Hi, Colin,
> > >> > > > >> >> >
> > >> > > > >> >> > Thanks for the reply.
> > >> > > > >> >> >
> > >> > > > >> >> > 10. This is a bit on the implementation side. If you
> look
> > at
> > >> > the
> > >> > > > >> existing
> > >> > > > >> >> > ZK-based controller, most of the logic is around
> > >> maintaining an
> > >> > > > >> in-memory
> > >> > > > >> >> > state of all the resources (broker, topic, partition,
> > etc),
> > >> > > > >> >> reading/writing
> > >> > > > >> >> > to ZK, sending LeaderAndIsr and UpdateMetadata requests
> > and
> > >> > > > handling
> > >> > > > >> the
> > >> > > > >> >> > responses to brokers. So we need all that logic in the
> > dual
> > >> > write
> > >> > > > >> mode.
> > >> > > > >> >> One
> > >> > > > >> >> > option is to duplicate all that logic in some new code.
> > This
> > >> > can
> > >> > > > be a
> > >> > > > >> bit
> > >> > > > >> >> > error prone and makes the code a bit harder to maintain
> if
> > >> we
> > >> > > need
> > >> > > > to
> > >> > > > >> fix
> > >> > > > >> >> > some critical issues in ZK-based controllers. Another
> > >> option is
> > >> > > to
> > >> > > > try
> > >> > > > >> >> > reusing the existing code in the ZK-based controller.
> For
> > >> > > example,
> > >> > > > we
> > >> > > > >> >> could
> > >> > > > >> >> > start the EventManager in the ZK-based controller, but
> let
> > >> the
> > >> > > > KRaft
> > >> > > > >> >> > controller ingest new events. This has its own
> challenges:
> > >> (1)
> > >> > > the
> > >> > > > >> >> existing
> > >> > > > >> >> > logic only logs ZK failures and doesn't expose them to
> the
> > >> > > caller;
> > >> > > > (2)
> > >> > > > >> >> the
> > >> > > > >> >> > existing logic may add new events to the queue itself
> and
> > we
> > >> > > > probably
> > >> > > > >> >> need
> > >> > > > >> >> > to think through how this is coordinated with the KRaft
> > >> > > controller;
> > >> > > > >> (3)
> > >> > > > >> >> it
> > >> > > > >> >> > registers some ZK listeners unnecessarily (may not be a
> > big
> > >> > > > concern).
> > >> > > > >> So
> > >> > > > >> >> we
> > >> > > > >> >> > need to get around those issues somehow. I am wondering
> if
> > >> we
> > >> > > have
> > >> > > > >> >> > considered both options and which approach we are
> leaning
> > >> > towards
> > >> > > > for
> > >> > > > >> the
> > >> > > > >> >> > implementation.
> > >> > > > >> >> >
> > >> > > > >> >>
> > >> > > > >> >> Yes, this is a good question. My take is that a big part
> of
> > >> the
> > >> > > > >> complexity
> > >> > > > >> >> in the old controller code results from the fact that we
> use
> > >> ZK
> > >> > as
> > >> > > a
> > >> > > > >> >> multi-writer database for propagating information between
> > >> > different
> > >> > > > >> >> components. So in the old controller, every write to ZK
> > needs
> > >> to
> > >> > be
> > >> > > > >> >> structured as a compare and swap to be fully correct.
> Every
> > >> time
> > >> > we
> > >> > > > get
> > >> > > > >> >> notified about something, it's usually in the form of
> "this
> > >> znode
> > >> > > > >> changed"
> > >> > > > >> >> which prompts a full reload of part of the data in ZK
> (which
> > >> > itself
> > >> > > > has
> > >> > > > >> >> multiple parts, loading, deserializing, reconciling, etc.)
> > >> That
> > >> > all
> > >> > > > goes
> > >> > > > >> >> away in the new mode, and we just have some code which
> > >> analyzes
> > >> > > > >> >> __cluster_metadata and reflects it in 1) updates to ZK and
> > 2)
> > >> > > > messages
> > >> > > > >> sent
> > >> > > > >> >> out to brokers.
> > >> > > > >> >>
> > >> > > > >> >> This is pretty decoupled from the other logic in
> > >> QuorumController
> > >> > > and
> > >> > > > >> >> should be easy to unit test, since the same inputs from
> the
> > >> log
> > >> > > > always
> > >> > > > >> >> produce the same output in ZK. Basically, ZK is write-only
> > for
> > >> > us,
> > >> > > > we do
> > >> > > > >> >> not read it (with the big exception of broker registration
> > >> > znodes)
> > >> > > > and I
> > >> > > > >> >> think that will greatly simplify things.
> > >> > > > >> >>
> > >> > > > >> >> So I think dual-write mode as described here will be
> > >> > substantially
> > >> > > > >> simpler
> > >> > > > >> >> than trying to run part or all of the old controller in
> > >> > parallel. I
> > >> > > > do
> > >> > > > >> >> think we will reuse a bunch of the serialization /
> > >> > deserialization
> > >> > > > code
> > >> > > > >> for
> > >> > > > >> >> znodes and possibly the code for communicating with ZK.
> > >> > > > >> >>
> > >> > > > >> >> best,
> > >> > > > >> >> Colin
> > >> > > > >> >>
> > >> > > > >> >>
> > >> > > > >> >> >
> > >> > > > >> >> > 14. Good point and make sense.
> > >> > > > >> >> >
> > >> > > > >> >> > Thanks,
> > >> > > > >> >> >
> > >> > > > >> >> > Jun
> > >> > > > >> >> >
> > >> > > > >> >> >
> > >> > > > >> >> >
> > >> > > > >> >> >
> > >> > > > >> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <
> > >> > cmcc...@apache.org
> > >> > > >
> > >> > > > >> wrote:
> > >> > > > >> >> >
> > >> > > > >> >> >> Hi Jun,
> > >> > > > >> >> >>
> > >> > > > >> >> >> Thanks for taking a look. I can answer some questions
> > here
> > >> > > > because I
> > >> > > > >> >> >> collaborated on this a bit, and David is on vacation
> for
> > a
> > >> few
> > >> > > > days.
> > >> > > > >> >> >>
> > >> > > > >> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> > >> > > > >> >> >> > Hi, David,
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > Thanks for the KIP. A few comments below.
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > 10. It's still not very clear to me how the KRaft
> > >> controller
> > >> > > > works
> > >> > > > >> in
> > >> > > > >> >> the
> > >> > > > >> >> >> > dual writes mode to KRaft log and ZK when the brokers
> > >> still
> > >> > > run
> > >> > > > in
> > >> > > > >> ZK
> > >> > > > >> >> >> mode.
> > >> > > > >> >> >> > Does the KRaft controller run a ZK based controller
> in
> > >> > > parallel
> > >> > > > or
> > >> > > > >> do
> > >> > > > >> >> we
> > >> > > > >> >> >> > derive what needs to be written to ZK based on KRaft
> > >> > > controller
> > >> > > > >> logic?
> > >> > > > >> >> >>
> > >> > > > >> >> >> We derive what needs to be written to ZK based on KRaft
> > >> > > controller
> > >> > > > >> >> logic.
> > >> > > > >> >> >>
> > >> > > > >> >> >> > I am also not sure how the KRaft controller handles
> > >> broker
> > >> > > > >> >> >> > registration/deregistration, since brokers are still
> > >> running
> > >> > > in
> > >> > > > ZK
> > >> > > > >> >> mode
> > >> > > > >> >> >> and
> > >> > > > >> >> >> > are not heartbeating to the KRaft controller.
> > >> > > > >> >> >>
> > >> > > > >> >> >> The new controller will listen for broker registrations
> > >> under
> > >> > > > >> /brokers.
> > >> > > > >> >> >> This is the only znode watch that the new controller
> will
> > >> do.
> > >> > > > >> >> >>
> > >> > > > >> >> >> We did consider changing how ZK-based broker
> registration
> > >> > > worked,
> > >> > > > >> but it
> > >> > > > >> >> >> just ended up being too much work for not enough gain.
> > >> > > > >> >> >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > 12. "A new set of nodes will be provisioned to host
> the
> > >> > > > controller
> > >> > > > >> >> >> quorum."
> > >> > > > >> >> >> > I guess we don't support starting the KRaft
> controller
> > >> > quorum
> > >> > > on
> > >> > > > >> >> existing
> > >> > > > >> >> >> > brokers. It would be useful to make that clear.
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >> >> Agreed
> > >> > > > >> >> >>
> > >> > > > >> >> >> > 13. "Once the quorum is established and a leader is
> > >> elected,
> > >> > > the
> > >> > > > >> >> >> controller
> > >> > > > >> >> >> > will check the state of the cluster using the
> > >> MigrationCheck
> > >> > > > RPC."
> > >> > > > >> How
> > >> > > > >> >> >> does
> > >> > > > >> >> >> > the quorum controller detect other brokers? Does the
> > >> > > controller
> > >> > > > >> node
> > >> > > > >> >> need
> > >> > > > >> >> >> > to be configured with ZK connection string? If so, it
> > >> would
> > >> > be
> > >> > > > >> useful
> > >> > > > >> >> to
> > >> > > > >> >> >> > document the additional configs that the quorum
> > >> controller
> > >> > > > needs to
> > >> > > > >> >> set.
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >> >> Yes, the controllers monitor ZK for broker
> registrations,
> > >> as I
> > >> > > > >> mentioned
> > >> > > > >> >> >> above. So they need zk.connect and the other ZK
> > connection
> > >> > > > >> >> configurations.
> > >> > > > >> >> >>
> > >> > > > >> >> >> > 14. "In order to prevent further writes to ZK, the
> > first
> > >> > thing
> > >> > > > the
> > >> > > > >> new
> > >> > > > >> >> >> > KRaft quorum must do is take over leadership of the
> ZK
> > >> > > > controller.
> > >> > > > >> "
> > >> > > > >> >> The
> > >> > > > >> >> >> ZK
> > >> > > > >> >> >> > controller processing changes to /controller update
> > >> > > > asynchronously.
> > >> > > > >> >> How
> > >> > > > >> >> >> > does the KRaft controller know when the ZK controller
> > has
> > >> > > > resigned
> > >> > > > >> >> before
> > >> > > > >> >> >> > it can safely copy the ZK data?
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >> >> This should be done through
> > >> expectedControllerEpochZkVersion,
> > >> > > just
> > >> > > > >> like
> > >> > > > >> >> in
> > >> > > > >> >> >> ZK mode, right? We should bump this epoch value so that
> > any
> > >> > > writes
> > >> > > > >> from
> > >> > > > >> >> the
> > >> > > > >> >> >> old controller will not go through. I agree we should
> > spell
> > >> > this
> > >> > > > out
> > >> > > > >> in
> > >> > > > >> >> the
> > >> > > > >> >> >> KIP.
> > >> > > > >> >> >>
> > >> > > > >> >> >> > 15. We have the following sentences. One says
> > >> ControllerId
> > >> > is
> > >> > > a
> > >> > > > >> random
> > >> > > > >> >> >> > KRaft broker and the other says it's the active
> > >> controller.
> > >> > > > Which
> > >> > > > >> one
> > >> > > > >> >> is
> > >> > > > >> >> >> > correct?
> > >> > > > >> >> >> > "UpdateMetadata: for certain metadata changes, the
> > KRaft
> > >> > > > controller
> > >> > > > >> >> will
> > >> > > > >> >> >> > need to send UpdateMetadataRequests to the ZK
> brokers.
> > >> For
> > >> > the
> > >> > > > >> >> >> > “ControllerId” field in this request, the controller
> > >> should
> > >> > > > >> specify a
> > >> > > > >> >> >> > random KRaft broker."
> > >> > > > >> >> >> > "In the UpdateMetadataRequest sent by the KRaft
> > >> controller
> > >> > to
> > >> > > > the
> > >> > > > >> ZK
> > >> > > > >> >> >> > brokers, the ControllerId will point to the active
> > >> > controller
> > >> > > > which
> > >> > > > >> >> will
> > >> > > > >> >> >> be
> > >> > > > >> >> >> > used for the inter-broker requests."
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >> >> Yeah, this seems like an error to me as well. A random
> > >> value
> > >> > is
> > >> > > > not
> > >> > > > >> >> really
> > >> > > > >> >> >> useful. Plus the text here is self-contradictory, as
> you
> > >> > pointed
> > >> > > > out.
> > >> > > > >> >> >>
> > >> > > > >> >> >> I suspect what we should do here is add a new field,
> > >> > > > >> KRaftControllerId,
> > >> > > > >> >> >> and populate it with the real controller ID, and leave
> > the
> > >> old
> > >> > > > >> >> controllerId
> > >> > > > >> >> >> field as -1. A ZK-based broker that sees this can then
> > >> consult
> > >> > > its
> > >> > > > >> >> >> controller.quorum.voters configuration to see where it
> > >> should
> > >> > > send
> > >> > > > >> >> >> controller-bound RPCs. That (static) configuration lets
> > us
> > >> map
> > >> > > > >> between
> > >> > > > >> >> >> controller ID and host:port.
> > >> > > > >> >> >>
> > >> > > > >> >> >> We should still keep our existing epoch logic for
> > deciding
> > >> > when
> > >> > > > >> >> >> UpdateMetadataRequest / LeaderAndIsrRequests are stale,
> > >> with
> > >> > the
> > >> > > > >> caveat
> > >> > > > >> >> >> that any kraft-based epoch should be treated as greater
> > >> than
> > >> > any
> > >> > > > >> >> ZK-based
> > >> > > > >> >> >> epoch. After all, the kraft epoch is coming from the
> > epoch
> > >> of
> > >> > > > >> >> >> __cluster_metadata, whereas the ZK epoch comes from ZK.
> > >> > > > >> >> >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > 16. "Additionally, the controller must specify if a
> > >> broker
> > >> > in
> > >> > > > >> >> >> “LiveBrokers”
> > >> > > > >> >> >> > is KRaft or ZK." Does that require any protocol
> changes
> > >> to
> > >> > > > >> >> >> UpdateMetadata?
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >> >> Yeah, I am also curious why we need to care whether
> > >> > brokers
> > >> > > > are
> > >> > > > >> ZK
> > >> > > > >> >> or
> > >> > > > >> >> >> KRaft in UpdateMetadataRequest. We don't reveal this to
> > >> > clients,
> > >> > > > so
> > >> > > > >> can
> > >> > > > >> >> we
> > >> > > > >> >> >> just leave this out?
> > >> > > > >> >> >>
> > >> > > > >> >> >> best,
> > >> > > > >> >> >> Colin
> > >> > > > >> >> >>
> > >> > > > >> >> >> > Thanks,
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > Jun
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison <
> > >> > > > >> >> mickael.mai...@gmail.com
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > wrote:
> > >> > > > >> >> >> >
> > >> > > > >> >> >> >> Hi David,
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> Thanks for starting this important KIP.
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> I've just taken a quick look so far but I've got a
> > >> couple
> > >> > of
> > >> > > > >> initial
> > >> > > > >> >> >> >> questions:
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> 1) What happens if a non KRaft compatible broker (or
> > >> with
> > >> > > > >> >> >> >> kafka.metadata.migration.enable set to false) joins
> > the
> > >> > > cluster
> > >> > > > >> after
> > >> > > > >> >> >> >> the migration is triggered?
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> 2) In the Failure Modes section you mention a
> scenario
> > >> > where
> > >> > > a
> > >> > > > >> write
> > >> > > > >> >> >> >> to ZK fails. What happens when the divergence limit
> is
> > >> > > > reached? Is
> > >> > > > >> >> >> >> this a fatal condition? How much divergence should
> we
> > >> > allow?
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> Thanks,
> > >> > > > >> >> >> >> Mickael
> > >> > > > >> >> >> >>
> > >> > > > >> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <
> > >> > > mum...@gmail.com
> > >> > > > >
> > >> > > > >> >> wrote:
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >> > Hey folks, I wanted to get the ball rolling on the
> > >> > > discussion
> > >> > > > >> for
> > >> > > > >> >> the
> > >> > > > >> >> >> >> > ZooKeeper migration KIP. This KIP details how we
> > plan
> > >> to
> > >> > do
> > >> > > > an
> > >> > > > >> >> online
> > >> > > > >> >> >> >> > migration of metadata from ZooKeeper to KRaft as
> > well
> > >> as
> > >> > a
> > >> > > > >> rolling
> > >> > > > >> >> >> >> > upgrade of brokers to KRaft mode.
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >> > The general idea is to keep KRaft and ZooKeeper in
> > >> sync
> > >> > > > during
> > >> > > > >> the
> > >> > > > >> >> >> >> > migration, so both types of brokers can exist
> > >> > > simultaneously.
> > >> > > > >> Then,
> > >> > > > >> >> >> >> > once everything is migrated and updated, we can
> turn
> > >> off
> > >> > > > >> ZooKeeper
> > >> > > > >> >> >> >> > writes.
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >> > This is a pretty complex KIP, so please take a
> look
> > :)
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> > >> > > > >> >> >> >> >
> > >> > > > >> >> >> >> > Thanks!
> > >> > > > >> >> >> >> > David
> > >> > > > >> >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > -David
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > -David
> > >> > >
> > >> >
> > >>
> > >
> > >
> > > --
> > > -David
> > >
> >
> >
> > --
> > -David
> >
>


-- 
-David
