Hi, David,

Thanks for the updated KIP. A few more comments.

30. LeaderAndIsrRequest/StopReplicaRequest both have a controllerId field.
Should we add a KRaftControllerId field to them as well, like the one
proposed for UpdateMetadata?
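
For concreteness, the new field could be a tagged field in those request
schemas, along the lines of the sketch below (the version and tag numbers
are placeholders, mirroring the UpdateMetadata proposal):

  { "name": "KRaftControllerId", "type": "int32", "versions": "7+",
    "taggedVersions": "7+", "tag": 0, "default": "-1",
    "about": "The id of the KRaft controller, or -1 if a ZK controller
              is still active." }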

31. "If a migration has been finalized, but the KRaft quroum comes up with
kafka.metadata.migration.enable, we must not re-enter the migration mode.
In this case, while replaying the log, the controller can see the second
MigrationRecord and know that the migration is finalized and should not be
resumed. " Hmm, do we want to keep the MigrationRecord in the snapshot and
the metadata log forever after migration is finalized? If not, we can't
know for sure whether a migration has happened or not. Also, it might be
useful to support switching back to ZK mode after the migration is
finalized, with the understanding of potential metadata loss. In that case,
we could just trim the entire metadata log and re-copy the ZK metadata back.
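
For reference, a minimal sketch of the MigrationRecord being discussed,
based on the Started/Finished states mentioned in the KIP update below
(the exact schema is of course up to the KIP):

  { "name": "MigrationRecord", "type": "metadata",
    "fields": [
      { "name": "MigrationState", "type": "int8", "versions": "0+",
        "about": "0 = Started, 1 = Finished" }
    ] }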

32. The /migration node in ZK: Do we need last_update_time_ms since the ZK
Stat structure already has an mtime? Also, how do we plan to use the
kraft_controller_id and kraft_controller_epoch fields?
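
For context, my understanding is that the /migration node would hold
something like the JSON below (the field names are from the KIP; the
values here are made up):

  { "kraft_controller_id": 3000,
    "kraft_controller_epoch": 43,
    "last_update_time_ms": 1667300000000 }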

33. Controller migration: We will force a write to the "/controller" and
"/controller_epoch" ZNodes before copying ZK data, right?

34. "Operator can remove the persistent "/controller" and
"/controller_epoch" nodes allowing for ZK controller election to take
place". I guess the operator only needs to remove the /controller path?

Thanks,

Jun

On Mon, Oct 31, 2022 at 7:17 AM David Arthur
<david.art...@confluent.io.invalid> wrote:

> Happy Monday, everyone! I've updated the KIP with the following changes:
>
> * Clarified MetadataType metric usages (broker vs controller)
> * Added ZkMigrationReady tagged field to ApiVersionsResponse (for use by
> KRaft controller quorum; see the sketch after this list)
> * Added MigrationRecord with two states: Started and Finished
> * Documented ZK configs for KRaft controller
> * Simplified state machine description (internally, more states will exist,
> but only the four documented are interesting to operators)
> * Clarified some things in Controller Migration section
> * Removed KRaft -> ZK parts of Broker Registration
> * Added Misconfigurations section to Failure Modes
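>
> As a rough sketch, the new tagged field in the ApiVersionsResponse schema
> could look like this (the tag number and wording are placeholders, not
> final):
>
>   { "name": "ZkMigrationReady", "type": "bool", "versions": "3+",
>     "taggedVersions": "3+", "tag": 3, "ignorable": true,
>     "about": "Whether this controller is configured for the ZK migration." }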
>
> Let me know if I've missed anything from the past two weeks of discussion.
>
> Thanks again to everyone who has reviewed this KIP so far!
>
> -David
>
> On Fri, Oct 28, 2022 at 2:26 PM Jun Rao <j...@confluent.io.invalid> wrote:
>
> > Hi, David,
> >
> > Thanks for the reply.
> >
> > 20/21. Sounds good.
> >
> > Could you update the doc with all the changes being discussed?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Oct 28, 2022 at 10:11 AM David Arthur
> > <david.art...@confluent.io.invalid> wrote:
> >
> > > Jun,
> > >
> > > 20/21. I was also wondering about a "migration" record. In addition to
> > the
> > > scenario you mentioned, we also need a way to prevent the cluster from
> > > re-entering the dual write mode after the migration has been
> finalized. I
> > > could see this happening inadvertently via a change in some
> configuration
> > > management system. How about we add a record that marks the beginning
> and
> > > end of the dual-write mode? The first occurrence of the record could be
> > > included in the metadata transaction when we migrate data from ZK.
> > >
> > > With this, the active controller would decide whether to enter dual
> write
> > > mode, finalize the migration, or fail based on:
> > >
> > > * Metadata log state
> > > * Its own configuration ("kafka.metadata.migration.enable",
> > > "zookeeper.connect", etc)
> > > * The other controllers' configurations (via ApiVersionsResponse)
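> > >
> > > As a rough sketch, that decision might look like the code below (all
> > > names here are hypothetical, not from the KIP):
> > >
> > > class MigrationDecisionSketch {
> > >     enum Decision { NORMAL, BEGIN_MIGRATION, RESUME_DUAL_WRITE,
> > >                     ALREADY_FINALIZED, FAIL }
> > >
> > >     static Decision decide(boolean sawMigrationStart,
> > >                            boolean sawMigrationEnd,
> > >                            boolean selfConfigured,
> > >                            boolean quorumConfigured) {
> > >         boolean enabled = selfConfigured && quorumConfigured;
> > >         // A second record means the migration was finalized; never
> > >         // re-enter dual-write mode.
> > >         if (sawMigrationEnd) return Decision.ALREADY_FINALIZED;
> > >         if (sawMigrationStart) {
> > >             // Dual-write mode was entered previously; only resume it
> > >             // if the whole quorum is still configured for migration.
> > >             return enabled ? Decision.RESUME_DUAL_WRITE : Decision.FAIL;
> > >         }
> > >         // No migration record yet: either start a migration or run as
> > >         // a normal KRaft cluster.
> > >         return enabled ? Decision.BEGIN_MIGRATION : Decision.NORMAL;
> > >     }
> > > }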
> > >
> > > WDYT?
> > >
> > > 22. Since we will need the fencing anyway as a safeguard, I agree we
> > > could skip the registration of KRaft brokers in ZK to simplify things
> > > a bit.
> > >
> > > Thanks,
> > > David
> > >
> > >
> > >
> > > On Thu, Oct 27, 2022 at 5:11 PM Jun Rao <j...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, David,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 20/21. Relying upon the presence of ZK configs to determine whether
> the
> > > > KRaft controller is in dual-write mode seems a bit error-prone. If
> > > > someone accidentally adds a ZK configuration to a brand new KRaft
> > > cluster,
> > > > ideally it shouldn't cause the controller to get into a weird state.
> > Have
> > > > we considered storing the migration state in a metadata record?
> > > >
> > > > 22. If we have the broker fencing logic, do we need to write the
> broker
> > > > registration path in ZK for KRaft brokers at all?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Thu, Oct 27, 2022 at 1:02 PM David Arthur
> > > > <david.art...@confluent.io.invalid> wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > 20/21. A KRaft controller will recover the migration state by
> reading
> > > the
> > > > > "/migration" ZNode. If the migration enable config is set, and the
> ZK
> > > > > migration is complete, it will enter the dual-write mode. Before an
> > > > > operator can decommission ZK, they will need to finalize the
> > migration
> > > > > which involves removing the migration config and the ZK config.
> I'll
> > > > > clarify this in the KIP.
> > > > >
> > > > > 22. Yea, we could see an incorrect broker ID during that window.
> If
> > we
> > > > > ended up with a state where we saw a ZK broker ID that conflicted
> > with
> > > a
> > > > > KRaft broker ID, we would need to fence one of them. I would
> probably
> > > opt
> > > > > to fence the KRaft broker in that case since broker registration
> and
> > > > > fencing is more robust in KRaft. Hopefully this is a rare case.
> > > > >
> > > > > 26. Sounds good.
> > > > >
> > > > > Thanks!
> > > > > David
> > > > >
> > > > >
> > > > > On Thu, Oct 27, 2022 at 1:34 PM Jun Rao <j...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, David,
> > > > > >
> > > > > > Thanks for the reply. A few more comments.
> > > > > >
> > > > > > 20/21. Using a tagged field in ApiVersionsResponse could work.
> > Related
> > > to
> > > > > > this, how does a KRaft controller know that it's in the dual
> write
> > > > mode?
> > > > > > Does it need to read the /controller path from ZK? After the
> > > migration,
> > > > > > people may have the ZK cluster decommissioned, but still have the
> > ZK
> > > > > > configs left in the KRaft controller. Will this cause the KRaft
> > > > > controller
> > > > > > to be stuck because it doesn't know which mode it is in?
> > > > > >
> > > > > > 22. Using the ephemeral node matches the current ZK-based broker
> > > > behavior
> > > > > > better. However, it leaves a window for incorrect broker
> > registration
> > > > to
> > > > > > sneak in during KRaft controller failover.
> > > > > >
> > > > > > 26. Then, we could just remove Broker Registration in that
> section.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Wed, Oct 26, 2022 at 2:21 PM David Arthur
> > > > > > <david.art...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > 20/21 It could definitely cause problems if we fail over to a
> > > > controller
> > > > > > > without "kafka.metadata.migration.enable". The only mechanism I
> > > know
> > > > of
> > > > > > for
> > > > > > > controllers to learn things about one another is ApiVersions.
> We
> > > > > > currently
> > > > > > > use this for checking support for "metadata.version" (in KRaft
> > > mode).
> > > > > We
> > > > > > > could add a "zk.migration" feature flag that's enabled on a
> > > > controller
> > > > > > only
> > > > > > > if the config is set. Another possibility would be a tagged
> field
> > > on
> > > > > > > ApiVersionsResponse that indicated if the config was set. Both
> > seem
> > > > > > somewhat
> > > > > > > inelegant. I think a tagged field would be a bit simpler (and
> > > > arguably
> > > > > > less
> > > > > > > hacky).
> > > > > > >
> > > > > > > For 20, we could avoid entering the migration state unless the
> > > whole
> > > > > > quorum
> > > > > > > had the field present in their NodeApiVersions. For 21, we
> could
> > > > avoid
> > > > > > > leaving the migration state unless the whole quorum did not
> have
> > > the
> > > > > > field
> > > > > > > in their NodeApiVersions. Do you think this would be
> sufficient?
> > > > > > >
> > > > > > > 22. Right, we need to write the broker info back to ZK just as
> a
> > > > > > safeguard
> > > > > > > against incorrect broker IDs getting registered into ZK. I was
> > > > thinking
> > > > > > > these would be persistent nodes, but it's probably fine to make
> > > them
> > > > > > > ephemeral and have the active KRaft controller keep them up to
> > > date.
> > > > > > >
> > > > > > > 23. Right. When the broker comes up as a KRaft broker, its old
> > > > > > > /brokers/ids ZNode will be gone and it will register itself
> with
> > > the
> > > > > > KRaft
> > > > > > > controller. The controller will know that it is now in KRaft
> mode
> > > and
> > > > > > will
> > > > > > > stop sending it the old RPCs.
> > > > > > >
> > > > > > > 24. Ok, I'll add these
> > > > > > >
> > > > > > > 25. I realize now the "/controller_epoch" node is already
> > > persistent.
> > > > > It
> > > > > > > should be sufficient to remove the "/controller" node to
> trigger
> > an
> > > > > > > election.
> > > > > > >
> > > > > > > 26. Hmm, not sure, but I don't think it uses watches.
> KafkaServer
> > > > > > registers
> > > > > > > the broker info and later loads all brokers in the cluster to
> > check
> > > > > that
> > > > > > > the endpoints don't conflict. Things that do use watches are
> > > dynamic
> > > > > > > configs, ACLs, and some others (listed in the KIP).
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Oct 26, 2022 at 4:26 PM David Arthur <
> > > > > david.art...@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Luke and Andrew, thanks for taking a look!
> > > > > > > >
> > > > > > > > I think the names of the state machine in the KIP aren't the
> > > best.
> > > > > I'll
> > > > > > > > try to improve that section in general.
> > > > > > > >
> > > > > > > > 1. "MigrationReady" is probably better named
> "MigratingFromZk"
> > or
> > > > > > > > something. It's meant to be the state when the KRaft
> controller
> > > is
> > > > > > > actively
> > > > > > > > migrating the data out of ZK. This happens after we have
> > detected
> > > > > that
> > > > > > > the
> > > > > > > > cluster is eligible and before we enter the dual write mode.
> > > > > > > >
> > > > > > > > 2. "MigrationActive" should probably be called
> > "BrokerMigration".
> > > > > > > > Transitioning to "MigrationFinished" can happen automatically
> > > when
> > > > > all
> > > > > > > the
> > > > > > > > known ZK brokers have been migrated. Since we don't have
> > > permanent
> > > > > > > > registrations of ZK brokers, we can use the partition
> > assignments
> > > > as
> > > > > a
> > > > > > > > proxy for this.
> > > > > > > >
> > > > > > > > 3. A metric for unready brokers makes sense. We can also log
> a
> > > > > message
> > > > > > on
> > > > > > > > the controller when it tries to start the migration.
> > > > > > > >
> > > > > > > > 4. The MetadataType metric is meant to help see this. E.g.,
> > some
> > > > > > brokers
> > > > > > > > would have MetadataType=ZK, some would have MetadataType=Dual
> > > > > > > >
> > > > > > > > 5. On ZK brokers, not having this config set would prevent
> the
> > > new
> > > > > > broker
> > > > > > > > registration data from being written to ZK. This means the
> > KRaft
> > > > > > > controller
> > > > > > > > won't send RPCs to it. If a broker has been migrated to KRaft
> > > > > already,
> > > > > > > I'm
> > > > > > > > not sure there is any harm in removing this config. If we
> > decide
> > > > that
> > > > > > we
> > > > > > > > need to guarantee that KRaft brokers have this config set
> > during
> > > > the
> > > > > > > > migration, we can include it in the broker registration
> that's
> > > sent
> > > > > to
> > > > > > > > KRaft. This would let the controller keep that broker fenced.
> > > > > > > >
> > > > > > > > 6. Once the KRaft controller takes leadership, the ZK
> > controller
> > > > > won't
> > > > > > be
> > > > > > > > active any more and will stop reporting the MigrationState
> > > metric.
> > > > > > > >
> > > > > > > > 7. The "Dual" MetadataType is reported by brokers running in
> > > KRaft
> > > > > mode
> > > > > > > > when the migration enable config is set. I think the
> controller
> > > > > should
> > > > > > > also
> > > > > > > > report this, not just the brokers. I'll clarify in the KIP.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Andrew:
> > > > > > > >
> > > > > > > > > How will the code get all the ZooKeeper config? Will it do
> > some
> > > > > sort
> > > > > > of
> > > > > > > > scan of the ZooKeeper data store? ... Do we need to do
> anything
> > > > > special
> > > > > > > to
> > > > > > > > make sure we get all data from ZooKeeper?
> > > > > > > >
> > > > > > > > With a few exceptions, all data written to ZK by Kafka
> happens
> > on
> > > > the
> > > > > > > > controller (single writer principle). In our migration code,
> we
> > > can
> > > > > > > > unconditionally update the "/controller" and
> > "/controller_epoch"
> > > > > ZNodes
> > > > > > > and
> > > > > > > > effectively force the migration component to gain leadership
> of
> > > the
> > > > > ZK
> > > > > > > > controller. Once this happens, we don't expect any data to be
> > > > written
> > > > > > to
> > > > > > > > ZK, so we can read it iteratively without worrying about any
> > > > > concurrent
> > > > > > > > writes.
> > > > > > > >
> > > > > > > > As for the linearizable thing, I believe this just means that
> > > reads
> > > > > may
> > > > > > > be
> > > > > > > > served by a quorum follower which has stale data. We'll be
> > > reading
> > > > > from
> > > > > > > ZK
> > > > > > > > the same way the ZK controller does, so I think we will be
> fine
> > > > > > regarding
> > > > > > > > consistency. If we wanted to be extra careful, we could add a
> > > delay
> > > > > > prior
> > > > > > > > to iterating through the znodes to give partitioned ZK
> > followers
> > > a
> > > > > > chance
> > > > > > > > to get kicked out of the quorum. I don't think we'll need
> that
> > > > though
> > > > > > > >
> > > > > > > > > What happens if we commit the transaction then fail right
> > after
> > > > > that
> > > > > > > and
> > > > > > > > restart?
> > > > > > > >
> > > > > > > > If we commit the "migration" transaction to the metadata log,
> > it
> > > > > won't
> > > > > > be
> > > > > > > > a problem if we failover or restart. We can recover our state
> > > based
> > > > > on
> > > > > > > the
> > > > > > > > metadata log and what exists in ZK. If we see a migration
> > > > transaction
> > > > > > in
> > > > > > > > the kraft log, but no "/migration" ZNode, we'll know we
> failed
> > > > before
> > > > > > > > writing to ZK. If we see a partial transaction in the log,
> then
> > > we
> > > > > can
> > > > > > > > abort it and restart the migration.
> > > > > > > >
> > > > > > > > > This sort of leads to me wondering if we will/should check
> > the
> > > > > > metadata
> > > > > > > > log state before doing the migration? That could also catch
> > > > mistakes
> > > > > > such
> > > > > > > > as if a KRaft quorum is used that already had some metadata
> > > which I
> > > > > > > assume
> > > > > > > > we want to prevent.
> > > > > > > >
> > > > > > > > Yes, we should check that the metadata log is empty before
> > > > > attempting a
> > > > > > > > migration (perhaps excepting the bootstrap metadata -- need
> to
> > > > think
> > > > > on
> > > > > > > > that one still).
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Oct 24, 2022 at 8:57 AM Andrew Grant
> > > > > > <agr...@confluent.io.invalid
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hey David,
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks for the KIP. I had a few small questions.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> “The ZK data migration will copy the existing ZK data into
> the
> > > > KRaft
> > > > > > > >> metadata log and establish the new KRaft active controller
> as
> > > the
> > > > > > active
> > > > > > > >> controller from a ZK perspective.”
> > > > > > > >>
> > > > > > > >> How will the code get all the ZooKeeper config? Will it do
> > some
> > > > sort
> > > > > > of
> > > > > > > >> scan of the ZooKeeper data store? Also I’m not a ZooKeeper
> > > expert
> > > > > but
> > > > > > > per
> > > > > > > >>
> > > https://zookeeper.apache.org/doc/current/zookeeperInternals.html
> > > > > > “Read
> > > > > > > >> operations in ZooKeeper are *not linearizable* since they
> can
> > > > return
> > > > > > > >> potentially stale data.” Do we need to do anything special
> to
> > > make
> > > > > > sure
> > > > > > > we
> > > > > > > >> get all data from ZooKeeper?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> “For the initial migration, the controller will utilize
> > KIP-868
> > > > > > Metadata
> > > > > > > >> Transactions
> > > > > > > >> <
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-868+Metadata+Transactions
> > > > > > > >> >
> > > > > > > >> to
> > > > > > > >> write all of the ZK metadata in a single transaction. If the
> > > > > > controller
> > > > > > > >> fails before this transaction is finalized, the next active
> > > > > controller
> > > > > > > >> will
> > > > > > > >> abort the transaction and restart the migration process.”
> What
> > > > > happens
> > > > > > > if
> > > > > > > >> we commit the transaction then fail right after that and
> > > restart?
> > > > > This
> > > > > > > >> sort
> > > > > > > >> of leads to me wondering if we will/should check the
> metadata
> > > log
> > > > > > state
> > > > > > > >> before doing the migration? That could also catch mistakes
> > such
> > > as
> > > > > if
> > > > > > a
> > > > > > > >> KRaft quorum is used that already had some metadata which I
> > > assume
> > > > > we
> > > > > > > want
> > > > > > > >> to prevent.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> For the Test Plan section do you think it’s worth calling
> out
> > > > > > > performance
> > > > > > > >> testing migrations of ZooKeeper deployments that have
> “large”
> > > > > amounts
> > > > > > of
> > > > > > > >> metadata?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >>
> > > > > > > >> Andrew
> > > > > > > >>
> > > > > > > >> On Mon, Oct 24, 2022 at 3:20 AM Luke Chen <
> show...@gmail.com>
> > > > > wrote:
> > > > > > > >>
> > > > > > > >> > Hi David,
> > > > > > > >> >
> > > > > > > >> > Thanks for the good and complicated proposal! :)
> > > > > > > >> > Some questions:
> > > > > > > >> >
> > > > > > > >> > 1. "MigrationReady" state: The KIP said:
> > > > > > > >> > The KRaft quorum has been started
> > > > > > > >> > I don't think we'll automatically enter this state after
> > KRaft
> > > > > > quorum
> > > > > > > >> > started, do we?
> > > > > > > >> > I think KRaft active controller should take over
> leadership
> > by
> > > > > > > writing a
> > > > > > > >> > znode in /controller path before we enter this state, is
> > > that
> > > > > > > correct?
> > > > > > > >> >
> > > > > > > >> > 2. "MigrationActive" state: the KIP said:
> > > > > > > >> > ZK state has been migrated, controller is in dual-write
> > mode,
> > > > > > brokers
> > > > > > > >> are
> > > > > > > >> > being restarted in KRaft mode
> > > > > > > >> > What confuses me is the last part: "brokers are being
> > > restarted
> > > > in
> > > > > > > KRaft
> > > > > > > >> > mode".
> > > > > > > >> > How could we detect brokers are being restarted in KRaft
> > mode?
> > > > Old
> > > > > > ZK
> > > > > > > >> > broker is removed and new KRaft broker is up within N
> > minutes?
> > > > > > > >> > I think we don't have to rely on the condition "brokers
> are
> > > > being
> > > > > > > >> restarted
> > > > > > > >> > in KRaft mode" to enter this state.
> > > > > > > >> > "brokers are being restarted in KRaft mode" should be a
> > > process
> > > > > > > happened
> > > > > > > >> > between "MigrationActive" and "MigrationFinished". Does
> that
> > > > make
> > > > > > > sense?
> > > > > > > >> >
> > > > > > > >> > 3. When "Zookeeper" mode trying to enter
> > "MigrationEligible",
> > > if
> > > > > > there
> > > > > > > >> is a
> > > > > > > >> > cluster with tens of brokers, how could users know why
> this
> > > > > cluster
> > > > > > > >> cannot
> > > > > > > >> > be in "MigrationEligible" state, yet? Check that znode
> > > manually
> > > > > one
> > > > > > by
> > > > > > > >> one?
> > > > > > > >> > Or do we plan to create a tool to help them? Or maybe
> expose
> > > the
> > > > > > > >> "unready
> > > > > > > >> > ZK brokers" metrics?
> > > > > > > >> >
> > > > > > > >> > 4. Same for "MigrationActive" entering "MigrationFinished"
> > > > state.
> > > > > > > Since
> > > > > > > >> > that will be some process for restarting ZK broker into
> > KRaft
> > > > > broker
> > > > > > > >> one by
> > > > > > > >> > one, could we expose the "remaining ZK brokers" as
> metrics?
> > > > > > > >> >
> > > > > > > >> > 5. When users are in
> > > > > > > >> "MigrationReady"/"MigrationActive"/"MigrationFinished"
> > > > > > > >> > states, and they accidentally change the KRaft controller
> > > > config:
> > > > > > > >> > "kafka.metadata.migration.enable"
> > > > > > > >> > to false. What will happen to this cluster? No dual-write
> > for
> > > > it?
> > > > > > Will
> > > > > > > >> we
> > > > > > > >> > have any protection for it?
> > > > > > > >> >
> > > > > > > >> > 6. About the "MigrationState" metric:
> > > > > > > >> > The "ZooKeeper" and "MigrationEligible" is reported by ZK
> > > > > > controller,
> > > > > > > >> and
> > > > > > > >> > the rest of states are reported by KRaft controller -->
> > makes
> > > > > sense
> > > > > > > to
> > > > > > > >> me.
> > > > > > > >> > One question from it is, when KRaft controller takes over
> > the
> > > > > > > leadership
> > > > > > > >> > from ZK controller, what will the "MigrationState" value
> in
> > > old
> > > > ZK
> > > > > > > >> > controller? keep in "MigrationEligible" doesn't make
> sense.
> > > Will
> > > > > > there
> > > > > > > >> be a
> > > > > > > >> > empty or null state?
> > > > > > > >> >
> > > > > > > >> > 7. About the "MetadataType" metric:
> > > > > > > >> > An enumeration of: ZooKeeper (1), Dual (2), KRaft (3).
> Each
> > > > broker
> > > > > > > >> reports
> > > > > > > >> > this.
> > > > > > > >> > I don't know how we could map the migration state to
> these 3
> > > > > types.
> > > > > > > >> > What is the metadataType when cluster in "MigrationReady"
> > > state?
> > > > > > Still
> > > > > > > >> > Zookeeper?
> > > > > > > >> > When will brokers enter Dual type?
> > > > > > > >> > This is unclear in the KIP.
> > > > > > > >> >
> > > > > > > >> > Thank you.
> > > > > > > >> > Luke
> > > > > > > >> >
> > > > > > > >> > On Thu, Oct 20, 2022 at 11:33 PM David Arthur
> > > > > > > >> > <david.art...@confluent.io.invalid> wrote:
> > > > > > > >> >
> > > > > > > >> > > Igor, thanks for taking a look! Since JBOD in KRaft is
> > still
> > > > > under
> > > > > > > >> > > discussion and not likely to land before the ZK
> > migration, I
> > > > > think
> > > > > > > >> we'll
> > > > > > > >> > > need to defer it. For migrating JBOD clusters from ZK to
> > > > KRaft,
> > > > > > > we'll
> > > > > > > >> > also
> > > > > > > >> > > need to address the log dir failure mechanism which
> > > currently
> > > > > > uses a
> > > > > > > >> > > special ZNode written to by the brokers. There is an old
> > KIP
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > >> > > which proposes a new RPC to move that ZK write to the
> > > > > controller,
> > > > > > > but
> > > > > > > >> I'm
> > > > > > > >> > > not sure if that's the approach we'll want to take. I
> read
> > > > > through
> > > > > > > the
> > > > > > > >> > > discussion over in KIP-858 and it sounds like there are
> > some
> > > > > good
> > > > > > > >> ideas
> > > > > > > >> > > there.
> > > > > > > >> > >
> > > > > > > >> > > To answer your question more directly, migrating ZK
> > clusters
> > > > > using
> > > > > > > >> JBOD
> > > > > > > >> > is
> > > > > > > >> > > out of scope for this KIP. It might be possible to stop
> > > using
> > > > > JBOD
> > > > > > > >> using
> > > > > > > >> > > reassignments, but I'm not sure about that.
> > > > > > > >> > >
> > > > > > > >> > > -David
> > > > > > > >> > >
> > > > > > > >> > > On Tue, Oct 18, 2022 at 12:17 PM Igor Soarez <
> i...@soarez.me
> > >
> > > > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Hi David,
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks for the KIP, this is very exciting!
> > > > > > > >> > > >
> > > > > > > >> > > > How does JBOD relate to this work? KRaft mode doesn't
> > yet
> > > > > > support
> > > > > > > >> > > > configuring brokers with multiple log directories. If
> > the
> > > > > > brokers
> > > > > > > in
> > > > > > > >> > the
> > > > > > > >> > > > existing cluster are configured with multiple log
> dirs,
> > > does
> > > > > the
> > > > > > > >> > > migration
> > > > > > > >> > > > imply that the existing brokers need to drop use of
> that
> > > > > > feature?
> > > > > > > >> Or is
> > > > > > > >> > > > there some way to upgrade them later?
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > >
> > > > > > > >> > > > --
> > > > > > > >> > > > Igor
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Oct 17, 2022, at 10:07 PM, David Arthur wrote:
> > > > > > > >> > > > > I've updated the KIP with the following changes (the
> > > > > > confluence
> > > > > > > >> diff
> > > > > > > >> > is
> > > > > > > >> > > > not
> > > > > > > >> > > > > helpful here since i rearranged some things)
> > > > > > > >> > > > >
> > > > > > > >> > > > > * Added ZooKeeperBlockingKRaftMillis metric
> > > > > > > >> > > > > * Added section on new broker registration JSON
> > > > > > > >> > > > > * Removed section on MigrationCheck RPC
> > > > > > > >> > > > > * Added change to UpdateMetadataRequest
> > > > > > > >> > > > > * Added section "Additional ZK Broker Configs"
> > (includes
> > > > > > configs
> > > > > > > >> to
> > > > > > > >> > > > connect
> > > > > > > >> > > > > to KRaft quorum)
> > > > > > > >> > > > > * Added section on "Incompatible Brokers" under
> > Failure
> > > > > Modes
> > > > > > > >> > > > > * Clarified many things per this discussion thread
> > > > > > > >> > > > >
> > > > > > > >> > > > > I realized we need the KRaft controller to pick the
> > > > correct
> > > > > > > >> > > > > "metadata.version" when initializing the migration.
> I
> > > > > included
> > > > > > > the
> > > > > > > >> > IBP
> > > > > > > >> > > > of a
> > > > > > > >> > > > > broker in its registration data so the KRaft
> > controller
> > > > can
> > > > > > > verify
> > > > > > > >> > the
> > > > > > > >> > > > IBP
> > > > > > > >> > > > > and pick the correct "metadata.version" when
> starting
> > > the
> > > > > > > >> migration.
> > > > > > > >> > > > > Otherwise, we could inadvertently downgrade the
> > > > > > > >> IBP/metadata.version
> > > > > > > >> > as
> > > > > > > >> > > > > part of the migration.
> > > > > > > >> > > > >
> > > > > > > >> > > > > I also added "clusterId" to the broker registration
> > data
> > > > so
> > > > > > the
> > > > > > > >> KRaft
> > > > > > > >> > > > could
> > > > > > > >> > > > > verify it.
> > > > > > > >> > > > >
> > > > > > > >> > > > > -David
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <
> > > > > > > cmcc...@apache.org
> > > > > > > >> >
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > >> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
> > > > > > > >> > > > >> > Hi, Colin,
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > 10. "That all goes away in the new mode, and we
> > just
> > > > have
> > > > > > > some
> > > > > > > >> > code
> > > > > > > >> > > > which
> > > > > > > >> > > > >> > analyzes __cluster_metadata and reflects it in 1)
> > > > updates
> > > > > > to
> > > > > > > ZK
> > > > > > > >> > and
> > > > > > > >> > > 2)
> > > > > > > >> > > > >> > messages sent out to brokers."
> > > > > > > >> > > > >> > Hmm, I am not sure it's that simple. Some of the
> > > > > complexity
> > > > > > > of
> > > > > > > >> the
> > > > > > > >> > > > >> ZK-based
> > > > > > > >> > > > >> > controller are (1) using the pipelining approach
> to
> > > > write
> > > > > > to
> > > > > > > ZK
> > > > > > > >> > for
> > > > > > > >> > > > >> better
> > > > > > > >> > > > >> > throughput and using conditional writes for
> > > > correctness;
> > > > > > (2)
> > > > > > > >> > sending
> > > > > > > >> > > > the
> > > > > > > >> > > > >> > proper LeaderAndIsr and UpdateMetadata requests.
> > For
> > > > > > example,
> > > > > > > >> > during
> > > > > > > >> > > > >> > controller failover, the full metadata needs to
> be
> > > sent
> > > > > > while
> > > > > > > >> > during
> > > > > > > >> > > > >> > individual broker failure, only some of the
> > metadata
> > > > > needs
> > > > > > to
> > > > > > > >> be
> > > > > > > >> > > > updated.
> > > > > > > >> > > > >> > The controlled shutdown handling sometimes uses
> > > > > > > >> StopReplicaRequest
> > > > > > > >> > > > and
> > > > > > > >> > > > >> > some other times uses LeaderAndIsrRequest. (3)
> > > > triggering
> > > > > > new
> > > > > > > >> > events
> > > > > > > >> > > > >> based
> > > > > > > >> > > > >> > on the responses of LeaderAndIsr (e.g. for topic
> > > > > deletion).
> > > > > > > >> Some
> > > > > > > >> > of
> > > > > > > >> > > > those
> > > > > > > >> > > > >> > complexity could be re-implemented in a more
> > > efficient
> > > > > way,
> > > > > > > >> but we
> > > > > > > >> > > > need
> > > > > > > >> > > > >> to
> > > > > > > >> > > > >> > be really careful not to generate regression.
> Some
> > of
> > > > the
> > > > > > > other
> > > > > > > >> > > > >> complexity
> > > > > > > >> > > > >> > just won't go away. Reimplementing all those
> logic
> > > for
> > > > > the
> > > > > > 30
> > > > > > > >> or
> > > > > > > >> > so
> > > > > > > >> > > > >> events
> > > > > > > >> > > > >> > in the ZK-based controller is possible, but
> seems a
> > > bit
> > > > > > > >> daunting
> > > > > > > >> > and
> > > > > > > >> > > > >> risky.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> Hi Jun,
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> Thanks for the response.  I agree that there is
> some
> > > work
> > > > > > here
> > > > > > > >> but I
> > > > > > > >> > > > don't
> > > > > > > >> > > > >> think it's as bad as it might seem. Most of the
> code
> > > for
> > > > > > > writing
> > > > > > > >> to
> > > > > > > >> > ZK
> > > > > > > >> > > > can
> > > > > > > >> > > > >> be reused from the old controller. We should at
> least
> > > > > > evaluate
> > > > > > > >> using
> > > > > > > >> > > the
> > > > > > > >> > > > >> old ControllerChannelManager as well. I don't know
> if
> > > > we'll
> > > > > > end
> > > > > > > >> up
> > > > > > > >> > > > doing it
> > > > > > > >> > > > >> or not but it's a possibility. (The main reason to
> > not
> > > do
> > > > > it
> > > > > > is
> > > > > > > >> that
> > > > > > > >> > > the
> > > > > > > >> > > > >> response handling will be a bit different)
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> The question of what to do during a controller
> > failover
> > > > is
> > > > > an
> > > > > > > >> > > > interesting
> > > > > > > >> > > > >> one. Technically, we should be able to continue
> > sending
> > > > > > > >> incremental
> > > > > > > >> > > > updates
> > > > > > > >> > > > >> to the legacy nodes, for the same reason we can in
> > > KRaft
> > > > > > mode.
> > > > > > > >> > > However,
> > > > > > > >> > > > >> probably we should just copy what ZK mode does and
> > send
> > > > > them
> > > > > > a
> > > > > > > >> full
> > > > > > > >> > > > >> metadata update. This will allow us to have an easy
> > way
> > > > to
> > > > > > > handle
> > > > > > > >> > > > >> divergences caused by lost RPCs by changing the
> > > > controller
> > > > > > > (just
> > > > > > > >> as
> > > > > > > >> > we
> > > > > > > >> > > > do
> > > > > > > >> > > > >> in ZK mode, unfortunately). I suppose we should
> > > document
> > > > > this
> > > > > > > in
> > > > > > > >> the
> > > > > > > >> > > > KIP...
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> I agree controlled shutdown is tricky to get just
> > > right.
> > > > I
> > > > > > > >> suppose
> > > > > > > >> > > this
> > > > > > > >> > > > is
> > > > > > > >> > > > >> a case where the RPCs we send out are not purely
> > "fire
> > > > and
> > > > > > > >> forget";
> > > > > > > >> > we
> > > > > > > >> > > > have
> > > > > > > >> > > > >> to listen for the response. But that can be done in
> > an
> > > > > > > >> event-based
> > > > > > > >> > > way.
> > > > > > > >> > > > >> Controlled shutdown is also probably the last thing
> > > we'll
> > > > > > > >> implement
> > > > > > > >> > > > once we
> > > > > > > >> > > > >> have the basic lift and shift.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> Topic deletion is another annoying thing. I wonder
> if
> > > we
> > > > > > could
> > > > > > > >> just
> > > > > > > >> > > > delete
> > > > > > > >> > > > >> immediately here and skip the complexity of
> > > implementing
> > > > > > > >> "deleting
> > > > > > > >> > > > state."
> > > > > > > >> > > > >> Topic IDs will exist in IBP 3.4, in both ZK and
> KRaft
> > > > mode,
> > > > > > so
> > > > > > > it
> > > > > > > >> > > > should be
> > > > > > > >> > > > >> OK, right?
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> best,
> > > > > > > >> > > > >> Colin
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > Thanks,
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > Jun
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <
> > > > > > > >> cmcc...@apache.org>
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
> > > > > > > >> > > > >> >> > Hi, Colin,
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > Thanks for the reply.
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > 10. This is a bit on the implementation side.
> If
> > > you
> > > > > > look
> > > > > > > at
> > > > > > > >> > the
> > > > > > > >> > > > >> existing
> > > > > > > >> > > > >> >> > ZK-based controller, most of the logic is
> around
> > > > > > > >> maintaining an
> > > > > > > >> > > > >> in-memory
> > > > > > > >> > > > >> >> > state of all the resources (broker, topic,
> > > > partition,
> > > > > > > etc),
> > > > > > > >> > > > >> >> reading/writing
> > > > > > > >> > > > >> >> > to ZK, sending LeaderAndIsr and UpdateMetadata
> > > > > requests
> > > > > > > and
> > > > > > > >> > > > handling
> > > > > > > >> > > > >> the
> > > > > > > >> > > > >> >> > responses to brokers. So we need all that
> logic
> > in
> > > > the
> > > > > > > dual
> > > > > > > >> > write
> > > > > > > >> > > > >> mode.
> > > > > > > >> > > > >> >> One
> > > > > > > >> > > > >> >> > option is to duplicate all that logic in some
> > new
> > > > > code.
> > > > > > > This
> > > > > > > >> > can
> > > > > > > >> > > > be a
> > > > > > > >> > > > >> bit
> > > > > > > >> > > > >> >> > error prone and makes the code a bit harder to
> > > > > maintain
> > > > > > if
> > > > > > > >> we
> > > > > > > >> > > need
> > > > > > > >> > > > to
> > > > > > > >> > > > >> fix
> > > > > > > >> > > > >> >> > some critical issues in ZK-based controllers.
> > > > Another
> > > > > > > >> option is
> > > > > > > >> > > to
> > > > > > > >> > > > try
> > > > > > > >> > > > >> >> > reusing the existing code in the ZK-based
> > > > controller.
> > > > > > For
> > > > > > > >> > > example,
> > > > > > > >> > > > we
> > > > > > > >> > > > >> >> could
> > > > > > > >> > > > >> >> > start the EventManager in the ZK-based
> > controller,
> > > > but
> > > > > > let
> > > > > > > >> the
> > > > > > > >> > > > KRaft
> > > > > > > >> > > > >> >> > controller ingest new events. This has its own
> > > > > > challenges:
> > > > > > > >> (1)
> > > > > > > >> > > the
> > > > > > > >> > > > >> >> existing
> > > > > > > >> > > > >> >> > logic only logs ZK failures and doesn't expose
> > > them
> > > > to
> > > > > > the
> > > > > > > >> > > caller;
> > > > > > > >> > > > (2)
> > > > > > > >> > > > >> >> the
> > > > > > > >> > > > >> >> > existing logic may add new events to the queue
> > > > itself
> > > > > > and
> > > > > > > we
> > > > > > > >> > > > probably
> > > > > > > >> > > > >> >> need
> > > > > > > >> > > > >> >> > to think through how this is coordinated with
> > the
> > > > > KRaft
> > > > > > > >> > > controller;
> > > > > > > >> > > > >> (3)
> > > > > > > >> > > > >> >> it
> > > > > > > >> > > > >> >> > registers some ZK listeners unnecessarily (may
> > not
> > > > be
> > > > > a
> > > > > > > big
> > > > > > > >> > > > concern).
> > > > > > > >> > > > >> So
> > > > > > > >> > > > >> >> we
> > > > > > > >> > > > >> >> > need to get around those issues somehow. I am
> > > > > wondering
> > > > > > if
> > > > > > > >> we
> > > > > > > >> > > have
> > > > > > > >> > > > >> >> > considered both options and which approach we
> > are
> > > > > > leaning
> > > > > > > >> > towards
> > > > > > > >> > > > for
> > > > > > > >> > > > >> the
> > > > > > > >> > > > >> >> > implementation.
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >> Yes, this is a good question. My take is that a
> > big
> > > > part
> > > > > > of
> > > > > > > >> the
> > > > > > > >> > > > >> complexity
> > > > > > > >> > > > >> >> in the old controller code results from the fact
> > > that
> > > > we
> > > > > > use
> > > > > > > >> ZK
> > > > > > > >> > as
> > > > > > > >> > > a
> > > > > > > >> > > > >> >> multi-writer database for propagating
> information
> > > > > between
> > > > > > > >> > different
> > > > > > > >> > > > >> >> components. So in the old controller, every
> write
> > to
> > > > ZK
> > > > > > > needs
> > > > > > > >> to
> > > > > > > >> > be
> > > > > > > >> > > > >> >> structured as a compare and swap to be fully
> > > correct.
> > > > > > Every
> > > > > > > >> time
> > > > > > > >> > we
> > > > > > > >> > > > get
> > > > > > > >> > > > >> >> notified about something, it's usually in the
> form
> > > of
> > > > > > "this
> > > > > > > >> znode
> > > > > > > >> > > > >> changed"
> > > > > > > >> > > > >> >> which prompts a full reload of part of the data
> in
> > > ZK
> > > > > > (which
> > > > > > > >> > itself
> > > > > > > >> > > > has
> > > > > > > >> > > > >> >> multiple parts, loading, deserializing,
> > reconciling,
> > > > > etc.)
> > > > > > > >> That
> > > > > > > >> > all
> > > > > > > >> > > > goes
> > > > > > > >> > > > >> >> away in the new mode, and we just have some code
> > > which
> > > > > > > >> analyzes
> > > > > > > >> > > > >> >> __cluster_metadata and reflects it in 1) updates
> > to
> > > ZK
> > > > > and
> > > > > > > 2)
> > > > > > > >> > > > messages
> > > > > > > >> > > > >> sent
> > > > > > > >> > > > >> >> out to brokers.
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >> This is pretty decoupled from the other logic in
> > > > > > > >> QuorumController
> > > > > > > >> > > and
> > > > > > > >> > > > >> >> should be easy to unit test, since the same
> inputs
> > > > from
> > > > > > the
> > > > > > > >> log
> > > > > > > >> > > > always
> > > > > > > >> > > > >> >> produce the same output in ZK. Basically, ZK is
> > > > > write-only
> > > > > > > for
> > > > > > > >> > us,
> > > > > > > >> > > > we do
> > > > > > > >> > > > >> >> not read it (with the big exception of broker
> > > > > registration
> > > > > > > >> > znodes)
> > > > > > > >> > > > and I
> > > > > > > >> > > > >> >> think that will greatly simplify things.
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >> So I think dual-write mode as described here
> will
> > be
> > > > > > > >> > substantially
> > > > > > > >> > > > >> simpler
> > > > > > > >> > > > >> >> than trying to run part or all of the old
> > controller
> > > > in
> > > > > > > >> > parallel. I
> > > > > > > >> > > > do
> > > > > > > >> > > > >> >> think we will reuse a bunch of the
> serialization /
> > > > > > > >> > deserialization
> > > > > > > >> > > > code
> > > > > > > >> > > > >> for
> > > > > > > >> > > > >> >> znodes and possibly the code for communicating
> > with
> > > > ZK.
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >> best,
> > > > > > > >> > > > >> >> Colin
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > 14. Good point and make sense.
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > Thanks,
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > Jun
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <
> > > > > > > >> > cmcc...@apache.org
> > > > > > > >> > > >
> > > > > > > >> > > > >> wrote:
> > > > > > > >> > > > >> >> >
> > > > > > > >> > > > >> >> >> Hi Jun,
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> Thanks for taking a look. I can answer some
> > > > questions
> > > > > > > here
> > > > > > > >> > > > because I
> > > > > > > >> > > > >> >> >> collaborated on this a bit, and David is on
> > > > vacation
> > > > > > for
> > > > > > > a
> > > > > > > >> few
> > > > > > > >> > > > days.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao
> wrote:
> > > > > > > >> > > > >> >> >> > Hi, David,
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > Thanks for the KIP. A few comments below.
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > 10. It's still not very clear to me how the
> > > KRaft
> > > > > > > >> controller
> > > > > > > >> > > > works
> > > > > > > >> > > > >> in
> > > > > > > >> > > > >> >> the
> > > > > > > >> > > > >> >> >> > dual writes mode to KRaft log and ZK when
> the
> > > > > brokers
> > > > > > > >> still
> > > > > > > >> > > run
> > > > > > > >> > > > in
> > > > > > > >> > > > >> ZK
> > > > > > > >> > > > >> >> >> mode.
> > > > > > > >> > > > >> >> >> > Does the KRaft controller run a ZK based
> > > > controller
> > > > > > in
> > > > > > > >> > > parallel
> > > > > > > >> > > > or
> > > > > > > >> > > > >> do
> > > > > > > >> > > > >> >> we
> > > > > > > >> > > > >> >> >> > derive what needs to be written to ZK based
> > on
> > > > > KRaft
> > > > > > > >> > > controller
> > > > > > > >> > > > >> logic?
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> We derive what needs to be written to ZK
> based
> > on
> > > > > KRaft
> > > > > > > >> > > controller
> > > > > > > >> > > > >> >> logic.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> > I am also not sure how the KRaft controller
> > > > handles
> > > > > > > >> broker
> > > > > > > >> > > > >> >> >> > registration/deregistration, since brokers
> > are
> > > > > still
> > > > > > > >> running
> > > > > > > >> > > in
> > > > > > > >> > > > ZK
> > > > > > > >> > > > >> >> mode
> > > > > > > >> > > > >> >> >> and
> > > > > > > >> > > > >> >> >> > are not heartbeating to the KRaft
> controller.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> The new controller will listen for broker
> > > > > registrations
> > > > > > > >> under
> > > > > > > >> > > > >> /brokers.
> > > > > > > >> > > > >> >> >> This is the only znode watch that the new
> > > > controller
> > > > > > will
> > > > > > > >> do.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> We did consider changing how ZK-based broker
> > > > > > registration
> > > > > > > >> > > worked,
> > > > > > > >> > > > >> but it
> > > > > > > >> > > > >> >> >> just ended up being too much work for not
> > enough
> > > > > gain.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > 12. "A new set of nodes will be provisioned
> > to
> > > > host
> > > > > > the
> > > > > > > >> > > > controller
> > > > > > > >> > > > >> >> >> quorum."
> > > > > > > >> > > > >> >> >> > I guess we don't support starting the KRaft
> > > > > > controller
> > > > > > > >> > quorum
> > > > > > > >> > > on
> > > > > > > >> > > > >> >> existing
> > > > > > > >> > > > >> >> >> > brokers. It would be useful to make that
> > clear.
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> Agreed
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> > 13. "Once the quorum is established and a
> > > leader
> > > > is
> > > > > > > >> elected,
> > > > > > > >> > > the
> > > > > > > >> > > > >> >> >> controller
> > > > > > > >> > > > >> >> >> > will check the state of the cluster using
> the
> > > > > > > >> MigrationCheck
> > > > > > > >> > > > RPC."
> > > > > > > >> > > > >> How
> > > > > > > >> > > > >> >> >> does
> > > > > > > >> > > > >> >> >> > the quorum controller detect other brokers?
> > > Does
> > > > > the
> > > > > > > >> > > controller
> > > > > > > >> > > > >> node
> > > > > > > >> > > > >> >> need
> > > > > > > >> > > > >> >> >> > to be configured with ZK connection string?
> > If
> > > > so,
> > > > > it
> > > > > > > >> would
> > > > > > > >> > be
> > > > > > > >> > > > >> useful
> > > > > > > >> > > > >> >> to
> > > > > > > >> > > > >> >> >> > document the additional configs that the
> > quorum
> > > > > > > >> controller
> > > > > > > >> > > > needs to
> > > > > > > >> > > > >> >> set.
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> Yes, the controllers monitor ZK for broker
> > > > > > registrations,
> > > > > > > >> as I
> > > > > > > >> > > > >> mentioned
> > > > > > > >> > > > >> >> >> above. So they need zk.connect and the other
> ZK
> > > > > > > connection
> > > > > > > >> > > > >> >> configurations.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> > 14. "In order to prevent further writes to
> > ZK,
> > > > the
> > > > > > > first
> > > > > > > >> > thing
> > > > > > > >> > > > the
> > > > > > > >> > > > >> new
> > > > > > > >> > > > >> >> >> > KRaft quorum must do is take over
> leadership
> > of
> > > > the
> > > > > > ZK
> > > > > > > >> > > > controller.
> > > > > > > >> > > > >> "
> > > > > > > >> > > > >> >> The
> > > > > > > >> > > > >> >> >> ZK
> > > > > > > >> > > > >> >> >> > controller processing changes to
> /controller
> > > > update
> > > > > > > >> > > > asynchronously.
> > > > > > > >> > > > >> >> How
> > > > > > > >> > > > >> >> >> > does the KRaft controller know when the ZK
> > > > > controller
> > > > > > > has
> > > > > > > >> > > > resigned
> > > > > > > >> > > > >> >> before
> > > > > > > >> > > > >> >> >> > it can safely copy the ZK data?
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> This should be done through
> > > > > > > >> expectedControllerEpochZkVersion,
> > > > > > > >> > > just
> > > > > > > >> > > > >> like
> > > > > > > >> > > > >> >> in
> > > > > > > >> > > > >> >> >> ZK mode, right? We should bump this epoch
> value
> > > so
> > > > > that
> > > > > > > any
> > > > > > > >> > > writes
> > > > > > > >> > > > >> from
> > > > > > > >> > > > >> >> the
> > > > > > > >> > > > >> >> >> old controller will not go through. I agree
> we
> > > > should
> > > > > > > spell
> > > > > > > >> > this
> > > > > > > >> > > > out
> > > > > > > >> > > > >> in
> > > > > > > >> > > > >> >> the
> > > > > > > >> > > > >> >> >> KIP.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> > 15. We have the following sentences. One
> says
> > > > > > > >> ControllerId
> > > > > > > >> > is
> > > > > > > >> > > a
> > > > > > > >> > > > >> random
> > > > > > > >> > > > >> >> >> > KRaft broker and the other says it's the
> > active
> > > > > > > >> controller.
> > > > > > > >> > > > Which
> > > > > > > >> > > > >> one
> > > > > > > >> > > > >> >> is
> > > > > > > >> > > > >> >> >> > correct?
> > > > > > > >> > > > >> >> >> > "UpdateMetadata: for certain metadata
> > changes,
> > > > the
> > > > > > > KRaft
> > > > > > > >> > > > controller
> > > > > > > >> > > > >> >> will
> > > > > > > >> > > > >> >> >> > need to send UpdateMetadataRequests to the
> ZK
> > > > > > brokers.
> > > > > > > >> For
> > > > > > > >> > the
> > > > > > > >> > > > >> >> >> > “ControllerId” field in this request, the
> > > > > controller
> > > > > > > >> should
> > > > > > > >> > > > >> specify a
> > > > > > > >> > > > >> >> >> > random KRaft broker."
> > > > > > > >> > > > >> >> >> > "In the UpdateMetadataRequest sent by the
> > KRaft
> > > > > > > >> controller
> > > > > > > >> > to
> > > > > > > >> > > > the
> > > > > > > >> > > > >> ZK
> > > > > > > >> > > > >> >> >> > brokers, the ControllerId will point to the
> > > > active
> > > > > > > >> > controller
> > > > > > > >> > > > which
> > > > > > > >> > > > >> >> will
> > > > > > > >> > > > >> >> >> be
> > > > > > > >> > > > >> >> >> > used for the inter-broker requests."
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> Yeah, this seems like an error to me as
> well. A
> > > > > random
> > > > > > > >> value
> > > > > > > >> > is
> > > > > > > >> > > > not
> > > > > > > >> > > > >> >> really
> > > > > > > >> > > > >> >> >> useful. Plus the text here is
> > self-contradictory,
> > > > as
> > > > > > you
> > > > > > > >> > pointed
> > > > > > > >> > > > out.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> I suspect what we should do here is add a new
> > > > field,
> > > > > > > >> > > > >> KRaftControllerId,
> > > > > > > >> > > > >> >> >> and populate it with the real controller ID,
> > and
> > > > > leave
> > > > > > > the
> > > > > > > >> old
> > > > > > > >> > > > >> >> controllerId
> > > > > > > >> > > > >> >> >> field as -1. A ZK-based broker that sees this
> > can
> > > > > then
> > > > > > > >> consult
> > > > > > > >> > > its
> > > > > > > >> > > > >> >> >> controller.quorum.voters configuration to see
> > > where
> > > > > it
> > > > > > > >> should
> > > > > > > >> > > send
> > > > > > > >> > > > >> >> >> controller-bound RPCs. That (static)
> > > configuration
> > > > > lets
> > > > > > > us
> > > > > > > >> map
> > > > > > > >> > > > >> between
> > > > > > > >> > > > >> >> >> controller ID and host:port.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> We should still keep our existing epoch logic
> > for
> > > > > > > deciding
> > > > > > > >> > when
> > > > > > > >> > > > >> >> >> UpdateMetadataRequest / LeaderAndIsrRequests
> > are
> > > > > stale,
> > > > > > > >> with
> > > > > > > >> > the
> > > > > > > >> > > > >> caveat
> > > > > > > >> > > > >> >> >> that any kraft-based epoch should be treated
> as
> > > > > greater
> > > > > > > >> than
> > > > > > > >> > any
> > > > > > > >> > > > >> >> ZK-based
> > > > > > > >> > > > >> >> >> epoch. After all, the kraft epoch is coming
> > from
> > > > the
> > > > > > > epoch
> > > > > > > >> of
> > > > > > > >> > > > >> >> >> __cluster_metadata, whereas the ZK epoch
> comes
> > > from
> > > > > ZK.
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > 16. "Additionally, the controller must
> > specify
> > > > if a
> > > > > > > >> broker
> > > > > > > >> > in
> > > > > > > >> > > > >> >> >> “LiveBrokers”
> > > > > > > >> > > > >> >> >> > is KRaft or ZK." Does that require any
> > protocol
> > > > > > changes
> > > > > > > >> to
> > > > > > > >> > > > >> >> >> UpdateMetadata?
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> Yeah, I am also curious why the we need to
> care
> > > > > whether
> > > > > > > >> > brokers
> > > > > > > >> > > > are
> > > > > > > >> > > > >> ZK
> > > > > > > >> > > > >> >> or
> > > > > > > >> > > > >> >> >> KRaft in UpdateMetadataRequest. We don't
> reveal
> > > > this
> > > > > to
> > > > > > > >> > clients,
> > > > > > > >> > > > so
> > > > > > > >> > > > >> can
> > > > > > > >> > > > >> >> we
> > > > > > > >> > > > >> >> >> just leave this out?
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> best,
> > > > > > > >> > > > >> >> >> Colin
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >> >> > Thanks,
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > Jun
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael
> > Maison
> > > <
> > > > > > > >> > > > >> >> mickael.mai...@gmail.com
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> > wrote:
> > > > > > > >> > > > >> >> >> >
> > > > > > > >> > > > >> >> >> >> Hi David,
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> Thanks for starting this important KIP.
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> I've just taken a quick look so far but
> I've
> > > > got a
> > > > > > > >> couple
> > > > > > > >> > of
> > > > > > > >> > > > >> initial
> > > > > > > >> > > > >> >> >> >> questions:
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> 1) What happens if a non KRaft compatible
> > > broker
> > > > > (or
> > > > > > > >> with
> > > > > > > >> > > > >> >> >> >> kafka.metadata.migration.enable set to
> > false)
> > > > > joins
> > > > > > > the
> > > > > > > >> > > cluster
> > > > > > > >> > > > >> after
> > > > > > > >> > > > >> >> >> >> the migration is triggered?
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> 2) In the Failure Modes section you
> mention
> > a
> > > > > > scenario
> > > > > > > >> > where
> > > > > > > >> > > a
> > > > > > > >> > > > >> write
> > > > > > > >> > > > >> >> >> >> to ZK fails. What happens when the
> > divergence
> > > > > limit
> > > > > > is
> > > > > > > >> > > > reached? Is
> > > > > > > >> > > > >> >> >> >> this a fatal condition? How much
> divergence
> > > > should
> > > > > > we
> > > > > > > >> > allow?
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> Thanks,
> > > > > > > >> > > > >> >> >> >> Mickael
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David
> > Arthur <
> > > > > > > >> > > mum...@gmail.com
> > > > > > > >> > > > >
> > > > > > > >> > > > >> >> wrote:
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >> > Hey folks, I wanted to get the ball
> > rolling
> > > on
> > > > > the
> > > > > > > >> > > discussion
> > > > > > > >> > > > >> for
> > > > > > > >> > > > >> >> the
> > > > > > > >> > > > >> >> >> >> > ZooKeeper migration KIP. This KIP
> details
> > > how
> > > > we
> > > > > > > plan
> > > > > > > >> to
> > > > > > > >> > do
> > > > > > > >> > > > an
> > > > > > > >> > > > >> >> online
> > > > > > > >> > > > >> >> >> >> > migration of metadata from ZooKeeper to
> > > KRaft
> > > > as
> > > > > > > well
> > > > > > > >> as
> > > > > > > >> > a
> > > > > > > >> > > > >> rolling
> > > > > > > >> > > > >> >> >> >> > upgrade of brokers to KRaft mode.
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >> > The general idea is to keep KRaft and
> > > > ZooKeeper
> > > > > in
> > > > > > > >> sync
> > > > > > > >> > > > during
> > > > > > > >> > > > >> the
> > > > > > > >> > > > >> >> >> >> > migration, so both types of brokers can
> > > exist
> > > > > > > >> > > simultaneously.
> > > > > > > >> > > > >> Then,
> > > > > > > >> > > > >> >> >> >> > once everything is migrated and updated,
> > we
> > > > can
> > > > > > turn
> > > > > > > >> off
> > > > > > > >> > > > >> ZooKeeper
> > > > > > > >> > > > >> >> >> >> > writes.
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >> > This is a pretty complex KIP, so please
> > > take a
> > > > > > look
> > > > > > > :)
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >>
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> > > > > > > >> > > > >> >> >> >> >
> > > > > > > >> > > > >> >> >> >> > Thanks!
> > > > > > > >> > > > >> >> >> >> > David
> > > > > > > >> > > > >> >> >> >>
> > > > > > > >> > > > >> >> >>
> > > > > > > >> > > > >> >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > --
> > > > > > > >> > > > > -David
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > --
> > > > > > > >> > > -David
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -David
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -David
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -David
> > > > >
> > > >
> > >
> > >
> > > --
> > > -David
> > >
> >
>
>
> --
> -David
>
