I've updated the KIP with the following changes (the Confluence diff is not
helpful here since I rearranged some things):

* Added ZooKeeperBlockingKRaftMillis metric
* Added section on new broker registration JSON
* Removed section on MigrationCheck RPC
* Added change to UpdateMetadataRequest
* Added section "Additional ZK Broker Configs" (includes configs to connect
to KRaft quorum)
* Added section on "Incompatible Brokers" under Failure Modes
* Clarified many things per this discussion thread
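
To make the "configs to connect to the KRaft quorum" bullet concrete, here is
a rough sketch of what a migrating ZK broker's config could carry. The key
names and values below are illustrative only; the authoritative list is in
the new "Additional ZK Broker Configs" section of the KIP.

    import java.util.Properties;

    public class ZkBrokerMigrationConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Existing ZK broker configuration stays as-is.
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181/kafka");
            // Opt the broker into the migration (flag name as used in this thread).
            props.put("kafka.metadata.migration.enable", "true");
            // Connectivity to the KRaft controller quorum.
            props.put("controller.quorum.voters",
                "3000@controller1:9093,3001@controller2:9093,3002@controller3:9093");
            props.put("controller.listener.names", "CONTROLLER");
            System.out.println(props);
        }
    }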

I realized we need the KRaft controller to pick the correct
"metadata.version" when initializing the migration. I included the IBP of a
broker in its registration data so the KRaft controller can verify it and
pick the correct "metadata.version" when starting the migration. Otherwise,
we could inadvertently downgrade the IBP/metadata.version as part of the
migration.

I also added "clusterId" to the broker registration data so the KRaft could
verify it.
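
To illustrate the intent (this is only a sketch; the names below are made up
and not part of the KIP), the check on the KRaft controller side before
starting the migration would look roughly like:

    import java.util.List;

    public class MigrationPreflightSketch {
        // Hypothetical view of one ZK broker registration.
        record ZkBrokerRegistration(int brokerId, String clusterId, String ibp) {}

        // Only start the migration if every registered ZK broker belongs to
        // this cluster and is on the expected IBP, so we never pick a
        // metadata.version that would amount to a downgrade.
        static boolean canStartMigration(String expectedClusterId,
                                         String expectedIbp,
                                         List<ZkBrokerRegistration> registrations) {
            for (ZkBrokerRegistration reg : registrations) {
                if (!expectedClusterId.equals(reg.clusterId())) return false;
                if (!expectedIbp.equals(reg.ibp())) return false;
            }
            return true;
        }
    }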

-David



On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
> > Hi, Colin,
> >
> > 10. "That all goes away in the new mode, and we just have some code which
> > analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2)
> > messages sent out to brokers."
> > Hmm, I am not sure it's that simple. Some of the complexity of the
> > ZK-based controller is (1) using the pipelining approach to write to ZK
> > for better throughput and using conditional writes for correctness; (2)
> > sending the proper LeaderAndIsr and UpdateMetadata requests. For example,
> > during controller failover the full metadata needs to be sent, while
> > during an individual broker failure only some of the metadata needs to be
> > updated. The controlled shutdown handling sometimes uses
> > StopReplicaRequest and other times uses LeaderAndIsrRequest. (3)
> > triggering new events based on the responses of LeaderAndIsr (e.g. for
> > topic deletion). Some of that complexity could be re-implemented in a
> > more efficient way, but we need to be really careful not to introduce
> > regressions. Some of the other complexity just won't go away.
> > Reimplementing all that logic for the 30 or so events in the ZK-based
> > controller is possible, but seems a bit daunting and risky.
>
> Hi Jun,
>
> Thanks for the response. I agree that there is some work here but I don't
> think it's as bad as it might seem. Most of the code for writing to ZK can
> be reused from the old controller. We should at least evaluate using the
> old ControllerChannelManager as well. I don't know if we'll end up doing it
> or not, but it's a possibility. (The main reason not to do it is that the
> response handling will be a bit different.)
>
> The question of what to do during a controller failover is an interesting
> one. Technically, we should be able to continue sending incremental updates
> to the legacy nodes, for the same reason we can in KRaft mode. However,
> probably we should just copy what ZK mode does and send them a full
> metadata update. That gives us an easy way to handle divergences caused
> by lost RPCs: changing the controller resends the full state (just as we
> do in ZK mode, unfortunately). I suppose we should document this in the
> KIP...
>
> I agree controlled shutdown is tricky to get just right. I suppose this is
> a case where the RPCs we send out are not purely "fire and forget"; we have
> to listen for the response. But that can be done in an event-based way.
> Controlled shutdown is also probably the last thing we'll implement once we
> have the basic lift and shift.
>
> Topic deletion is another annoying thing. I wonder if we could just delete
> immediately here and skip the complexity of implementing "deleting state."
> Topic IDs will exist in IBP 3.4, in both ZK and KRaft mode, so it should be
> OK, right?
>
> best,
> Colin
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
> >> > Hi, Colin,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 10. This is a bit on the implementation side. If you look at the
> >> > existing ZK-based controller, most of the logic is around maintaining
> >> > an in-memory state of all the resources (broker, topic, partition,
> >> > etc), reading/writing to ZK, sending LeaderAndIsr and UpdateMetadata
> >> > requests to brokers and handling the responses. So we need all that
> >> > logic in the dual write mode. One option is to duplicate all that
> >> > logic in some new code. This can be a bit error prone and makes the
> >> > code a bit harder to maintain if we need to fix some critical issues
> >> > in the ZK-based controller. Another option is to try reusing the
> >> > existing code in the ZK-based controller. For example, we could start
> >> > the EventManager in the ZK-based controller, but let the KRaft
> >> > controller ingest new events. This has its own challenges: (1) the
> >> > existing logic only logs ZK failures and doesn't expose them to the
> >> > caller; (2) the existing logic may add new events to the queue itself
> >> > and we probably need to think through how this is coordinated with the
> >> > KRaft controller; (3) it registers some ZK listeners unnecessarily
> >> > (may not be a big concern). So we need to get around those issues
> >> > somehow. I am wondering if we have considered both options and which
> >> > approach we are leaning towards for the implementation.
> >> >
> >>
> >> Yes, this is a good question. My take is that a big part of the
> >> complexity in the old controller code results from the fact that we use
> >> ZK as a multi-writer database for propagating information between
> >> different components. So in the old controller, every write to ZK needs
> >> to be structured as a compare and swap to be fully correct. Every time
> >> we get notified about something, it's usually in the form of "this znode
> >> changed", which prompts a full reload of part of the data in ZK (which
> >> itself has multiple parts: loading, deserializing, reconciling, etc.)
> >> That all goes away in the new mode, and we just have some code which
> >> analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2)
> >> messages sent out to brokers.
> >>
> >> This is pretty decoupled from the other logic in QuorumController and
> >> should be easy to unit test, since the same inputs from the log always
> >> produce the same output in ZK. Basically, ZK is write-only for us; we do
> >> not read it (with the big exception of broker registration znodes), and
> >> I think that will greatly simplify things.
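> >>
> >> Just to sketch the shape of what I mean (the names here are
> >> illustrative, not actual classes from the KIP or the code base), the
> >> dual-write piece is basically a pure function from committed metadata
> >> records to ZK writes, which is why it should be easy to unit test:
> >>
> >>   import java.util.List;
> >>
> >>   // Hypothetical types, just for the sketch.
> >>   record MetadataRecord(String type, byte[] payload) {}
> >>   record ZkWriteOp(String path, byte[] data) {}
> >>
> >>   interface MetadataChangeReflector {
> >>     // The same committed records always produce the same ZK writes.
> >>     List<ZkWriteOp> toZkWrites(List<MetadataRecord> committedRecords);
> >>   }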
> >>
> >> So I think dual-write mode as described here will be substantially
> >> simpler than trying to run part or all of the old controller in
> >> parallel. I do think we will reuse a bunch of the serialization /
> >> deserialization code for znodes and possibly the code for communicating
> >> with ZK.
> >>
> >> best,
> >> Colin
> >>
> >>
> >> >
> >> > 14. Good point, makes sense.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org>
> >> > wrote:
> >> >
> >> >> Hi Jun,
> >> >>
> >> >> Thanks for taking a look. I can answer some questions here because I
> >> >> collaborated on this a bit, and David is on vacation for a few days.
> >> >>
> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
> >> >> > Hi, David,
> >> >> >
> >> >> > Thanks for the KIP. A few comments below.
> >> >> >
> >> >> > 10. It's still not very clear to me how the KRaft controller works
> >> >> > in the dual writes mode to KRaft log and ZK when the brokers still
> >> >> > run in ZK mode. Does the KRaft controller run a ZK-based controller
> >> >> > in parallel or do we derive what needs to be written to ZK based on
> >> >> > KRaft controller logic?
> >> >>
> >> >> We derive what needs to be written to ZK based on KRaft controller
> >> >> logic.
> >> >>
> >> >> > I am also not sure how the KRaft controller handles broker
> >> >> > registration/deregistration, since brokers are still running in ZK
> >> >> > mode and are not heartbeating to the KRaft controller.
> >> >>
> >> >> The new controller will listen for broker registrations under
> >> >> /brokers. This is the only znode watch that the new controller will
> >> >> do.
> >> >>
> >> >> We did consider changing how ZK-based broker registration worked, but
> >> >> it just ended up being too much work for not enough gain.
> >> >>
> >> >> >
> >> >> > 12. "A new set of nodes will be provisioned to host the controller
> >> >> quorum."
> >> >> > I guess we don't support starting the KRaft controller quorum on
> >> existing
> >> >> > brokers. It would be useful to make that clear.
> >> >> >
> >> >>
> >> >> Agreed
> >> >>
> >> >> > 13. "Once the quorum is established and a leader is elected, the
> >> >> controller
> >> >> > will check the state of the cluster using the MigrationCheck RPC."
> How
> >> >> does
> >> >> > the quorum controller detect other brokers? Does the controller
> node
> >> need
> >> >> > to be configured with ZK connection string? If so, it would be
> useful
> >> to
> >> >> > document the additional configs that the quorum controller needs to
> >> set.
> >> >> >
> >> >>
> >> >> Yes, the controllers monitor ZK for broker registrations, as I
> >> >> mentioned above. So they need zookeeper.connect and the other ZK
> >> >> connection configurations.
> >> >>
> >> >> > 14. "In order to prevent further writes to ZK, the first thing the
> new
> >> >> > KRaft quorum must do is take over leadership of the ZK controller.
> "
> >> The
> >> >> ZK
> >> >> > controller processing changes to /controller update asynchronously.
> >> How
> >> >> > does the KRaft controller know when the ZK controller has resigned
> >> before
> >> >> > it can safely copy the ZK data?
> >> >> >
> >> >>
> >> >> This should be done through expectedControllerEpochZkVersion, just
> >> >> like in ZK mode, right? We should bump this epoch value so that any
> >> >> writes from the old controller will not go through. I agree we should
> >> >> spell this out in the KIP.
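> >> >>
> >> >> Concretely, something like this (a sketch, not the actual
> >> >> KafkaZkClient code): every ZK write on the migration path is fenced
> >> >> by a check on the /controller_epoch znode version, so a write from a
> >> >> deposed controller fails.
> >> >>
> >> >>   import java.util.Arrays;
> >> >>   import org.apache.zookeeper.KeeperException;
> >> >>   import org.apache.zookeeper.Op;
> >> >>   import org.apache.zookeeper.ZooKeeper;
> >> >>
> >> >>   class FencedZkWriteSketch {
> >> >>     // The setData only succeeds if nobody has bumped the controller
> >> >>     // epoch since we claimed leadership.
> >> >>     static void fencedSetData(ZooKeeper zk, int expectedEpochZkVersion,
> >> >>                               String path, byte[] data, int znodeVersion)
> >> >>         throws KeeperException, InterruptedException {
> >> >>       zk.multi(Arrays.asList(
> >> >>           Op.check("/controller_epoch", expectedEpochZkVersion),
> >> >>           Op.setData(path, data, znodeVersion)));
> >> >>     }
> >> >>   }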
> >> >>
> >> >> > 15. We have the following sentences. One says ControllerId is a
> >> >> > random KRaft broker and the other says it's the active controller.
> >> >> > Which one is correct?
> >> >> > "UpdateMetadata: for certain metadata changes, the KRaft controller
> >> >> > will need to send UpdateMetadataRequests to the ZK brokers. For the
> >> >> > “ControllerId” field in this request, the controller should specify
> >> >> > a random KRaft broker."
> >> >> > "In the UpdateMetadataRequest sent by the KRaft controller to the ZK
> >> >> > brokers, the ControllerId will point to the active controller which
> >> >> > will be used for the inter-broker requests."
> >> >> >
> >> >>
> >> >> Yeah, this seems like an error to me as well. A random value is not
> >> >> really useful. Plus the text here is self-contradictory, as you
> >> >> pointed out.
> >> >>
> >> >> I suspect what we should do here is add a new field,
> >> >> KRaftControllerId, and populate it with the real controller ID, and
> >> >> leave the old controllerId field as -1. A ZK-based broker that sees
> >> >> this can then consult its controller.quorum.voters configuration to
> >> >> see where it should send controller-bound RPCs. That (static)
> >> >> configuration lets us map between controller ID and host:port.
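> >> >>
> >> >> As a sketch of the broker side (the KRaftControllerId field and this
> >> >> helper are hypothetical until the KIP pins down the schema change),
> >> >> resolving the controller endpoint is just a parse of that config,
> >> >> which has the form "id@host:port,id@host:port,...":
> >> >>
> >> >>   import java.util.HashMap;
> >> >>   import java.util.Map;
> >> >>
> >> >>   class VoterEndpointsSketch {
> >> >>     // Map voter id -> "host:port" from controller.quorum.voters.
> >> >>     static Map<Integer, String> parse(String voters) {
> >> >>       Map<Integer, String> endpoints = new HashMap<>();
> >> >>       for (String voter : voters.split(",")) {
> >> >>         String[] parts = voter.split("@", 2);
> >> >>         endpoints.put(Integer.parseInt(parts[0].trim()), parts[1].trim());
> >> >>       }
> >> >>       return endpoints;
> >> >>     }
> >> >>   }
> >> >>
> >> >>   // e.g. parse("3000@ctrl1:9093,3001@ctrl2:9093").get(kraftControllerId)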
> >> >>
> >> >> We should still keep our existing epoch logic for deciding when
> >> >> UpdateMetadataRequest / LeaderAndIsrRequests are stale, with the
> >> >> caveat that any KRaft-based epoch should be treated as greater than
> >> >> any ZK-based epoch. After all, the KRaft epoch is coming from the
> >> >> epoch of __cluster_metadata, whereas the ZK epoch comes from ZK.
> >> >>
> >> >> >
> >> >> > 16. "Additionally, the controller must specify if a broker in
> >> >> “LiveBrokers”
> >> >> > is KRaft or ZK." Does that require any protocol changes to
> >> >> UpdateMetadata?
> >> >> >
> >> >>
> >> >> Yeah, I am also curious why we need to care whether brokers are ZK or
> >> >> KRaft in UpdateMetadataRequest. We don't reveal this to clients, so
> >> >> can we just leave this out?
> >> >>
> >> >> best,
> >> >> Colin
> >> >>
> >> >> > Thanks,
> >> >> >
> >> >> > Jun
> >> >> >
> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison
> >> >> > <mickael.mai...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi David,
> >> >> >>
> >> >> >> Thanks for starting this important KIP.
> >> >> >>
> >> >> >> I've just taken a quick look so far but I've got a couple of
> >> >> >> initial questions:
> >> >> >>
> >> >> >> 1) What happens if a non-KRaft-compatible broker (or one with
> >> >> >> kafka.metadata.migration.enable set to false) joins the cluster
> >> >> >> after the migration is triggered?
> >> >> >>
> >> >> >> 2) In the Failure Modes section you mention a scenario where a
> >> >> >> write to ZK fails. What happens when the divergence limit is
> >> >> >> reached? Is this a fatal condition? How much divergence should we
> >> >> >> allow?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Mickael
> >> >> >>
> >> >> >> > On Wed, Oct 5, 2022 at 12:20 AM David Arthur <mum...@gmail.com>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> > Hey folks, I wanted to get the ball rolling on the discussion
> >> >> >> > for the ZooKeeper migration KIP. This KIP details how we plan to
> >> >> >> > do an online migration of metadata from ZooKeeper to KRaft as
> >> >> >> > well as a rolling upgrade of brokers to KRaft mode.
> >> >> >> >
> >> >> >> > The general idea is to keep KRaft and ZooKeeper in sync during
> >> >> >> > the migration, so both types of brokers can exist simultaneously.
> >> >> >> > Then, once everything is migrated and updated, we can turn off
> >> >> >> > ZooKeeper writes.
> >> >> >> >
> >> >> >> > This is a pretty complex KIP, so please take a look :)
> >> >> >> >
> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
> >> >> >> >
> >> >> >> > Thanks!
> >> >> >> > David
> >> >> >>
> >> >>
> >>
>


-- 
-David
