Hi David,

Thanks for the KIP, this is very exciting!

How does JBOD relate to this work? KRaft mode doesn't yet support configuring 
brokers with multiple log directories. If the brokers in the existing cluster 
are configured with multiple log dirs, does the migration imply that the 
existing brokers need to drop use of that feature? Or is there some way to 
upgrade them later?

Thanks,

--
Igor

On Mon, Oct 17, 2022, at 10:07 PM, David Arthur wrote:
> I've updated the KIP with the following changes (the Confluence diff is not
> helpful here since I rearranged some things):
>
> * Added ZooKeeperBlockingKRaftMillis metric
> * Added section on new broker registration JSON
> * Removed section on MigrationCheck RPC
> * Added change to UpdateMetadataRequest
> * Added section "Additional ZK Broker Configs" (includes configs to connect
> to KRaft quorum)
> * Added section on "Incompatible Brokers" under Failure Modes
> * Clarified many things per this discussion thread
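>
> To give a concrete feel for the new ZK broker configs, they would look
> something along these lines (names are illustrative and may differ from the
> final KIP text; the migration flag is the one discussed later in this thread):
>
>     # Existing ZK connection config stays in place
>     zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
>     # Opt the ZK broker into the migration
>     kafka.metadata.migration.enable=true
>     # How the ZK broker locates the KRaft controller quorum
>     controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
>     controller.listener.names=CONTROLLER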
>
> I realized we need the KRaft controller to pick the correct
> "metadata.version" when initializing the migration. I included the IBP of a
> broker in its registration data so the KRaft controller can verify the IBP
> and pick the correct "metadata.version" when starting the migration.
> Otherwise, we could inadvertently downgrade the IBP/metadata.version as
> part of the migration.
>
> I also added "clusterId" to the broker registration data so the KRaft could
> verify it.
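>
> As a rough illustration only (field names and the version bump here are made
> up; the KIP has the actual JSON), a ZK broker's registration data would gain
> something like:
>
>     {
>       "version": 6,
>       "host": "broker-0.example.com",
>       "port": 9092,
>       "clusterId": "S1SgPPXpTYWbGiVdMdQkog",
>       "inter.broker.protocol.version": "3.3"
>     }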
>
> -David
>
>
>
> On Mon, Oct 17, 2022 at 12:14 PM Colin McCabe <cmcc...@apache.org> wrote:
>
>> On Fri, Oct 14, 2022, at 11:39, Jun Rao wrote:
>> > Hi, Colin,
>> >
>> > 10. "That all goes away in the new mode, and we just have some code which
>> > analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2)
>> > messages sent out to brokers."
>> > Hmm, I am not sure it's that simple. Some of the complexity of the
>> ZK-based
>> > controller are (1) using the pipelining approach to write to ZK for
>> better
>> > throughput and using conditional writes for correctness; (2) sending the
>> > proper LeaderAndIsr and UpdateMetadata requests. For example, during
>> > controller failover, the full metadata needs to be sent while during
>> > individual broker failure, only some of the metadata needs to be updated.
>> > The controlled shutdown handling sometimes uses StopReplicaRequest  and
>> > some other times uses LeaderAndIsrRequest. (3) triggering new events
>> based
>> > on the responses of LeaderAndIsr (e.g. for topic deletion). Some of those
>> > complexity could be re-implemented in a more efficient way, but we need
>> to
>> > be really careful not to generate regression. Some of the other
>> complexity
>> > just won't go away. Reimplementing all those logic for the 30 or so
>> events
>> > in the ZK-based controller is possible, but seems a bit daunting and
>> risky.
>>
>> Hi Jun,
>>
>> Thanks for the response. I agree that there is some work here, but I don't
>> think it's as bad as it might seem. Most of the code for writing to ZK can
>> be reused from the old controller. We should at least evaluate using the
>> old ControllerChannelManager as well. I don't know if we'll end up doing it
>> or not, but it's a possibility. (The main reason not to do it is that the
>> response handling will be a bit different.)
>>
>> The question of what to do during a controller failover is an interesting
>> one. Technically, we should be able to continue sending incremental updates
>> to the legacy nodes, for the same reason we can in KRaft mode. However,
>> probably we should just copy what ZK mode does and send them a full
>> metadata update. This will allow us to have an easy way to handle
>> divergences caused by lost RPCs by changing the controller (just as we do
>> in ZK mode, unfortunately). I suppose we should document this in the KIP...
>>
>> I agree controlled shutdown is tricky to get just right. I suppose this is
>> a case where the RPCs we send out are not purely "fire and forget"; we have
>> to listen for the response. But that can be done in an event-based way.
>> Controlled shutdown is also probably the last thing we'll implement once we
>> have the basic lift and shift.
>>
>> Topic deletion is another annoying thing. I wonder if we could just delete
>> immediately here and skip the complexity of implementing "deleting state."
>> Topic IDs will exist in IBP 3.4, in both ZK and KRaft mode, so it should be
>> OK, right?
>>
>> best,
>> Colin
>>
>>
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Fri, Oct 14, 2022 at 9:29 AM Colin McCabe <cmcc...@apache.org> wrote:
>> >
>> >> On Thu, Oct 13, 2022, at 11:44, Jun Rao wrote:
>> >> > Hi, Colin,
>> >> >
>> >> > Thanks for the reply.
>> >> >
>> >> > 10. This is a bit on the implementation side. If you look at the
>> >> > existing ZK-based controller, most of the logic is around maintaining
>> >> > an in-memory state of all the resources (broker, topic, partition,
>> >> > etc.), reading/writing to ZK, sending LeaderAndIsr and UpdateMetadata
>> >> > requests to brokers, and handling the responses. So we need all that
>> >> > logic in the dual-write mode. One option is to duplicate all that
>> >> > logic in some new code. This can be a bit error-prone and makes the
>> >> > code a bit harder to maintain if we need to fix some critical issues
>> >> > in the ZK-based controller. Another option is to try reusing the
>> >> > existing code in the ZK-based controller. For example, we could start
>> >> > the EventManager in the ZK-based controller, but let the KRaft
>> >> > controller ingest new events. This has its own challenges: (1) the
>> >> > existing logic only logs ZK failures and doesn't expose them to the
>> >> > caller; (2) the existing logic may add new events to the queue itself
>> >> > and we probably need to think through how this is coordinated with the
>> >> > KRaft controller; (3) it registers some ZK listeners unnecessarily
>> >> > (may not be a big concern). So we need to get around those issues
>> >> > somehow. I am wondering if we have considered both options and which
>> >> > approach we are leaning towards for the implementation.
>> >> >
>> >>
>> >> Yes, this is a good question. My take is that a big part of the
>> >> complexity in the old controller code results from the fact that we use
>> >> ZK as a multi-writer database for propagating information between
>> >> different components. So in the old controller, every write to ZK needs
>> >> to be structured as a compare and swap to be fully correct. Every time
>> >> we get notified about something, it's usually in the form of "this znode
>> >> changed", which prompts a full reload of part of the data in ZK (which
>> >> itself has multiple parts: loading, deserializing, reconciling, etc.)
>> >> That all goes away in the new mode, and we just have some code which
>> >> analyzes __cluster_metadata and reflects it in 1) updates to ZK and 2)
>> >> messages sent out to brokers.
>> >>
>> >> This is pretty decoupled from the other logic in QuorumController and
>> >> should be easy to unit test, since the same inputs from the log always
>> >> produce the same output in ZK. Basically, ZK is write-only for us, we do
>> >> not read it (with the big exception of broker registration znodes) and I
>> >> think that will greatly simplify things.
>> >>
>> >> So I think dual-write mode as described here will be substantially
>> >> simpler than trying to run part or all of the old controller in
>> >> parallel. I do think we will reuse a bunch of the serialization /
>> >> deserialization code for znodes and possibly the code for communicating
>> >> with ZK.
>> >>
>> >> best,
>> >> Colin
>> >>
>> >>
>> >> >
>> >> > 14. Good point, that makes sense.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Jun
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Oct 12, 2022 at 3:27 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >> >
>> >> >> Hi Jun,
>> >> >>
>> >> >> Thanks for taking a look. I can answer some questions here because I
>> >> >> collaborated on this a bit, and David is on vacation for a few days.
>> >> >>
>> >> >> On Wed, Oct 12, 2022, at 14:41, Jun Rao wrote:
>> >> >> > Hi, David,
>> >> >> >
>> >> >> > Thanks for the KIP. A few comments below.
>> >> >> >
>> >> >> > 10. It's still not very clear to me how the KRaft controller works
>> >> >> > in the dual-write mode to the KRaft log and ZK when the brokers
>> >> >> > still run in ZK mode. Does the KRaft controller run a ZK-based
>> >> >> > controller in parallel, or do we derive what needs to be written to
>> >> >> > ZK based on KRaft controller logic?
>> >> >>
>> >> >> We derive what needs to be written to ZK based on KRaft controller
>> >> >> logic.
>> >> >>
>> >> >> > I am also not sure how the KRaft controller handles broker
>> >> >> > registration/deregistration, since brokers are still running in ZK
>> >> >> > mode and are not heartbeating to the KRaft controller.
>> >> >>
>> >> >> The new controller will listen for broker registrations under
>> >> >> /brokers. This is the only znode watch that the new controller will
>> >> >> do.
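>> >> >>
>> >> >> Just to sketch the shape of that single watch (illustrative only, not
>> >> >> actual KIP code; the real thing would plug into the controller's event
>> >> >> queue and handle errors and re-watching properly):
>> >> >>
>> >> >>     import org.apache.zookeeper.WatchedEvent;
>> >> >>     import org.apache.zookeeper.Watcher;
>> >> >>     import org.apache.zookeeper.ZooKeeper;
>> >> >>     import java.util.List;
>> >> >>
>> >> >>     public class ZkBrokerRegistrationWatcher implements Watcher {
>> >> >>         private final ZooKeeper zk;
>> >> >>
>> >> >>         public ZkBrokerRegistrationWatcher(ZooKeeper zk) {
>> >> >>             this.zk = zk;
>> >> >>         }
>> >> >>
>> >> >>         // Reads the current broker ids and re-arms the child watch.
>> >> >>         public List<String> currentBrokerIds() throws Exception {
>> >> >>             return zk.getChildren("/brokers/ids", this);
>> >> >>         }
>> >> >>
>> >> >>         @Override
>> >> >>         public void process(WatchedEvent event) {
>> >> >>             if (event.getType() == Event.EventType.NodeChildrenChanged) {
>> >> >>                 // Enqueue an event for the KRaft controller to reload
>> >> >>                 // the ZK broker registrations.
>> >> >>             }
>> >> >>         }
>> >> >>     }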
>> >> >>
>> >> >> We did consider changing how ZK-based broker registration worked, but
>> >> >> it just ended up being too much work for not enough gain.
>> >> >>
>> >> >> >
>> >> >> > 12. "A new set of nodes will be provisioned to host the controller
>> >> >> quorum."
>> >> >> > I guess we don't support starting the KRaft controller quorum on
>> >> existing
>> >> >> > brokers. It would be useful to make that clear.
>> >> >> >
>> >> >>
>> >> >> Agreed
>> >> >>
>> >> >> > 13. "Once the quorum is established and a leader is elected, the
>> >> >> controller
>> >> >> > will check the state of the cluster using the MigrationCheck RPC."
>> How
>> >> >> does
>> >> >> > the quorum controller detect other brokers? Does the controller
>> node
>> >> need
>> >> >> > to be configured with ZK connection string? If so, it would be
>> useful
>> >> to
>> >> >> > document the additional configs that the quorum controller needs to
>> >> set.
>> >> >> >
>> >> >>
>> >> >> Yes, the controllers monitor ZK for broker registrations, as I
>> >> >> mentioned above. So they need zookeeper.connect and the other ZK
>> >> >> connection configurations.
>> >> >>
>> >> >> > 14. "In order to prevent further writes to ZK, the first thing the
>> new
>> >> >> > KRaft quorum must do is take over leadership of the ZK controller.
>> "
>> >> The
>> >> >> ZK
>> >> >> > controller processing changes to /controller update asynchronously.
>> >> How
>> >> >> > does the KRaft controller know when the ZK controller has resigned
>> >> before
>> >> >> > it can safely copy the ZK data?
>> >> >> >
>> >> >>
>> >> >> This should be done through expectedControllerEpochZkVersion, just
>> >> >> like in ZK mode, right? We should bump this epoch value so that any
>> >> >> writes from the old controller will not go through. I agree we should
>> >> >> spell this out in the KIP.
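>> >> >>
>> >> >> To make that concrete, the fencing is the usual conditional-write
>> >> >> pattern (just a sketch of the idea, not actual KIP or Kafka code):
>> >> >> every ZK write from the KRaft controller is bundled with a check on
>> >> >> the /controller_epoch zkVersion it expects, so a write racing with a
>> >> >> resurrected old controller fails instead of going through.
>> >> >>
>> >> >>     import org.apache.zookeeper.Op;
>> >> >>     import org.apache.zookeeper.ZooKeeper;
>> >> >>     import java.util.Arrays;
>> >> >>
>> >> >>     static void fencedSetData(ZooKeeper zk, String path, byte[] data,
>> >> >>                               int expectedControllerEpochZkVersion)
>> >> >>             throws Exception {
>> >> >>         zk.multi(Arrays.asList(
>> >> >>             // Fails the whole transaction if the epoch znode changed.
>> >> >>             Op.check("/controller_epoch", expectedControllerEpochZkVersion),
>> >> >>             // -1 means we don't care about the target znode's own version.
>> >> >>             Op.setData(path, data, -1)));
>> >> >>     }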
>> >> >>
>> >> >> > 15. We have the following sentences. One says ControllerId is a
>> >> >> > random KRaft broker and the other says it's the active controller.
>> >> >> > Which one is correct?
>> >> >> > "UpdateMetadata: for certain metadata changes, the KRaft controller
>> >> >> > will need to send UpdateMetadataRequests to the ZK brokers. For the
>> >> >> > “ControllerId” field in this request, the controller should specify
>> >> >> > a random KRaft broker."
>> >> >> > "In the UpdateMetadataRequest sent by the KRaft controller to the ZK
>> >> >> > brokers, the ControllerId will point to the active controller which
>> >> >> > will be used for the inter-broker requests."
>> >> >> >
>> >> >>
>> >> >> Yeah, this seems like an error to me as well. A random value is not
>> >> >> really useful. Plus the text here is self-contradictory, as you
>> >> >> pointed out.
>> >> >>
>> >> >> I suspect what we should do here is add a new field,
>> >> >> KRaftControllerId, populate it with the real controller ID, and leave
>> >> >> the old controllerId field as -1. A ZK-based broker that sees this can
>> >> >> then consult its controller.quorum.voters configuration to see where
>> >> >> it should send controller-bound RPCs. That (static) configuration lets
>> >> >> us map between controller ID and host:port.
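>> >> >>
>> >> >> Roughly the lookup the ZK broker would do (an illustrative sketch
>> >> >> only, not proposed code; it assumes the usual "id@host:port,..."
>> >> >> format of controller.quorum.voters):
>> >> >>
>> >> >>     // Returns the "host:port" for the given KRaftControllerId, or
>> >> >>     // throws if the id is not in controller.quorum.voters.
>> >> >>     static String resolveKRaftController(String quorumVoters,
>> >> >>                                          int kraftControllerId) {
>> >> >>         for (String voter : quorumVoters.split(",")) {
>> >> >>             String[] idAndEndpoint = voter.trim().split("@", 2);
>> >> >>             if (Integer.parseInt(idAndEndpoint[0]) == kraftControllerId) {
>> >> >>                 return idAndEndpoint[1];   // e.g. "controller1:9093"
>> >> >>             }
>> >> >>         }
>> >> >>         throw new IllegalArgumentException(
>> >> >>             "Unknown KRaft controller id " + kraftControllerId);
>> >> >>     }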
>> >> >>
>> >> >> We should still keep our existing epoch logic for deciding when
>> >> >> UpdateMetadataRequests / LeaderAndIsrRequests are stale, with the
>> >> >> caveat that any KRaft-based epoch should be treated as greater than
>> >> >> any ZK-based epoch. After all, the KRaft epoch is coming from the
>> >> >> epoch of __cluster_metadata, whereas the ZK epoch comes from ZK.
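>> >> >>
>> >> >> In other words, something like the following staleness check on the
>> >> >> broker side (just a sketch of the rule above; it assumes the request
>> >> >> tells us whether its epoch came from KRaft or from ZK):
>> >> >>
>> >> >>     // A request is stale if its controller epoch "loses" to the one
>> >> >>     // we already have. Any KRaft epoch beats any ZK epoch.
>> >> >>     static boolean isStale(boolean requestIsKRaft, int requestEpoch,
>> >> >>                            boolean currentIsKRaft, int currentEpoch) {
>> >> >>         if (requestIsKRaft != currentIsKRaft) {
>> >> >>             return !requestIsKRaft;  // stale only if the request is ZK-based
>> >> >>         }
>> >> >>         return requestEpoch < currentEpoch;
>> >> >>     }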
>> >> >>
>> >> >> >
>> >> >> > 16. "Additionally, the controller must specify if a broker in
>> >> >> “LiveBrokers”
>> >> >> > is KRaft or ZK." Does that require any protocol changes to
>> >> >> UpdateMetadata?
>> >> >> >
>> >> >>
>> >> >> Yeah, I am also curious why we need to care whether brokers are ZK or
>> >> >> KRaft in UpdateMetadataRequest. We don't reveal this to clients, so
>> >> >> can we just leave this out?
>> >> >>
>> >> >> best,
>> >> >> Colin
>> >> >>
>> >> >> > Thanks,
>> >> >> >
>> >> >> > Jun
>> >> >> >
>> >> >> > On Wed, Oct 5, 2022 at 10:07 AM Mickael Maison
>> >> >> > <mickael.mai...@gmail.com> wrote:
>> >> >> >
>> >> >> >> Hi David,
>> >> >> >>
>> >> >> >> Thanks for starting this important KIP.
>> >> >> >>
>> >> >> >> I've just taken a quick look so far but I've got a couple of
>> >> >> >> initial questions:
>> >> >> >>
>> >> >> >> 1) What happens if a non-KRaft-compatible broker (or one with
>> >> >> >> kafka.metadata.migration.enable set to false) joins the cluster
>> >> >> >> after the migration is triggered?
>> >> >> >>
>> >> >> >> 2) In the Failure Modes section you mention a scenario where a
>> >> >> >> write to ZK fails. What happens when the divergence limit is
>> >> >> >> reached? Is this a fatal condition? How much divergence should we
>> >> >> >> allow?
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >> Mickael
>> >> >> >>
>> >> >> >> On Wed, Oct 5, 2022 at 12:20 AM David Arthur <mum...@gmail.com> wrote:
>> >> >> >> >
>> >> >> >> > Hey folks, I wanted to get the ball rolling on the discussion for
>> >> >> >> > the ZooKeeper migration KIP. This KIP details how we plan to do
>> >> >> >> > an online migration of metadata from ZooKeeper to KRaft as well
>> >> >> >> > as a rolling upgrade of brokers to KRaft mode.
>> >> >> >> >
>> >> >> >> > The general idea is to keep KRaft and ZooKeeper in sync during
>> >> >> >> > the migration, so both types of brokers can exist simultaneously.
>> >> >> >> > Then, once everything is migrated and updated, we can turn off
>> >> >> >> > ZooKeeper writes.
>> >> >> >> >
>> >> >> >> > This is a pretty complex KIP, so please take a look :)
>> >> >> >> >
>> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-866+ZooKeeper+to+KRaft+Migration
>> >> >> >> >
>> >> >> >> > Thanks!
>> >> >> >> > David
>> >> >> >>
>> >> >>
>> >>
>>
>
>
> -- 
> -David
