Thanks for explaining, Ismael! Breaking this down into follow-up KIPs sounds like a good idea.
On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Boyang,
>
> Yes, there will be several KIPs that will discuss the items you describe
> in detail. Colin, it may be helpful to make this clear in the KIP-500
> description.
>
> Ismael
>
> On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
>
> > Thanks Colin for initiating this important effort!
> >
> > One question I have is whether we have a session discussing the
> > controller failover in the new architecture? I know we are using the
> > Raft protocol to fail over, yet it's still valuable to discuss the steps
> > a new cluster is going to take to reach a stable state again, so that we
> > could easily measure the availability of the metadata servers.
> >
> > Another suggestion I have is to write a step-by-step design doc like
> > what we did in KIP-98
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>,
> > including the new request protocols and how they interact in the new
> > cluster. For a complicated change like this, an implementation design
> > doc helps a lot in the review process; otherwise most discussions we
> > have will focus on the high level and lose important details as we
> > discover them in the post-agreement phase.
> >
> > Boyang
> >
> > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > > Thanks Colin for the detailed KIP. I have a few comments and
> > > > questions.
> > > >
> > > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr
> > > > and UpdateMetadata RPCs. For example, "updates which the controller
> > > > pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your
> > > > thinking that we will use MetadataFetch as a replacement for just
> > > > UpdateMetadata, and add topic configuration in this state?
> > >
> > > Hi Jose,
> > >
> > > Thanks for taking a look.
> > >
> > > The goal is for MetadataFetchRequest to replace both
> > > LeaderAndIsrRequest and UpdateMetadataRequest. Topic configurations
> > > would be fetched along with the other metadata.
> > >
> > > > In the section "Broker Metadata Management", you mention "Just like
> > > > with a fetch request, the broker will track the offset of the last
> > > > updates it fetched". To keep the log consistent, Raft requires that
> > > > the followers keep all of the log entries (term/epoch and offset)
> > > > that are after the high watermark; any log entry before the high
> > > > watermark can be compacted/snapshotted. Do we expect the
> > > > MetadataFetch API to only return log entries up to the high
> > > > watermark, unlike the Raft replication API, which will
> > > > replicate/fetch log entries after the high watermark for consensus?
> > >
> > > Good question. Clearly, we shouldn't expose metadata updates to the
> > > brokers until they've been stored on a majority of the Raft nodes. The
> > > most obvious way to do that, like you mentioned, is to have the
> > > brokers only fetch up to the HWM, but not beyond. There might be a
> > > more clever way to do it by fetching the data, but not having the
> > > brokers act on it until the HWM advances. I'm not sure if that's worth
> > > it or not. We'll discuss this more in a separate KIP that discusses
> > > just Raft.
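
To make the high-watermark point above a bit more concrete, here is a rough
sketch of a broker-side fetcher that refuses to act on anything beyond the
Raft HWM. All class and method names below are made up for illustration;
nothing here is from the KIP or the existing codebase.

    import java.util.List;

    // Hypothetical sketch only: none of these types exist in Kafka. The idea
    // is "fetch metadata, but act on nothing beyond the Raft high watermark".
    public final class BrokerMetadataFetcher {

        /** A single metadata record plus the log offset it was written at. */
        public record MetadataRecord(long offset, byte[] payload) {}

        /** Response to a hypothetical MetadataFetch RPC. */
        public record MetadataFetchResponse(long highWatermark,
                                            List<MetadataRecord> records) {}

        /** Whatever in-memory view of the metadata the broker maintains. */
        public interface MetadataImage {
            void apply(MetadataRecord record);
        }

        private long lastAppliedOffset = -1L;

        /** Apply only records that a majority of the controller quorum has
         *  durably stored. */
        public void handleResponse(MetadataFetchResponse response, MetadataImage image) {
            for (MetadataRecord record : response.records()) {
                // Records past the high watermark could still be truncated by
                // a new Raft leader, so the broker must not act on them yet.
                if (record.offset() > response.highWatermark()) {
                    break;
                }
                image.apply(record);
                lastAppliedOffset = record.offset();
            }
        }

        /** The next MetadataFetch resumes just after the last applied offset. */
        public long nextFetchOffset() {
            return lastAppliedOffset + 1;
        }
    }

Whether the brokers avoid fetching past the HWM entirely, or fetch the data
and withhold it until the HWM advances, is exactly the trade-off Colin defers
to the separate Raft KIP.
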
> > > > In section "Broker Metadata Management", you mention "the controller
> > > > will send a full metadata image rather than a series of deltas". This
> > > > KIP doesn't go into the set of operations that need to be supported
> > > > on top of Raft, but it would be interesting if this "full metadata
> > > > image" could also be expressed as deltas. For example, assuming we
> > > > are replicating a map, this "full metadata image" could be a sequence
> > > > of "put" operations (znode create, to borrow ZK semantics).
> > >
> > > The full image can definitely be expressed as a sum of deltas. At some
> > > point, the number of deltas will get large enough that sending a full
> > > image is better, though. One question that we're still thinking about
> > > is how much of this can be shared with generic Kafka log code, and how
> > > much should be different.
> > >
> > > > In section "Broker Metadata Management", you mention "This request
> > > > will double as a heartbeat, letting the controller know that the
> > > > broker is alive". In section "Broker State Machine", you mention "The
> > > > MetadataFetch API serves as this registration mechanism". Does this
> > > > mean that the MetadataFetch request will optionally include broker
> > > > configuration information?
> > >
> > > I was originally thinking that the MetadataFetchRequest should include
> > > broker configuration information. Thinking about this more, maybe we
> > > should just have a special registration RPC that contains that
> > > information, to avoid sending it over the wire all the time.
> > >
> > > > Does this also mean that a MetadataFetch request will result in a
> > > > "write"/AppendEntries through the Raft replication protocol before
> > > > you can send the associated MetadataFetch response?
> > >
> > > I think we should require the broker to be out of the Offline state
> > > before allowing it to fetch metadata, yes. So the separate registration
> > > RPC should have completed first.
> > >
> > > > In section "Broker State", you mention that a broker can transition
> > > > to online after it is caught up with the metadata. What do you mean
> > > > by this? Metadata is always changing. How does the broker know that
> > > > it is caught up, since it doesn't participate in the consensus or the
> > > > advancement of the high watermark?
> > >
> > > That's a good point. Being "caught up" is somewhat of a fuzzy concept
> > > here, since the brokers do not participate in the metadata consensus.
> > > I think ideally we would want to define it in terms of time ("the
> > > broker has all the updates from the last 2 minutes", for example). We
> > > should spell this out better in the KIP.
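
The time-based definition of "caught up" could be as simple as the sketch
below (again, hypothetical names only): the controller stamps each committed
metadata batch with its commit time, and the broker compares that timestamp
against a configured maximum lag before considering itself caught up.

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch of a time-based "caught up" check, as in "the broker
    // has all the updates from the last 2 minutes". Not an actual Kafka API.
    public final class MetadataCatchUpCheck {

        private final Duration maxLag;
        private volatile Instant lastCommitTime = Instant.EPOCH;

        public MetadataCatchUpCheck(Duration maxLag) {
            this.maxLag = maxLag;
        }

        /** Called whenever a fetched metadata batch is applied; the controller
         *  would stamp each batch with the time it was committed. */
        public void onBatchApplied(Instant controllerCommitTime) {
            lastCommitTime = controllerCommitTime;
        }

        /** A broker would only report itself Online while this holds. */
        public boolean isCaughtUp(Instant now) {
            return Duration.between(lastCommitTime, now).compareTo(maxLag) <= 0;
        }
    }

A broker configured with, say, new MetadataCatchUpCheck(Duration.ofMinutes(2))
would match the "all the updates from the last 2 minutes" example above, and
could fall back out of Online if its fetches stall for longer than that.
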
> > > > In section "Start the controller quorum nodes", you mention "Once it
> > > > has taken over the /controller node, the active controller will
> > > > proceed to load the full state of ZooKeeper. It will write out this
> > > > information to the quorum's metadata storage. After this point, the
> > > > metadata quorum will be the metadata store of record, rather than the
> > > > data in ZooKeeper." During this migration, should we expect a small
> > > > period of controller unavailability while the controller replicates
> > > > this state to all of the Raft nodes in the controller quorum and we
> > > > buffer new controller API requests?
> > >
> > > Yes, the controller would be unavailable during this time. I don't
> > > think this will be that different from the current period of
> > > unavailability when a new controller starts up and needs to load the
> > > full state from ZK. The main difference is that in this period, we'd
> > > have to write to the controller quorum rather than just to memory. But
> > > we believe this should be pretty fast.
> > >
> > > regards,
> > > Colin
> > >
> > > > Thanks!
> > > > -Jose
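
For what it's worth, the one-time ZooKeeper-to-quorum copy that Colin
describes could look roughly like the sketch below. The ZooKeeper client calls
(getData/getChildren) are the real API; the MetadataQuorumLog interface is
invented here purely to stand in for "append a record to the controller
quorum's metadata storage".

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Rough sketch of the one-time ZK -> quorum copy. The ZooKeeper calls are
    // the real client API; MetadataQuorumLog is invented for illustration.
    public final class ZkMetadataMigrator {

        /** Stand-in for "append a record to the quorum's metadata storage". */
        public interface MetadataQuorumLog {
            void append(String path, byte[] value);
        }

        private final ZooKeeper zk;
        private final MetadataQuorumLog quorumLog;

        public ZkMetadataMigrator(ZooKeeper zk, MetadataQuorumLog quorumLog) {
            this.zk = zk;
            this.quorumLog = quorumLog;
        }

        /** Recursively copy a znode subtree into the quorum log. While this
         *  runs, incoming controller requests would have to be buffered or
         *  rejected -- the brief unavailability window discussed above. */
        public void copySubtree(String path) throws KeeperException, InterruptedException {
            byte[] data = zk.getData(path, false, null);
            quorumLog.append(path, data == null ? new byte[0] : data);
            List<String> children = zk.getChildren(path, false);
            for (String child : children) {
                String childPath = path.equals("/") ? "/" + child : path + "/" + child;
                copySubtree(childPath);
            }
        }
    }
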