Hi all,

The KIP has been out for a while, so I'm thinking about calling a vote some time this week.
best,
Colin

On Mon, Aug 19, 2019, at 15:52, Colin McCabe wrote:
> On Mon, Aug 19, 2019, at 12:52, David Arthur wrote:
> > Thanks for the KIP, Colin. This looks great!
> >
> > I really like the idea of separating the Controller and Broker JVMs.
> >
> > As you alluded to above, it might be nice to have a separate broker-registration API to avoid overloading the metadata fetch API.
> >
>
> Hi David,
>
> Thanks for taking a look.
>
> I removed the sentence about MetadataFetch also serving as the broker registration API. I think I agree that we will probably want a separate RPC to fill this role. We will have a follow-on KIP that will go into more detail about metadata propagation and registration in the post-ZK world. That KIP will also have a full description of the registration RPC, etc. For now, I think the important part for KIP-500 is that the broker registers with the controller quorum. On registration, the controller quorum assigns it a new broker epoch, which can distinguish successive broker incarnations.
>
> > When a broker gets a metadata delta, will it be a sequence of deltas since the last update or a cumulative delta since the last update?
>
> It will be a sequence of deltas. Basically, the broker will be reading from the metadata log.
>
> > Will we include any kind of integrity check on the deltas to ensure the brokers have applied them correctly? Perhaps this will be addressed in one of the follow-on KIPs.
>
> In general, we will have checksums on the metadata that we fetch. This is similar to how we have checksums on regular data. Or if the question is about catching logic errors in the metadata handling code, that sounds more like something that should be caught by test cases.
>
> best,
> Colin
>
> > Thanks!
> >
> > On Fri, Aug 9, 2019 at 1:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Mickael,
> > >
> > > Thanks for taking a look.
> > >
> > > I don't think we want to support that kind of multi-tenancy at the controller level. If the cluster is small enough that we want to pack the controller(s) with something else, we could run them alongside the brokers, or possibly inside three of the broker JVMs.
> > >
> > > best,
> > > Colin
> > >
> > > On Wed, Aug 7, 2019, at 10:37, Mickael Maison wrote:
> > > > Thanks Colin for kickstarting this initiative.
> > > >
> > > > Just one question.
> > > > - A nice feature of Zookeeper is the ability to use chroots and have several Kafka clusters use the same Zookeeper ensemble. Is this something we should keep?
> > > >
> > > > Thanks
> > > >
> > > > On Mon, Aug 5, 2019 at 7:44 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > >
> > > > > On Mon, Aug 5, 2019, at 10:02, Tom Bentley wrote:
> > > > > > Hi Colin,
> > > > > >
> > > > > > Thanks for the KIP.
> > > > > >
> > > > > > Currently ZooKeeper provides a convenient notification mechanism for knowing that broker and topic configuration has changed. While KIP-500 does suggest that incremental metadata update is expected to come to clients eventually, that would seem to imply that for some number of releases there would be no equivalent mechanism for knowing about config changes. Is there any thinking at this point about how a similar notification might be provided in the future?
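As an aside on the delta and checksum discussion quoted above: a rough sketch, in Java, of how a broker-side applier might consume the metadata log in offset order and verify an integrity checksum per record. All of the names here (MetadataDeltaApplier, DeltaRecord, and so on) are placeholders for illustration, not anything defined by KIP-500.

    import java.util.zip.CRC32;

    // Hypothetical sketch only: KIP-500 does not define these names.
    // A broker-side applier that consumes metadata deltas in offset order
    // and checks a per-record checksum before applying each one.
    public class MetadataDeltaApplier {
        // A single delta record fetched from the metadata log (placeholder shape).
        public static class DeltaRecord {
            final long offset;
            final byte[] payload;
            final long checksum;

            DeltaRecord(long offset, byte[] payload, long checksum) {
                this.offset = offset;
                this.payload = payload;
                this.checksum = checksum;
            }
        }

        private long lastAppliedOffset = -1L;

        public void apply(DeltaRecord record) {
            // The metadata is a sequence of deltas, so records must arrive in offset order.
            if (record.offset != lastAppliedOffset + 1) {
                throw new IllegalStateException("Expected offset " + (lastAppliedOffset + 1)
                    + " but got " + record.offset);
            }
            // Integrity check, analogous to the checksums we already have on regular data.
            CRC32 crc = new CRC32();
            crc.update(record.payload);
            if (crc.getValue() != record.checksum) {
                throw new IllegalStateException("Checksum mismatch at offset " + record.offset);
            }
            applyToLocalMetadataImage(record.payload);
            lastAppliedOffset = record.offset;
        }

        private void applyToLocalMetadataImage(byte[] payload) {
            // Placeholder: update the broker's in-memory metadata image here.
        }

        public long lastAppliedOffset() {
            return lastAppliedOffset;
        }
    }

The point is just that tracking the last applied offset is what lets a fetch return "a sequence of deltas since the last update" rather than a cumulative delta.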
> > > > >
> > > > > We could eventually have some inotify-like mechanism where clients could register interest in various types of events and get notified when they happened. Reading the metadata log is conceptually simple. The main complexity would be in setting up an API that made sense and that didn't unduly constrain future implementations. We'd have to think carefully about what the real use-cases for this were, though.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Aug 5, 2019 at 3:49 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:
> > > > > >
> > > > > > > Hey Colin,
> > > > > > >
> > > > > > > I think this is a long-awaited KIP, thanks for driving it. I'm excited to see this in Kafka. I collected my questions (and I accept the "TBD" answer as they might be a bit deep for this high level :) ).
> > > > > > > 1.) Are there any specific reasons for the Controller persisting its state on disk only periodically instead of asynchronously with every update? Wouldn't less frequent saves increase the chance of missing a state change if the controller crashes between two saves?
> > > > > > > 2.) Why can't we allow brokers to fetch metadata from the follower controllers? I assume that followers would have up-to-date information, therefore brokers could fetch from there in theory.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for explaining, Ismael! Breaking down into follow-up KIPs sounds like a good idea.
> > > > > > > >
> > > > > > > > On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ism...@juma.me.uk> wrote:
> > > > > > > >
> > > > > > > > > Hi Boyang,
> > > > > > > > >
> > > > > > > > > Yes, there will be several KIPs that will discuss the items you describe in detail. Colin, it may be helpful to make this clear in the KIP-500 description.
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Colin for initiating this important effort!
> > > > > > > > > >
> > > > > > > > > > One question I have is whether we have a session discussing the controller failover in the new architecture? I know we are using the Raft protocol to fail over, yet it's still valuable to discuss the steps the new cluster is going to take to reach the stable stage again, so that we could easily measure the availability of the metadata servers.
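To make the inotify-like idea above a bit more concrete, here is a hypothetical client-side watch interface. The names are invented for illustration and are not a proposed Kafka API; the only grounded idea is "register interest in various types of events" backed by offsets in the metadata log.

    import java.util.EnumSet;
    import java.util.function.Consumer;

    // Hypothetical sketch only: not a proposed Kafka API.
    public interface MetadataWatcher {

        // The kinds of metadata changes a client might subscribe to.
        enum EventType {
            TOPIC_CONFIG_CHANGED,
            BROKER_CONFIG_CHANGED,
            PARTITION_LEADERSHIP_CHANGED
        }

        // A single change event, carrying an offset in the metadata log
        // so clients can resume from where they left off.
        final class MetadataEvent {
            public final EventType type;
            public final long metadataOffset;
            public final String resourceName;

            public MetadataEvent(EventType type, long metadataOffset, String resourceName) {
                this.type = type;
                this.metadataOffset = metadataOffset;
                this.resourceName = resourceName;
            }
        }

        // Register interest in a set of event types; the callback fires as the
        // client reads new entries from the metadata log.
        void watch(EnumSet<EventType> interests, Consumer<MetadataEvent> callback);
    }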
> > > > > > > > > >
> > > > > > > > > > Another suggestion I have is to write a step-by-step design doc like what we did in KIP-98 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>, including the new request protocols and how they interact in the new cluster. For a complicated change like this, an implementation design doc helps a lot in the review process; otherwise most discussions we have will focus on the high level and lose important details as we discover them in the post-agreement phase.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > > > > > > > > > > Thanks Colin for the detailed KIP. I have a few comments and questions.
> > > > > > > > > > > >
> > > > > > > > > > > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr and UpdateMetadata RPCs. For example, "updates which the controller pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your thinking that we will use MetadataFetch as a replacement for just UpdateMetadata only, and add topic configuration in this state?
> > > > > > > > > > >
> > > > > > > > > > > Hi Jose,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for taking a look.
> > > > > > > > > > >
> > > > > > > > > > > The goal is for MetadataFetchRequest to replace both LeaderAndIsrRequest and UpdateMetadataRequest. Topic configurations would be fetched along with the other metadata.
> > > > > > > > > > >
> > > > > > > > > > > > In the section "Broker Metadata Management", you mention "Just like with a fetch request, the broker will track the offset of the last updates it fetched". To keep the log consistent, Raft requires that the followers keep all of the log entries (term/epoch and offset) that are after the highwatermark. Any log entry before the highwatermark can be compacted/snapshotted. Do we expect the MetadataFetch API to only return log entries up to the highwatermark, unlike the Raft replication API, which will replicate/fetch log entries after the highwatermark for consensus?
> > > > > > > > > > >
> > > > > > > > > > > Good question. Clearly, we shouldn't expose metadata updates to the brokers until they've been stored on a majority of the Raft nodes. The most obvious way to do that, like you mentioned, is to have the brokers only fetch up to the HWM, but not beyond. There might be a more clever way to do it by fetching the data, but not having the brokers act on it until the HWM advances. I'm not sure if that's worth it or not. We'll discuss this more in a separate KIP that just discusses Raft.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker Metadata Management", you mention "the controller will send a full metadata image rather than a series of deltas". This KIP doesn't go into the set of operations that need to be supported on top of Raft, but it would be interesting if this "full metadata image" could also be expressed as deltas. For example, assuming we are replicating a map, this "full metadata image" could be a sequence of "put" operations (znode create, to borrow ZK semantics).
> > > > > > > > > > >
> > > > > > > > > > > The full image can definitely be expressed as a sum of deltas. At some point, the number of deltas will get large enough that sending a full image is better, though. One question that we're still thinking about is how much of this can be shared with generic Kafka log code, and how much should be different.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker Metadata Management", you mention "This request will double as a heartbeat, letting the controller know that the broker is alive". In section "Broker State Machine", you mention "The MetadataFetch API serves as this registration mechanism". Does this mean that the MetadataFetch request will optionally include broker configuration information?
> > > > > > > > > > >
> > > > > > > > > > > I was originally thinking that the MetadataFetchRequest should include broker configuration information. Thinking about this more, maybe we should just have a special registration RPC that contains that information, to avoid sending it over the wire all the time.
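To illustrate the split described above between a one-time registration RPC and the frequent MetadataFetch heartbeat, here is a hedged sketch of what the two request shapes might carry. These are not the KIP-500 wire formats; the field layout is an assumption made purely for illustration.

    // Hypothetical sketch only: these are not KIP-500 wire formats, just an
    // illustration of splitting registration (sent once per incarnation, carrying
    // the bulky configuration) from the frequent metadata fetch / heartbeat.
    public final class RegistrationSketch {

        // Sent once when the broker starts up; the controller quorum would respond
        // with a newly assigned broker epoch for this incarnation.
        public static final class BrokerRegistrationRequest {
            public final int brokerId;
            public final java.util.Map<String, String> brokerConfig;

            public BrokerRegistrationRequest(int brokerId, java.util.Map<String, String> brokerConfig) {
                this.brokerId = brokerId;
                this.brokerConfig = brokerConfig;
            }
        }

        // Sent frequently and doubling as a heartbeat. It only needs the broker id,
        // the epoch from registration, and the last fetched offset, so the
        // configuration is not resent over the wire on every heartbeat.
        public static final class MetadataFetchRequest {
            public final int brokerId;
            public final long brokerEpoch;
            public final long lastFetchedOffset;

            public MetadataFetchRequest(int brokerId, long brokerEpoch, long lastFetchedOffset) {
                this.brokerId = brokerId;
                this.brokerEpoch = brokerEpoch;
                this.lastFetchedOffset = lastFetchedOffset;
            }
        }

        private RegistrationSketch() {}
    }

The design point being illustrated is simply that the bulky, rarely-changing configuration travels once per broker incarnation, while the heartbeat stays small.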
> > > > > > > > > > > > Does this also mean that a MetadataFetch request will result in a "write"/AppendEntries through the Raft replication protocol before you can send the associated MetadataFetch response?
> > > > > > > > > > >
> > > > > > > > > > > I think we should require the broker to be out of the Offline state before allowing it to fetch metadata, yes. So the separate registration RPC should have completed first.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker State", you mention that a broker can transition to online after it is caught up with the metadata. What do you mean by this? Metadata is always changing. How does the broker know that it is caught up, since it doesn't participate in the consensus or the advancement of the highwatermark?
> > > > > > > > > > >
> > > > > > > > > > > That's a good point. Being "caught up" is somewhat of a fuzzy concept here, since the brokers do not participate in the metadata consensus. I think ideally we would want to define it in terms of time ("the broker has all the updates from the last 2 minutes", for example). We should spell this out better in the KIP.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Start the controller quorum nodes", you mention "Once it has taken over the /controller node, the active controller will proceed to load the full state of ZooKeeper. It will write out this information to the quorum's metadata storage. After this point, the metadata quorum will be the metadata store of record, rather than the data in ZooKeeper." During this migration, should we expect a small period of controller unavailability while the controller replicates this state to all of the Raft nodes in the controller quorum and we buffer new controller API requests?
> > > > > > > > > > >
> > > > > > > > > > > Yes, the controller would be unavailable during this time. I don't think this will be that different from the current period of unavailability when a new controller starts up and needs to load the full state from ZK. The main difference is that in this period, we'd have to write to the controller quorum rather than just to memory. But we believe this should be pretty fast.
> > > > > > > > > > >
> > > > > > > > > > > regards,
> > > > > > > > > > > Colin
> > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > -Jose
> >
> > --
> > David Arthur
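As a footnote on the time-based definition of "caught up" discussed above: a minimal sketch of what such a check could look like, assuming the broker records the append timestamp of the newest metadata record it has applied. The class and method names are hypothetical.

    import java.time.Duration;

    // Hypothetical sketch only: illustrates the time-based notion of "caught up"
    // mentioned above, where a broker counts as caught up if it has applied all
    // metadata updates from, say, the last 2 minutes.
    public class BrokerCatchUpCheck {
        private final Duration maxLag;

        public BrokerCatchUpCheck(Duration maxLag) {
            this.maxLag = maxLag;
        }

        // lastAppliedRecordTimestampMs: the append time of the newest metadata
        // record this broker has applied; nowMs: the current wall-clock time.
        // (A real implementation would also need to handle an idle metadata log.)
        public boolean isCaughtUp(long lastAppliedRecordTimestampMs, long nowMs) {
            return nowMs - lastAppliedRecordTimestampMs <= maxLag.toMillis();
        }

        public static void main(String[] args) {
            BrokerCatchUpCheck check = new BrokerCatchUpCheck(Duration.ofMinutes(2));
            long now = System.currentTimeMillis();
            // A broker that applied a record 30 seconds ago is considered caught up.
            System.out.println(check.isCaughtUp(now - 30_000L, now));   // true
            // One that is 10 minutes behind is not.
            System.out.println(check.isCaughtUp(now - 600_000L, now));  // false
        }
    }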