Hi Ryanne,

Apache Ratis looks like a very interesting project, but I don't think this is 
the right use-case for it.  At its heart, Apache Kafka is a system for managing 
logs.  We should avoid adding a dependency on an external system to manage the 
logs of Kafka itself, since that is one of Kafka's core functions.

In the past we've successfully used Kafka itself to store metadata about 
consumer offsets, transactions, and so on.  This is the same kind of thing, 
except that we need a slightly different replication protocol to avoid the 
dependency that the existing one has on the controller.

I think that down the road, having support for quorum replication in Kafka 
will be useful for regular topics, not just for the metadata partition.  
Quorum-based replication has dramatically lower tail latencies than the 
acks=all configuration that many people use currently.  The tradeoff is that 
Raft doesn't support redundancy with fewer than 3 replicas.  But that is a 
tradeoff that is appropriate to make for many applications.
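
To make the tradeoff concrete, here is a rough sketch from the producer's
point of view (the "quorum" comment describes hypothetical behavior, not
an existing Kafka configuration):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;

    public class AcksExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // acks=all: the leader waits for every in-sync replica to
            // acknowledge, so a single slow replica adds directly to
            // tail latency.
            props.put("acks", "all");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props)) {
                // Under quorum replication, an acknowledgement would
                // instead require only a majority (e.g. 2 of 3
                // replicas), keeping the slowest replica off the
                // critical path.
            }
        }
    }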

best,
Colin

On Wed, Aug 21, 2019, at 12:19, Ryanne Dolan wrote:
> Colin, have you considered leveraging Apache Ratis (incubating)?
> 
> Ryanne
> 
> On Wed, Aug 21, 2019, 1:28 PM Colin McCabe <cmcc...@apache.org> wrote:
> 
> > On Wed, Aug 21, 2019, at 06:38, Eno Thereska wrote:
> > > Hi Colin,
> > >
> > > Nice KIP! For such a big change it would be good to add a pointer or
> > > two to related work that provides some sort of soft proof that the
> > > approach taken makes sense. Also such work often builds on other work
> > > and it might be useful to trace its roots. May I recommend adding a
> > > pointer to "Tango: Distributed Data Structures over a Shared Log"
> > > (http://www.cs.cornell.edu/~taozou/sosp13/tangososp.pdf)? There are
> > > other papers that are related (e.g., a more recent one, "The
> > > FuzzyLog: A Partially Ordered Shared Log"
> > > (https://www.usenix.org/system/files/osdi18-lockerman.pdf)).
> > >
> > > Both papers would add to the strength of your motivation.
> > >
> >
> > Hi Eno,
> >
> > Good point.  I added a "references" section at the end and added the Tango
> > paper.  I am not sure we need the FuzzyLog one, though.
> >
> > I also added a link to the Raft paper and one of the papers on HDFS, since
> > I feel like these are very relevant here.
> >
> > best,
> > Colin
> >
> > > Cheers
> > > Eno
> > >
> > > On Wed, Aug 21, 2019 at 12:22 PM Ron Dagostino <rndg...@gmail.com>
> > wrote:
> > > >
> > > > Hi Colin.  I like the concept of a "bridge release" for migrating off
> > of
> > > > Zookeeper, but I worry that it may become a bottleneck if people
> > hesitate
> > > > to replace Zookeeper -- they would be unable to adopt newer versions of
> > > > Kafka until taking (what feels to them like) a giant leap.  As an
> > example,
> > > > assuming version 4.0.x of Kafka is the supported bridge release, I
> > would
> > > > not be surprised if uptake of the 4.x release and the time-based
> > releases
> > > > that follow it ends up being much slower due to the perceived barrier.
> > > >
> > > > Any perceived barrier could be lowered if the 4.0.x release could
> > > > optionally continue to use Zookeeper -- then the cutover would be two
> > > > incremental steps (move to 4.0.x, then replace Zookeeper while staying
> > on
> > > > 4.0.x) as opposed to a single big-bang (upgrade to 4.0.x and replace
> > > > Zookeeper in one fell swoop).
> > > >
> > > > Regardless of whether what I wrote above has merit or not, I think the
> > KIP
> > > > should be more explicit about what the upgrade constraints actually
> > are.
> > > > Can the bridge release be adopted with Zookeeper remaining in place and
> > > > then cutting over as a second, follow-on step, or must the Controller
> > > > Quorum nodes be started first and the bridge release cannot be used
> > with
> > > > Zookeeper at all?  If the bridge release cannot be used with Zookeeper
> > at
> > > > all, then no version at or beyond the bridge release is available
> > > > unless/until abandoning Zookeeper; if the bridge release can be used
> > with
> > > > Zookeeper, then is it the only version that can be used with
> > Zookeeper, or
> > > > can Zookeeper be kept for additional releases if desired?
> > > >
> > > > Ron
> > > >
> > > > On Tue, Aug 20, 2019 at 10:19 AM Ron Dagostino <rndg...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Colin.  The diagram up at the top confused me -- specifically, the
> > > > > lines connecting the controller/active-controller to the brokers.  I
> > had
> > > > > assumed the arrows on those lines represented the direction of data
> > flow,
> > > > > but that is not the case; the arrows actually identify the target of
> > the
> > > > > action, and the non-arrowed end indicates the initiator of the
> > action.  For
> > > > > example, the lines point from the controller to the brokers in the
> > "today"
> > > > > section on the left to show that the controller pushes to the
> > brokers; the
> > > > > lines point from the brokers to the active-controller in the
> > "tomorrow"
> > > > > section on the right to show that the brokers pull from the
> > > > > active-controller.  As I said, this confused me because my gut
> > instinct was
> > > > > to interpret the arrow as indicating the direction of data flow, and
> > when I
> > > > > look at the "tomorrow" picture on the right I initially thought
> > information
> > > > > was moving from the brokers to the active-controller.  Did you
> > consider
> > > > > drawing that picture with the arrows reversed in the "tomorrow" side
> > so
> > > > > that the arrows represent the direction of data flow, and then add
> > the
> > > > > labels "push" on the "today" side and "pull" on the "tomorrow" side
> > to
> > > > > indicate who initiates the data flow?  It occurs to me that this
> > picture
> > > > > may end up being widely distributed, so it might be in everyone's
> > interest
> > > > > to proactively avoid any possible confusion by being more explicit.
> > > > >
> > > > > Minor corrections:
> > > > > <<<In the current world, a broker which can contact ZooKeeper but
> > which
> > > > > is partitioned from the active controller
> > > > > >>>In the current world, a broker which can contact ZooKeeper but
> > which
> > > > > is partitioned from the controller
> > > > >
> > > > > <<<Eventually, the controller will ask the broker to finally go
> > offline
> > > > > >>>Eventually, the active controller will ask the broker to finally
> > go
> > > > > offline
> > > > >
> > > > > <<<New versions of the clients should send these operations directly
> > to
> > > > > the controller
> > > > > >>>New versions of the clients should send these operations directly
> > to
> > > > > the active controller
> > > > >
> > > > > <<<In the post-ZK world, the leader will make an RPC to the
> > controller
> > > > > instead
> > > > > >>>In the post-ZK world, the leader will make an RPC to the active
> > > > > controller instead
> > > > >
> > > > > <<<For example, the brokers may need to forward their requests to the
> > > > > controller.
> > > > > >>>For example, the brokers may need to forward their requests to the
> > > > > active controller.
> > > > >
> > > > > <<<The new controller will monitor ZooKeeper for legacy broker node
> > > > > registrations
> > > > > >>>The new (active) controller will monitor ZooKeeper for legacy
> > broker
> > > > > node registrations
> > > > >
> > > > > Ron
> > > > >
> > > > > On Mon, Aug 19, 2019 at 6:53 PM Colin McCabe <cmcc...@apache.org>
> > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> The KIP has been out for a while, so I'm thinking about calling a
> > vote
> > > > >> some time this week.
> > > > >>
> > > > >> best,
> > > > >> Colin
> > > > >>
> > > > >> On Mon, Aug 19, 2019, at 15:52, Colin McCabe wrote:
> > > > >> > On Mon, Aug 19, 2019, at 12:52, David Arthur wrote:
> > > > >> > > Thanks for the KIP, Colin. This looks great!
> > > > >> > >
> > > > >> > > I really like the idea of separating the Controller and Broker
> > JVMs.
> > > > >> > >
> > > > >> > > As you alluded to above, it might be nice to have a separate
> > > > >> > > broker-registration API to avoid overloading the metadata fetch
> > API.
> > > > >> > >
> > > > >> >
> > > > >> > Hi David,
> > > > >> >
> > > > >> > Thanks for taking a look.
> > > > >> >
> > > > >> > I removed the sentence about MetadataFetch also serving as the
> > broker
> > > > >> > registration API.  I think I agree that we will probably want a
> > > > >> > separate RPC to fill this role.  We will have a follow-on KIP
> > that will
> > > > >> > go into more detail about metadata propagation and registration
> > in the
> > > > >> > post-ZK world.  That KIP will also have a full description of the
> > > > >> > registration RPC, etc.  For now, I think the important part for
> > KIP-500
> > > > >> > is that the broker registers with the controller quorum.  On
> > > > >> > registration, the controller quorum assigns it a new broker epoch,
> > > > >> > which can distinguish successive broker incarnations.
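> > > > >> >
> > > > >> > As a purely illustrative sketch (none of these names are from the
> > > > >> > KIP), the epoch assignment could look something like this:
> > > > >> >
> > > > >> >     // Hypothetical controller-side registration handler.
> > > > >> >     class BrokerRegistrar {
> > > > >> >         private long nextBrokerEpoch = 0;
> > > > >> >
> > > > >> >         // Each registration returns a strictly increasing epoch,
> > > > >> >         // so a restarted broker can be told apart from its
> > > > >> >         // previous incarnation.
> > > > >> >         synchronized long register(int brokerId) {
> > > > >> >             return nextBrokerEpoch++;
> > > > >> >         }
> > > > >> >     }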
> > > > >> >
> > > > >> > >
> > > > >> > > When a broker gets a metadata delta, will it be a sequence of
> > deltas
> > > > >> since
> > > > >> > > the last update or a cumulative delta since the last update?
> > > > >> > >
> > > > >> >
> > > > >> > It will be a sequence of deltas.  Basically, the broker will be
> > reading
> > > > >> > from the metadata log.
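> > > > >> >
> > > > >> > To sketch what applying that sequence might look like on the broker
> > > > >> > side (hypothetical names, not a real interface):
> > > > >> >
> > > > >> >     import java.util.List;
> > > > >> >
> > > > >> >     // A hypothetical delta: one change at one offset in the log.
> > > > >> >     record MetadataDelta(long offset, Runnable change) {}
> > > > >> >
> > > > >> >     class BrokerMetadataFollower {
> > > > >> >         private long lastAppliedOffset = -1L;
> > > > >> >
> > > > >> >         // Apply the deltas in log order, remembering the offset
> > > > >> >         // of the last record applied so the next fetch can
> > > > >> >         // resume from there.
> > > > >> >         void applyAll(List<MetadataDelta> deltas) {
> > > > >> >             for (MetadataDelta delta : deltas) {
> > > > >> >                 delta.change().run();
> > > > >> >                 lastAppliedOffset = delta.offset();
> > > > >> >             }
> > > > >> >         }
> > > > >> >     }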
> > > > >> >
> > > > >> > >
> > > > >> > > Will we include any kind of integrity check on the deltas to
> > ensure
> > > > >> the brokers
> > > > >> > > have applied them correctly? Perhaps this will be addressed in
> > one of
> > > > >> the
> > > > >> > > follow-on KIPs.
> > > > >> > >
> > > > >> >
> > > > >> > In general, we will have checksums on the metadata that we
> > fetch.  This
> > > > >> > is similar to how we have checksums on regular data.  Or if the
> > > > >> > question is about catching logic errors in the metadata handling
> > code,
> > > > >> > that sounds more like something that should be caught by test
> > cases.
> > > > >> >
> > > > >> > best,
> > > > >> > Colin
> > > > >> >
> > > > >> >
> > > > >> > > Thanks!
> > > > >> > >
> > > > >> > > On Fri, Aug 9, 2019 at 1:17 PM Colin McCabe <cmcc...@apache.org
> > >
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi Mickael,
> > > > >> > > >
> > > > >> > > > Thanks for taking a look.
> > > > >> > > >
> > > > >> > > > I don't think we want to support that kind of multi-tenancy
> > at the
> > > > >> > > > controller level.  If the cluster is small enough that we
> > want to
> > > > >> pack the
> > > > >> > > > controller(s) with something else, we could run them
> > alongside the
> > > > >> brokers,
> > > > >> > > > or possibly inside three of the broker JVMs.
> > > > >> > > >
> > > > >> > > > best,
> > > > >> > > > Colin
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Wed, Aug 7, 2019, at 10:37, Mickael Maison wrote:
> > > > >> > > > > Thanks Colin for kickstarting this initiative.
> > > > >> > > > >
> > > > >> > > > > Just one question.
> > > > >> > > > > - A nice feature of Zookeeper is the ability to use chroots
> > and
> > > > >> have
> > > > >> > > > > several Kafka clusters use the same Zookeeper ensemble. Is
> > this
> > > > >> > > > > something we should keep?
> > > > >> > > > >
> > > > >> > > > > Thanks
> > > > >> > > > >
> > > > >> > > > > On Mon, Aug 5, 2019 at 7:44 PM Colin McCabe <
> > cmcc...@apache.org>
> > > > >> wrote:
> > > > >> > > > > >
> > > > >> > > > > > On Mon, Aug 5, 2019, at 10:02, Tom Bentley wrote:
> > > > >> > > > > > > Hi Colin,
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks for the KIP.
> > > > >> > > > > > >
> > > > >> > > > > > > Currently ZooKeeper provides a convenient notification
> > > > >> mechanism for
> > > > >> > > > > > > knowing that broker and topic configuration has
> > changed. While
> > > > >> > > > KIP-500 does
> > > > >> > > > > > > suggest that incremental metadata update is expected to
> > come
> > > > >> to
> > > > >> > > > clients
> > > > >> > > > > > > eventually, that would seem to imply that for some
> > number of
> > > > >> > > > releases there
> > > > >> > > > > > > would be no equivalent mechanism for knowing about
> > config
> > > > >> changes.
> > > > >> > > > Is there
> > > > >> > > > > > > any thinking at this point about how a similar
> > notification
> > > > >> might be
> > > > >> > > > > > > provided in the future?
> > > > >> > > > > >
> > > > >> > > > > > We could eventually have some inotify-like mechanism where
> > > > >> clients
> > > > >> > > > could register interest in various types of events and got
> > notified
> > > > >> when
> > > > >> > > > they happened.  Reading the metadata log is conceptually
> > simple.
> > > > >> The main
> > > > >> > > > complexity would be in setting up an API that made sense and
> > that
> > > > >> didn't
> > > > >> > > > unduly constrain future implementations.  We'd have to think
> > > > >> carefully
> > > > >> > > > about what the real use-cases for this were, though.
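> > > > >> > > > > >
> > > > >> > > > > > Just to illustrate the shape such an API might take (entirely
> > > > >> > > > > > hypothetical, not a proposal):
> > > > >> > > > > >
> > > > >> > > > > >     import java.util.Map;
> > > > >> > > > > >
> > > > >> > > > > >     // A client would register interest in a class of events
> > > > >> > > > > >     // and get called back when the metadata log records a
> > > > >> > > > > >     // matching change.
> > > > >> > > > > >     public interface MetadataEventListener {
> > > > >> > > > > >         void onConfigChange(String resourceName,
> > > > >> > > > > >                             Map<String, String> newConfig);
> > > > >> > > > > >     }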
> > > > >> > > > > >
> > > > >> > > > > > best,
> > > > >> > > > > > Colin
> > > > >> > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks,
> > > > >> > > > > > >
> > > > >> > > > > > > Tom
> > > > >> > > > > > >
> > > > >> > > > > > > On Mon, Aug 5, 2019 at 3:49 PM Viktor Somogyi-Vass <
> > > > >> > > > viktorsomo...@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Hey Colin,
> > > > >> > > > > > > >
> > > > >> > > > > > > > I think this is a long-awaited KIP, thanks for
> > driving it.
> > > > >> I'm
> > > > >> > > > excited to
> > > > >> > > > > > > > see this in Kafka eventually. I collected my questions (and
> > I
> > > > >> accept the
> > > > >> > > > "TBD"
> > > > >> > > > > > > > answer, as they might be a bit deep for this high-level
> > > > >> > > > > > > > discussion :) ).
> > > > >> > > > > > > > 1.) Are there any specific reasons for the Controller
> > > > >> > > > > > > > persisting its state to disk only periodically instead of
> > > > >> > > > asynchronously with
> > > > >> > > > > > > > every update? Wouldn't less frequent saves increase
> > the
> > > > >> chance for
> > > > >> > > > missing
> > > > >> > > > > > > > a state change if the controller crashes between two
> > saves?
> > > > >> > > > > > > > 2.) Why can't we allow brokers to fetch metadata from
> > the
> > > > >> follower
> > > > >> > > > > > > > controllers? I assume that followers would have
> > up-to-date
> > > > >> > > > > > > > information; therefore, brokers could fetch from them in theory.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thanks,
> > > > >> > > > > > > > Viktor
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <
> > > > >> > > > reluctanthero...@gmail.com>
> > > > >> > > > > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Thanks for explaining Ismael! Breaking down into
> > > > >> follow-up KIPs
> > > > >> > > > sounds
> > > > >> > > > > > > > like
> > > > >> > > > > > > > > a good idea.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <
> > > > >> ism...@juma.me.uk>
> > > > >> > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Hi Boyang,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Yes, there will be several KIPs that will discuss
> > the
> > > > >> items you
> > > > >> > > > > > > > describe
> > > > >> > > > > > > > > in
> > > > >> > > > > > > > > > detail. Colin, it may be helpful to make this
> > clear in
> > > > >> the KIP
> > > > >> > > > 500
> > > > >> > > > > > > > > > description.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Ismael
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <
> > > > >> > > > reluctanthero...@gmail.com
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks Colin for initiating this important
> > effort!
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > One question I have is whether we have a section
> > > > >> discussing
> > > > >> > > > the
> > > > >> > > > > > > > > > controller
> > > > >> > > > > > > > > > > failover in the new architecture? I know we are
> > using
> > > > >> Raft
> > > > >> > > > protocol
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > > failover, yet it's still valuable to discuss the
> > > > >> > > > > > > > > > > steps the new cluster will take to reach a stable
> > > > >> > > > > > > > > > > state again, so
> > that we
> > > > >> could
> > > > >> > > > easily
> > > > >> > > > > > > > > measure
> > > > >> > > > > > > > > > > the availability of the metadata servers.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Another suggestion I have is to write a
> > step-by-step
> > > > >> design
> > > > >> > > > doc like
> > > > >> > > > > > > > > what
> > > > >> > > > > > > > > > > we did in KIP-98
> > > > >> > > > > > > > > > > <
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > >
> > > > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
> > > > >> > > > > > > > > > > >,
> > > > >> > > > > > > > > > > including the new request protocols and how
> > > > >> > > > > > > > > > > they interact in
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > new
> > > > >> > > > > > > > > > > cluster. For a complicated change like this, an
> > > > >> > > > implementation design
> > > > >> > > > > > > > > doc
> > > > >> > > > > > > > > > > helps a lot in the review process; otherwise most
> > > > >> discussions
> > > > >> > > > we have
> > > > >> > > > > > > > > will
> > > > >> > > > > > > > > > > focus on the high level and lose important details
> > as we
> > > > >> > > > discover them in
> > > > >> > > > > > > > > the
> > > > >> > > > > > > > > > > post-agreement phase.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Boyang
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <
> > > > >> > > > cmcc...@apache.org>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Fri, Aug 2, 2019, at 16:33, Jose Armando
> > Garcia
> > > > >> Sancio
> > > > >> > > > wrote:
> > > > >> > > > > > > > > > > > > Thanks Colin for the detailed KIP. I have a
> > few
> > > > >> comments
> > > > >> > > > and
> > > > >> > > > > > > > > questions.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In the KIP's Motivation and Overview you
> > > > >> mentioned the
> > > > >> > > > > > > > LeaderAndIsr
> > > > >> > > > > > > > > > and
> > > > >> > > > > > > > > > > > > UpdateMetadata RPC. For example, "updates
> > which
> > > > >> the
> > > > >> > > > controller
> > > > >> > > > > > > > > > pushes,
> > > > >> > > > > > > > > > > > such
> > > > >> > > > > > > > > > > > > as LeaderAndIsr and UpdateMetadata
> > messages". Is
> > > > >> your
> > > > >> > > > thinking
> > > > >> > > > > > > > that
> > > > >> > > > > > > > > > we
> > > > >> > > > > > > > > > > > will
> > > > >> > > > > > > > > use MetadataFetch as a replacement for just
> > > > >> > > > > > > > > UpdateMetadata, and add topic configuration in this state?
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Hi Jose,
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks for taking a look.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > The goal is for MetadataFetchRequest to
> > replace both
> > > > >> > > > > > > > > > LeaderAndIsrRequest
> > > > >> > > > > > > > > > > > and UpdateMetadataRequest.  Topic
> > configurations
> > > > >> would be
> > > > >> > > > fetched
> > > > >> > > > > > > > > along
> > > > >> > > > > > > > > > > > with the other metadata.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In the section "Broker Metadata
> > Management", you
> > > > >> mention
> > > > >> > > > "Just
> > > > >> > > > > > > > like
> > > > >> > > > > > > > > > > with
> > > > >> > > > > > > > > > > > a
> > > > >> > > > > > > > > > > > > fetch request, the broker will track the
> > offset
> > > > >> of the
> > > > >> > > > last
> > > > >> > > > > > > > updates
> > > > >> > > > > > > > > > it
> > > > >> > > > > > > > > > > > > fetched". To keep the log consistent Raft
> > > > >> requires that
> > > > >> > > > the
> > > > >> > > > > > > > > followers
> > > > >> > > > > > > > > > > > keep
> > > > >> > > > > > > > > > > > > all of the log entries (term/epoch and
> > offset)
> > > > >> that are
> > > > >> > > > after the
> > > > >> > > > > > > > > > > > > highwatermark. Any log entry before the
> > > > >> highwatermark
> > > > >> > > > can be
> > > > >> > > > > > > > > compacted/snapshotted. Do we expect the
> > > > >> MetadataFetch API
> > > > >> > > > to only
> > > > >> > > > > > > > > return
> > > > >> > > > > > > > > > > log
> > > > >> > > > > > > > > > > > > entries up to the highwatermark, unlike
> > the Raft
> > > > >> > > > replication API
> > > > >> > > > > > > > > > which
> > > > >> > > > > > > > > > > > > will replicate/fetch log entries after the
> > > > >> highwatermark
> > > > >> > > > for
> > > > >> > > > > > > > > > consensus?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Good question.  Clearly, we shouldn't expose
> > > > >> metadata
> > > > >> > > > updates to
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > > > brokers until they've been stored on a
> > majority of
> > > > >> the
> > > > >> > > > Raft nodes.
> > > > >> > > > > > > > > The
> > > > >> > > > > > > > > > > > most obvious way to do that, like you
> > mentioned, is
> > > > >> to
> > > > >> > > > have the
> > > > >> > > > > > > > > brokers
> > > > >> > > > > > > > > > > > only fetch up to the HWM, but not beyond.
> > There
> > > > >> might be
> > > > >> > > > a more
> > > > >> > > > > > > > > clever
> > > > >> > > > > > > > > > > way
> > > > >> > > > > > > > > > > > to do it by fetching the data, but not having
> > the
> > > > >> brokers
> > > > >> > > > act on it
> > > > >> > > > > > > > > > until
> > > > >> > > > > > > > > > > > the HWM advances.  I'm not sure if that's
> > worth it
> > > > >> or
> > > > >> > > > not.  We'll
> > > > >> > > > > > > > > > discuss
> > > > >> > > > > > > > this more in a separate KIP that discusses just Raft.
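> > > > >> > > > > > >
> > > > >> > > > > > > A sketch of that "fetch but don't act" idea (hypothetical
> > > > >> > > > > > > names, not settled anywhere):
> > > > >> > > > > > >
> > > > >> > > > > > >     import java.util.ArrayDeque;
> > > > >> > > > > > >     import java.util.Deque;
> > > > >> > > > > > >
> > > > >> > > > > > >     class BufferedMetadataFetcher {
> > > > >> > > > > > >         record Entry(long offset, Runnable change) {}
> > > > >> > > > > > >
> > > > >> > > > > > >         private final Deque<Entry> buffered = new ArrayDeque<>();
> > > > >> > > > > > >
> > > > >> > > > > > >         // Records fetched beyond the HWM are buffered
> > > > >> > > > > > >         // rather than applied.
> > > > >> > > > > > >         void onFetched(Entry e) { buffered.add(e); }
> > > > >> > > > > > >
> > > > >> > > > > > >         // Once the HWM advances, apply everything at or
> > > > >> > > > > > >         // below it.
> > > > >> > > > > > >         void onHighWatermarkAdvance(long hwm) {
> > > > >> > > > > > >             while (!buffered.isEmpty()
> > > > >> > > > > > >                     && buffered.peek().offset() <= hwm) {
> > > > >> > > > > > >                 buffered.poll().change().run();
> > > > >> > > > > > >             }
> > > > >> > > > > > >         }
> > > > >> > > > > > >     }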
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In section "Broker Metadata Management", you
> > > > >> mention "the
> > > > >> > > > > > > > > controller
> > > > >> > > > > > > > > > > will
> > > > >> > > > > > > > > > > > > send a full metadata image rather than a
> > series of
> > > > >> > > > deltas". This
> > > > >> > > > > > > > > KIP
> > > > >> > > > > > > > > > > > > doesn't go into the set of operations that
> > need
> > > > >> to be
> > > > >> > > > supported
> > > > >> > > > > > > > on
> > > > >> > > > > > > > > > top
> > > > >> > > > > > > > > > > of
> > > > >> > > > > > > > > > > > > Raft, but it would be interesting if this "full
> > > > >> > > > > > > > > > > > > metadata image" could also be expressed as deltas.
> > > > >> > > > > > > > > > > > > For example, assuming we are replicating a map, this
> > > > >> > > > > > > > > > > > > "full metadata image" could be a sequence
> > of "put"
> > > > >> > > > operations
> > > > >> > > > > > > > > (znode
> > > > >> > > > > > > > > > > > create
> > > > >> > > > > > > > > > > > > to borrow ZK semantics).
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > The full image can definitely be expressed as
> > a sum
> > > > >> of
> > > > >> > > > deltas.  At
> > > > >> > > > > > > > > some
> > > > >> > > > > > > > > > > > point, the number of deltas will get large
> > enough
> > > > >> that
> > > > >> > > > sending a
> > > > >> > > > > > > > full
> > > > >> > > > > > > > > > > image
> > > > >> > > > > > > > > > > > is better, though.  One question that we're
> > still
> > > > >> thinking
> > > > >> > > > about is
> > > > >> > > > > > > > > how
> > > > >> > > > > > > > > > > > much of this can be shared with generic Kafka
> > log
> > > > >> code,
> > > > >> > > > and how
> > > > >> > > > > > > > much
> > > > >> > > > > > > > > > > should
> > > > >> > > > > > > > > > > > be different.
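> > > > >> > > > > > >
> > > > >> > > > > > > One illustrative policy (not decided anywhere) would be to
> > > > >> > > > > > > fall back to a full image whenever the requested offset has
> > > > >> > > > > > > already been compacted away:
> > > > >> > > > > > >
> > > > >> > > > > > >     class MetadataFetchHandler {
> > > > >> > > > > > >         // If the fetch offset precedes the log start
> > > > >> > > > > > >         // offset, the needed deltas no longer exist, so
> > > > >> > > > > > >         // send a snapshot instead.
> > > > >> > > > > > >         Object responseFor(long fetchOffset, long logStartOffset) {
> > > > >> > > > > > >             return fetchOffset < logStartOffset
> > > > >> > > > > > >                 ? buildFullImage()
> > > > >> > > > > > >                 : deltasSince(fetchOffset);
> > > > >> > > > > > >         }
> > > > >> > > > > > >
> > > > >> > > > > > >         Object buildFullImage() { return "full image"; }
> > > > >> > > > > > >         Object deltasSince(long offset) { return "deltas"; }
> > > > >> > > > > > >     }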
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In section "Broker Metadata Management", you
> > > > >> mention
> > > > >> > > > "This
> > > > >> > > > > > > > request
> > > > >> > > > > > > > > > will
> > > > >> > > > > > > > > > > > > double as a heartbeat, letting the
> > controller
> > > > >> know that
> > > > >> > > > the
> > > > >> > > > > > > > broker
> > > > >> > > > > > > > > is
> > > > >> > > > > > > > > > > > > alive". In section "Broker State Machine",
> > you
> > > > >> mention
> > > > >> > > > "The
> > > > >> > > > > > > > > > > MetadataFetch
> > > > >> > > > > > > > > > > > > API serves as this registration mechanism".
> > Does
> > > > >> this
> > > > >> > > > mean that
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > > > > MetadataFetch Request will optionally
> > include
> > > > >> broker
> > > > >> > > > > > > > configuration
> > > > >> > > > > > > > > > > > > information?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > I was originally thinking that the
> > > > >> MetadataFetchRequest
> > > > >> > > > should
> > > > >> > > > > > > > > include
> > > > >> > > > > > > > > > > > broker configuration information.  Thinking
> > about
> > > > >> this
> > > > >> > > > more, maybe
> > > > >> > > > > > > > we
> > > > >> > > > > > > > > > > > should just have a special registration RPC
> > that
> > > > >> contains
> > > > >> > > > that
> > > > >> > > > > > > > > > > information,
> > > > >> > > > > > > > > > > > to avoid sending it over the wire all the
> > time.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Does this also mean that MetadataFetch
> > request
> > > > >> will
> > > > >> > > > result in
> > > > >> > > > > > > > > > > > > a "write"/AppendEntries through the Raft
> > > > >> replication
> > > > >> > > > protocol
> > > > >> > > > > > > > > before
> > > > >> > > > > > > > > > > you
> > > > >> > > > > > > > > > > > > can send the associated MetadataFetch
> > Response?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > I think we should require the broker to be
> > out of
> > > > >> the
> > > > >> > > > Offline state
> > > > >> > > > > > > > > > > before
> > > > >> > > > > > > > > > > > allowing it to fetch metadata, yes.  So the
> > separate
> > > > >> > > > registration
> > > > >> > > > > > > > RPC
> > > > >> > > > > > > > > > > > should have completed first.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In section "Broker State", you mention that
> > a
> > > > >> broker can
> > > > >> > > > > > > > transition
> > > > >> > > > > > > > > > to
> > > > >> > > > > > > > > > > > > online after it is caught up with the
> > metadata. What
> > > > >> do you
> > > > >> > > > mean by
> > > > >> > > > > > > > > > this?
> > > > >> > > > > > > > > > > > > Metadata is always changing. How does the
> > broker
> > > > >> know
> > > > >> > > > that it is
> > > > >> > > > > > > > > > caught
> > > > >> > > > > > > > > > > > up
> > > > >> > > > > > > > > > > > > since it doesn't participate in the
> > consensus or
> > > > >> the
> > > > >> > > > advancement
> > > > >> > > > > > > > of
> > > > >> > > > > > > > > > the
> > > > >> > > > > > > > > > > > > highwatermark?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > That's a good point.  Being "caught up" is
> > somewhat
> > > > >> of a
> > > > >> > > > fuzzy
> > > > >> > > > > > > > > concept
> > > > >> > > > > > > > > > > > here, since the brokers do not participate in
> > the
> > > > >> metadata
> > > > >> > > > > > > > consensus.
> > > > >> > > > > > > > > > I
> > > > >> > > > > > > > > > > > think ideally we would want to define it in
> > terms
> > > > >> of time
> > > > >> > > > ("the
> > > > >> > > > > > > > > broker
> > > > >> > > > > > > > > > > has
> > > > >> > > > > > > > > > > > all the updates from the last 2 minutes", for
> > > > >> example.)
> > > > >> > > > We should
> > > > >> > > > > > > > > > spell
> > > > >> > > > > > > > > > > > this out better in the KIP.
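> > > > >> > > > > >
> > > > >> > > > > > A time-based check could be as simple as the following
> > > > >> > > > > > (threshold purely illustrative):
> > > > >> > > > > >
> > > > >> > > > > >     import java.util.concurrent.TimeUnit;
> > > > >> > > > > >
> > > > >> > > > > >     class CaughtUpCheck {
> > > > >> > > > > >         // "Caught up" = the broker's applied metadata is
> > > > >> > > > > >         // no more than two minutes stale.
> > > > >> > > > > >         static boolean caughtUp(long lastAppliedMs, long nowMs) {
> > > > >> > > > > >             return nowMs - lastAppliedMs
> > > > >> > > > > >                 <= TimeUnit.MINUTES.toMillis(2);
> > > > >> > > > > >         }
> > > > >> > > > > >     }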
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > In section "Start the controller quorum
> > nodes",
> > > > >> you
> > > > >> > > > mention "Once
> > > > >> > > > > > > > > it
> > > > >> > > > > > > > > > > has
> > > > >> > > > > > > > > > > > > taken over the /controller node, the active
> > > > >> controller
> > > > >> > > > will
> > > > >> > > > > > > > proceed
> > > > >> > > > > > > > > > to
> > > > >> > > > > > > > > > > > load
> > > > >> > > > > > > > > > > > > the full state of ZooKeeper.  It will write
> > out
> > > > >> this
> > > > >> > > > information
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > the
> > > > >> > > > > > > > > > > > > quorum's metadata storage.  After this
> > point, the
> > > > >> > > > metadata quorum
> > > > >> > > > > > > > > > will
> > > > >> > > > > > > > > > > be
> > > > >> > > > > > > > > > > > > the metadata store of record, rather than
> > the
> > > > >> data in
> > > > >> > > > ZooKeeper."
> > > > >> > > > > > > > > > > During
> > > > >> > > > > > > > > > > > > this migration, should we expect a small period of
> > > > >> > > > > > > > > > > > > controller unavailability while the controller
> > > > >> > > > > > > > > > > > > replicates this state to all of the Raft nodes in
> > > > >> > > > > > > > > > > > > the controller quorum, and we
> > buffer new
> > > > >> > > > controller API
> > > > >> > > > > > > > > > > requests?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Yes, the controller would be unavailable
> > during this
> > > > >> > > > time.  I don't
> > > > >> > > > > > > > > > think
> > > > >> > > > > > > > > > > > this will be that different from the current
> > period
> > > > >> of
> > > > >> > > > > > > > unavailability
> > > > >> > > > > > > > > > > when
> > > > >> > > > > > > > > > > > a new controller starts up and needs to load
> > the
> > > > >> full
> > > > >> > > > state from
> > > > >> > > > > > > > ZK.
> > > > >> > > > > > > > > > The
> > > > >> > > > > > > > > > > > main difference is that in this period, we'd
> > have
> > > > >> to write
> > > > >> > > > to the
> > > > >> > > > > > > > > > > > controller quorum rather than just to
> > memory.  But
> > > > >> we
> > > > >> > > > believe this
> > > > >> > > > > > > > > > should
> > > > >> > > > > > > > > > > > be pretty fast.
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > regards,
> > > > >> > > > > > > > > > > > Colin
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Thanks!
> > > > >> > > > > > > > > > > > > -Jose
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > > David Arthur
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > >
> >
>
