Thanks for the suggestions, Guozhang.

On Thu, Jun 11, 2020 at 2:51 PM Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Boyang,
>
> Thanks for the updated information. A few questions here:
>
> 1) Should the quorum-file also be updated to support multi-raft?
>
I'm neutral about this, as we don't know yet how the multi-raft modules
would behave. If we have different threads operating on different raft
groups, consolidating the `checkpoint` files does not seem reasonable. We
could always add a `multi-quorum-file` later if needed.

> 2) In the previous proposal, there are fields in the FetchQuorumRecords
> like latestDirtyOffset, is that dropped intentionally?

I dropped the latestDirtyOffset since it is associated with the log
compaction discussion. That is beyond the scope of this KIP, and we could
potentially start a separate KIP to talk about it.


> 3) I think we also need to elaborate in a bit more detail regarding when
> to send the metadata request and discover-brokers; currently we only
> discussed how these requests would be sent during bootstrap. I think the
> following scenarios would also need these requests:
>
> 3.a) As long as a broker does not know the current quorum (including the
> leader and the voters), it should continue to periodically ask other
> brokers via "Metadata".
> 3.b) As long as a broker does not know all the current quorum voters'
> connections, it should continue to periodically ask other brokers via
> "DiscoverBrokers".
> 3.c) When the leader's fetch timeout elapses, it should send a Metadata
> request.
>
Makes sense, will add to the KIP.
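
For illustration, the three conditions could translate into a periodic
check along the following lines (a rough sketch; the class and method
names are hypothetical, not from the KIP):

    // Hypothetical sketch of 3.a-3.c; all names are illustrative.
    void maybeSendDiscoveryRequests(long nowMs) {
        // 3.a: leader or voter set unknown -> keep asking via Metadata
        if (!quorum.hasLeader() || !quorum.hasVoters())
            send(newMetadataRequest(), nowMs);

        // 3.b: a voter's connection unknown -> keep asking via DiscoverBrokers
        if (!connections.containsAll(quorum.voterIds()))
            send(newDiscoverBrokersRequest(), nowMs);

        // 3.c: leader's fetch timeout elapsed -> probe with a Metadata request
        if (quorum.isLeader() && nowMs - lastFetchReceivedMs >= fetchTimeoutMs)
            send(newMetadataRequest(), nowMs);
    }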

>
> Guozhang
>
>
> On Wed, Jun 10, 2020 at 5:20 PM Boyang Chen <reluctanthero...@gmail.com>
> wrote:
>
> > Hey all,
> >
> > follow-up on the previous email, we made some more updates:
> >
> > 1. The Alter/DescribeQuorum RPCs are also re-structured to use
> > multi-raft.
> >
> > 2. We added observer status to the DescribeQuorumResponse, as we see it
> > as low-hanging fruit that is very useful for user debugging and
> > reassignment.
> >
> > 3. The FindQuorum RPC is replaced with the DiscoverBrokers RPC, which
> > is purely in charge of discovering broker connections in a gossip
> > manner. Quorum leader discovery is piggy-backed on the Metadata RPC
> > for the topic partition leader, which in our case is the single
> > metadata partition in version one.
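> >
> > As a rough sketch, resolving the quorum leader through Metadata could
> > look like the following (the topic name and helper methods are
> > illustrative only, not from the KIP):
> >
> >     // Hypothetical: ask for the metadata partition's leader, then fetch.
> >     MetadataResponse response = send(newMetadataRequest("__cluster_metadata"));
> >     OptionalInt leaderId = response.partitionLeader("__cluster_metadata", 0);
> >     leaderId.ifPresent(id -> fetchFromLeader(id));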
> >
> > Let me know if you have any questions.
> >
> > Boyang
> >
> > On Tue, Jun 9, 2020 at 11:01 PM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Hey all,
> > >
> > > Thanks for the great discussions so far. I'm posting some KIP updates
> > from
> > > our working group discussion:
> > >
> > > 1. We will be changing the core RPCs from a single-raft API to
> > > multi-raft. This means all protocols will be "batched" in the first
> > > version, but the KIP itself only illustrates the design for a single
> > > metadata topic partition. The reason is to "keep the door open" for
> > > future extensions of this module, such as a sharded controller or
> > > general quorum-based topic replication, beyond the current Kafka
> > > replication protocol.
> > >
> > > 2. We will piggy-back on the current Kafka Fetch API instead of
> > > inventing a new FetchQuorumRecords RPC. The motivation is about the
> > > same as #1, as well as making the integration work easier, instead of
> > > letting two similar RPCs diverge.
> > >
> > > 3. In the EndQuorumEpoch protocol, instead of only sending the
> > > request to the most caught-up voter, we shall broadcast the
> > > information to all voters, with a sorted voter list in descending
> > > order of their corresponding replicated offsets. In this way, the top
> > > voter will become a candidate immediately, while the other voters
> > > wait for an exponential back-off to trigger elections. This helps
> > > ensure the top voter gets elected, and that an election eventually
> > > happens even when the top voter is not responsive.
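> > >
> > > To make the back-off concrete, a voter's reaction to EndQuorumEpoch
> > > could be sketched as below (the names and the exact back-off formula
> > > are illustrative only):
> > >
> > >     // Hypothetical: rank 0 (most caught-up) elects immediately;
> > >     // lower-ranked voters back off exponentially by position.
> > >     long electionBackoffMs(List<Integer> sortedVoters, int localId) {
> > >         int rank = sortedVoters.indexOf(localId);
> > >         if (rank <= 0)
> > >             return 0L; // top voter becomes a candidate immediately
> > >         return Math.min(baseBackoffMs * (1L << rank), maxBackoffMs);
> > >     }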
> > >
> > > Please see the updated KIP and post any questions or concerns on the
> > > mailing thread.
> > >
> > > Boyang
> > >
> > > On Fri, May 8, 2020 at 5:26 PM Jun Rao <j...@confluent.io> wrote:
> > >
> > >> Hi, Guozhang and Jason,
> > >>
> > >> Thanks for the reply. A couple of more replies.
> > >>
> > >> 102. Still not sure about this. How is the tombstone issue
> > >> addressed for the non-voter and the observer? They can die at any
> > >> point and restart at an arbitrary later time, and the advancing of
> > >> the firstDirty offset and the removal of the tombstone can happen
> > >> independently.
> > >>
> > >> 106. I agree that it would be less confusing if we used "epoch"
> > >> instead of "leader epoch" consistently.
> > >>
> > >> Jun
> > >>
> > >> On Thu, May 7, 2020 at 4:04 PM Guozhang Wang <wangg...@gmail.com>
> > wrote:
> > >>
> > >> > Thanks Jun. Further replies are in-lined.
> > >> >
> > >> > On Mon, May 4, 2020 at 11:58 AM Jun Rao <j...@confluent.io> wrote:
> > >> >
> > >> > > Hi, Guozhang,
> > >> > >
> > >> > > Thanks for the reply. A few more replies inlined below.
> > >> > >
> > >> > > On Sun, May 3, 2020 at 6:33 PM Guozhang Wang <wangg...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hello Jun,
> > >> > > >
> > >> > > > Thanks for your comments! I'm replying inline below:
> > >> > > >
> > >> > > > On Fri, May 1, 2020 at 12:36 PM Jun Rao <j...@confluent.io>
> wrote:
> > >> > > >
> > >> > > > > 101. Bootstrapping related issues.
> > >> > > > > 101.1 Currently, we support auto broker id generation. Is
> > >> > > > > this supported for bootstrap brokers?
> > >> > > > >
> > >> > > >
> > >> > > > The voter ids would just be the broker ids. "bootstrap.servers"
> > >> > > > would be similar to what client configs have today, whereas
> > >> > > > "quorum.voters" would be pre-defined config values.
> > >> > > >
> > >> > > >
> > >> > > My question was on the auto-generated broker id. Currently, the
> > >> > > broker can choose to have its broker id auto-generated. The
> > >> > > generation is done through ZK to guarantee uniqueness. Without
> > >> > > ZK, it's not clear how the broker id is auto-generated.
> > >> > > "quorum.voters" also can't be set statically if broker ids are
> > >> > > auto-generated.
> > >> > >
> > >> > Jason has explained some ideas that we've discussed so far. The
> > >> > reason we intentionally did not include them is that we feel they
> > >> > are outside the scope of KIP-595; under the umbrella of KIP-500 we
> > >> > should definitely address them, though.
> > >> >
> > >> > At a high level, our belief is that "joining a quorum" and
> > >> > "joining (or more specifically, registering brokers in) the
> > >> > cluster" would be de-coupled a bit, where the former should be
> > >> > completed before we do the latter. More specifically, assuming the
> > >> > quorum is already up and running, after the newly started broker
> > >> > finds the leader of the quorum it can send a specific
> > >> > RegisterBroker request including its listener / protocol / etc.,
> > >> > and upon handling it the leader can send back the uniquely
> > >> > generated broker id to the new broker, while also executing the
> > >> > "startNewBroker" callback as the controller.
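> > >> >
> > >> > A minimal sketch of that leader-side handling (RegisterBroker is
> > >> > not yet specified in the KIP, so all names here are illustrative):
> > >> >
> > >> >     // Hypothetical: only the single leader assigns ids, so
> > >> >     // uniqueness comes for free.
> > >> >     synchronized RegisterBrokerResponse handleRegisterBroker(
> > >> >             RegisterBrokerRequest request) {
> > >> >         int brokerId = nextBrokerId++;
> > >> >         registry.put(brokerId, request.listeners());
> > >> >         startNewBroker(brokerId); // the controller callback from above
> > >> >         return new RegisterBrokerResponse(brokerId);
> > >> >     }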
> > >> >
> > >> >
> > >> > >
> > >> > > > > 102. Log compaction. One weak spot of log compaction is for
> > >> > > > > the consumer to deal with deletes. When a key is deleted,
> > >> > > > > it's retained as a tombstone first and then physically
> > >> > > > > removed. If a client misses the tombstone (because it's
> > >> > > > > physically removed), it may not be able to update its
> > >> > > > > metadata properly. The way we solve this in Kafka is based on
> > >> > > > > a configuration (log.cleaner.delete.retention.ms): we expect
> > >> > > > > a consumer that has seen an old key to finish reading the
> > >> > > > > deletion tombstone within that time. There is no strong
> > >> > > > > guarantee for that, since a broker could be down for a long
> > >> > > > > time. It would be better if we could have a more reliable way
> > >> > > > > of dealing with deletes.
> > >> > > > >
> > >> > > >
> > >> > > > We propose to capture this in the "FirstDirtyOffset" field of
> > >> > > > the quorum record fetch response: the offset is the maximum
> > >> > > > offset that log compaction has reached up to. If the follower
> > >> > > > has fetched beyond this offset, it is safe since it has seen
> > >> > > > all records up to that offset. On getting the response, the
> > >> > > > follower can then decide if its end offset is actually below
> > >> > > > that dirty offset (and hence it may miss some tombstones). If
> > >> > > > that's the case:
> > >> > > >
> > >> > > > 1) Naively, it could re-bootstrap the metadata log from the
> > >> > > > very beginning to catch up.
> > >> > > > 2) During that time, it would refrain from answering
> > >> > > > MetadataRequest from any clients.
> > >> > > >
> > >> > > >
> > >> > > I am not sure that the "FirstDirtyOffset" field fully addresses
> > >> > > the issue. Currently, the deletion tombstone is not removed
> > >> > > immediately after a round of cleaning. It's removed after a
> > >> > > delay in a subsequent round of cleaning. Consider an example
> > >> > > where a key insertion is at offset 200 and a deletion tombstone
> > >> > > of the key is at 400. Initially, FirstDirtyOffset is at 300. A
> > >> > > follower/observer fetches from offset 0 and fetches the key at
> > >> > > offset 200. A few rounds of cleaning happen. FirstDirtyOffset is
> > >> > > at 500 and the tombstone at 400 is physically removed. The
> > >> > > follower/observer continues the fetch but misses offset 400. It
> > >> > > catches up all the way to FirstDirtyOffset and declares its
> > >> > > metadata as ready. However, its metadata could be stale since it
> > >> > > actually missed the deletion of the key.
> > >> > >
> > >> > Yeah, good question, I should have put more details in my
> > >> > explanation :)
> > >> >
> > >> > The idea is that we will adjust the log compaction for this
> > >> > raft-based metadata log. Before getting into more details, note
> > >> > that we have two types of "watermarks" here: in Kafka, the
> > >> > watermark indicates where every replica has replicated up to,
> > >> > whereas in Raft the watermark indicates where the majority of the
> > >> > replicas (here only the voters of the quorum, not counting
> > >> > observers) have replicated up to. Let's call them the Kafka
> > >> > watermark and the Raft watermark. For this special log, we would
> > >> > maintain both watermarks.
> > >> >
> > >> > When compacting the log on the leader, we would only compact up to
> > >> > the Kafka watermark, i.e. if there is at least one voter who has
> > >> > not replicated an entry, it would not be compacted. The
> > >> > "dirty-offset" is the offset that we've compacted up to; it is
> > >> > communicated to the other voters, and the other voters would also
> > >> > compact up to this value. That is, instead of letting each replica
> > >> > do log compaction independently, we have the leader decide which
> > >> > offset to compact to and propagate this value for the others to
> > >> > follow, in a more coordinated manner. Also note that when new
> > >> > voters join the quorum who have not replicated up to the
> > >> > dirty-offset, or when voters have truncated their logs to below
> > >> > the dirty-offset because of other issues, they'd have to
> > >> > re-bootstrap from the beginning. During this period of time the
> > >> > leader, having learned about the lagging voter, would not advance
> > >> > the watermark (nor would it decrement it), and hence would not
> > >> > compact either, until the voter(s) have caught up to that
> > >> > dirty-offset.
> > >> >
> > >> > So back to your example above: before the bootstrapping voter gets
> > >> > to 300, no log compaction would happen on the leader; only later,
> > >> > once the voter has gotten beyond 400 and hence replicated that
> > >> > tombstone, could log compaction reach that tombstone and remove
> > >> > it. Say the leader's log compaction later reaches 500; it can send
> > >> > this back to the voter, which can then also compact locally up to
> > >> > 500.
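> > >> >
> > >> > In other words, the follower-side reaction could look roughly like
> > >> > this sketch (the field and method names are illustrative only):
> > >> >
> > >> >     // Hypothetical: follower handling of the leader's dirty offset.
> > >> >     void onFetchResponse(FetchQuorumRecordsResponse response) {
> > >> >         if (log.endOffset() < response.firstDirtyOffset()) {
> > >> >             // May have missed tombstones: re-bootstrap from offset 0
> > >> >             // and stop answering MetadataRequest until caught up.
> > >> >             log.truncateFullyAndStartAt(0L);
> > >> >         } else {
> > >> >             // Safe: compact locally up to the leader-chosen offset.
> > >> >             log.compactUpTo(response.firstDirtyOffset());
> > >> >         }
> > >> >     }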
> > >> >
> > >> >
> > >> > > > > 105. Quorum State: In addition to VotedId, do we need the
> > >> > > > > epoch corresponding to VotedId? Over time, the same broker id
> > >> > > > > could be voted for in different generations with different
> > >> > > > > epochs.
> > >> > > > >
> > >> > > > >
> > >> > > > Hmm, this is a good point. Originally I think the
> > >> > > > "LeaderEpoch" field in that file corresponds to the "latest
> > >> > > > known leader epoch", not the "current leader epoch". For
> > >> > > > example, if the current epoch is N, and then a vote-request
> > >> > > > with epoch N+1 is received and the voter granted the vote for
> > >> > > > it, then it means this voter knows the "latest epoch" is N+1,
> > >> > > > although it is unknown if the sending candidate will indeed
> > >> > > > become the new leader (which would only be notified via a
> > >> > > > begin-quorum request). However, when persisting the quorum
> > >> > > > state, we would encode the leader-epoch as N+1, while the
> > >> > > > leaderId would be the older leader.
> > >> > > >
> > >> > > > But now thinking about this a bit more, I feel we should use
> > >> > > > two separate epochs, one for the "latest known" and one for the
> > >> > > > "current" to pair with the leaderId. I will update the wiki
> > >> > > > page.
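> > >> > > >
> > >> > > > For example, the persisted quorum state could then look roughly
> > >> > > > like this (field names illustrative, pending the wiki update):
> > >> > > >
> > >> > > >     // Hypothetical layout of the quorum-state file contents.
> > >> > > >     class QuorumStateData {
> > >> > > >         int leaderId;         // last known leader
> > >> > > >         int leaderEpoch;      // "current" epoch, paired with leaderId
> > >> > > >         int votedId;          // candidate granted our vote, or -1
> > >> > > >         int latestKnownEpoch; // e.g. N+1 after granting that vote
> > >> > > >     }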
> > >> > > >
> > >> > > >
> > >> > > Hmm, it's kind of weird to bump up the leader epoch before the
> > >> > > new leader is actually elected, right?
> > >> > >
> > >> > >
> > >> > > > > 106. "OFFSET_OUT_OF_RANGE: Used in the FetchQuorumRecords API
> to
> > >> > > indicate
> > >> > > > > that the follower has fetched from an invalid offset and
> should
> > >> > > truncate
> > >> > > > to
> > >> > > > > the offset/epoch indicated in the response." Observers can't
> > >> truncate
> > >> > > > their
> > >> > > > > logs. What should they do with OFFSET_OUT_OF_RANGE?
> > >> > > > >
> > >> > > > >
> > >> > > > I'm not sure I understand your question. Observers should
> > >> > > > still be able to truncate their logs as well.
> > >> > > >
> > >> > > >
> > >> > > Hmm, I thought only the quorum nodes have local logs and
> > >> > > observers don't?
> > >> > >
> > >> > > > 107. "The leader will continue sending BeginQuorumEpoch to each
> > >> known
> > >> > > > voter
> > >> > > > > until it has received its endorsement." If a voter is down
> for a
> > >> long
> > >> > > > time,
> > >> > > > > sending BeginQuorumEpoch seems to add unnecessary overhead.
> > >> > Similarly,
> > >> > > > if a
> > >> > > > > follower stops sending FetchQuorumRecords, does the leader
> keep
> > >> > sending
> > >> > > > > BeginQuorumEpoch?
> > >> > > > >
> > >> > > >
> > >> > > > Regarding BeginQuorumEpoch: that is a good point. The
> > >> > > > begin-quorum-epoch request is for voters to quickly get the
> > >> > > > new leader information; however, even if they do not get it,
> > >> > > > they can still eventually learn about it from others via
> > >> > > > gossiping FindQuorum. I think we can adjust the logic to e.g.
> > >> > > > an exponential back-off or a limited num.retries.
> > >> > > >
> > >> > > > Regarding FetchQuorumRecords: if the follower sends
> > >> > > > FetchQuorumRecords already, it means that the follower already
> > >> > > > knows that the broker is the leader, and hence we can stop
> > >> > > > retrying BeginQuorumEpoch; however, it is possible that after
> > >> > > > a follower has sent FetchQuorumRecords, it suddenly stops
> > >> > > > sending it (possibly because it learned about a higher-epoch
> > >> > > > leader), in which case this broker may be a "zombie" leader.
> > >> > > > We propose to use the fetch.timeout to let the leader try to
> > >> > > > verify whether it has become stale.
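> > >> > > >
> > >> > > > That leader-side staleness check could be sketched as follows
> > >> > > > (names illustrative only):
> > >> > > >
> > >> > > >     // Hypothetical zombie-leader detection on the leader.
> > >> > > >     void maybeCheckStaleness(long nowMs) {
> > >> > > >         if (nowMs - lastFetchFromMajorityMs >= fetchTimeoutMs) {
> > >> > > >             // No fetches from a majority within fetch.timeout:
> > >> > > >             // probe other voters for a newer epoch before resigning.
> > >> > > >             sendFindQuorumToRandomVoter();
> > >> > > >         }
> > >> > > >     }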
> > >> > > >
> > >> > > >
> > >> > > It just seems that we should handle these two cases in a
> > >> > > consistent way?
> > >> > >
> > >> > Yes, I agree. On the leader's side, the FetchQuorumRecords from a
> > >> > follower could mean that we no longer need to send BeginQuorumEpoch
> > >> > --- and it is already part of our current implementation in
> > >> > https://github.com/confluentinc/kafka/commits/kafka-raft
> > >> >
> > >> >
> > >> > > Thanks,
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > > >
> > >> > > > >
> > >> > > > > Jun
> > >> > > > >
> > >> > > > > On Wed, Apr 29, 2020 at 8:56 PM Guozhang Wang
> > >> > > > > <wangg...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > Hello Leonard,
> > >> > > > > >
> > >> > > > > > Thanks for your comments, I'm replying inline below:
> > >> > > > > >
> > >> > > > > > On Wed, Apr 29, 2020 at 1:58 AM Wang (Leonard) Ge
> > >> > > > > > <w...@confluent.io> wrote:
> > >> > > > > >
> > >> > > > > > > Hi Kafka developers,
> > >> > > > > > >
> > >> > > > > > > It's great to see this proposal and it took me some time
> > >> > > > > > > to finish reading it.
> > >> > > > > > >
> > >> > > > > > > And I have the following questions about the Proposal:
> > >> > > > > > >
> > >> > > > > > >    - How do we plan to test this design to ensure its
> > >> > > > > > >    correctness? Or, more broadly, how do we ensure that
> > >> > > > > > >    our new ‘pull’ based model is functional and correct,
> > >> > > > > > >    given that it is different from the original RAFT
> > >> > > > > > >    implementation which has a formal proof of
> > >> > > > > > >    correctness?
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > We have two planned verifications of the correctness and
> > >> > > > > > liveness of the design. One is via model verification
> > >> > > > > > (TLA+):
> > >> > > > > > https://github.com/guozhangwang/kafka-specification
> > >> > > > > >
> > >> > > > > > Another is via concurrent simulation tests:
> > >> > > > > > https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
> > >> > > > > >
> > >> > > > > > >    - Have we considered any sensible defaults for the
> > >> > > > > > >    configuration, i.e. all the election timeouts, fetch
> > >> > > > > > >    timeouts, etc.? Or do we want to leave this to a later
> > >> > > > > > >    stage when we do the performance testing, etc.?
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > This is a good question. The reason we did not set any
> > >> > > > > > default values for the timeout configurations is that we
> > >> > > > > > think it may take some benchmarking experiments to get
> > >> > > > > > these defaults right. Some high-level principles to
> > >> > > > > > consider: 1) the fetch.timeout should be on the same scale
> > >> > > > > > as the zk session timeout, which is now 18 seconds by
> > >> > > > > > default -- in practice we've seen unstable networks with
> > >> > > > > > more than 10 secs of transient connectivity issues; 2) the
> > >> > > > > > election.timeout, however, should be smaller than the fetch
> > >> > > > > > timeout, as is also suggested as a practical optimization
> > >> > > > > > in the literature:
> > >> > > > > > https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
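> > >> > > > > >
> > >> > > > > > Purely to illustrate the two principles (these are NOT
> > >> > > > > > proposed defaults; the values are made up, pending the
> > >> > > > > > benchmarking):
> > >> > > > > >
> > >> > > > > >     // Hypothetical values, for illustration only.
> > >> > > > > >     props.put("quorum.fetch.timeout.ms", "15000");   // ~ zk session scale
> > >> > > > > >     props.put("quorum.election.timeout.ms", "5000"); // < fetch timeout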
> > >> > > > > >
> > >> > > > > > Some more discussions can be found here:
> > >> > > > > > https://github.com/confluentinc/kafka/pull/301/files#r415420081
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > >    - Have we considered piggybacking `BeginQuorumEpoch`
> > >> > > > > > >    with the `FetchQuorumRecords`? I might be missing
> > >> > > > > > >    something obvious, but I am just wondering why we
> > >> > > > > > >    don’t just use the `FindQuorum` and
> > >> > > > > > >    `FetchQuorumRecords` APIs and remove the
> > >> > > > > > >    `BeginQuorumEpoch` API?
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > Note that Begin/EndQuorumEpoch is sent from the leader to
> > >> > > > > > the other voters, while FindQuorum / Fetch are sent from a
> > >> > > > > > follower to the leader. Arguably, one could eventually
> > >> > > > > > realize the new leader and epoch via gossiping FindQuorum,
> > >> > > > > > but that could in practice require a long delay. Having a
> > >> > > > > > leader -> other voters request helps the new leader epoch
> > >> > > > > > be propagated faster under a pull model.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > >    - And about the `FetchQuorumRecords` response schema:
> > >> > > > > > >    in the `Records` field of the response, is it just one
> > >> > > > > > >    record or all the records starting from the
> > >> > > > > > >    FetchOffset? It seems a lot more efficient if we sent
> > >> > > > > > >    all the records during the bootstrapping of the
> > >> > > > > > >    brokers.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > Yes, the fetching is batched: FetchOffset is just the
> > >> > > > > > starting offset of the batch of records.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > >    - Regarding the disruptive broker issue, does our
> > >> > > > > > >    pull-based model suffer from it? If so, have we
> > >> > > > > > >    considered the Pre-Vote stage? If not, why?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > The disruptive broker problem is described in the original
> > >> > > > > > Raft paper and is a result of the push-model design. Our
> > >> > > > > > analysis showed that with the pull model it is no longer an
> > >> > > > > > issue.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > > Thanks a lot for putting this up, and I hope that my
> > >> > > > > > > questions can be of some value to make this KIP better.
> > >> > > > > > >
> > >> > > > > > > Hope to hear from you soon!
> > >> > > > > > >
> > >> > > > > > > Best wishes,
> > >> > > > > > > Leonard
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Wed, Apr 29, 2020 at 1:46 AM Colin McCabe
> > >> > > > > > > <cmcc...@apache.org> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Jason,
> > >> > > > > > > >
> > >> > > > > > > > It's amazing to see this coming together :)
> > >> > > > > > > >
> > >> > > > > > > > I haven't had a chance to read in detail, but I read
> > >> > > > > > > > the outline and a few things jumped out at me.
> > >> > > > > > > >
> > >> > > > > > > > First, for every epoch that is 32 bits rather than 64,
> > >> > > > > > > > I sort of wonder if that's a good long-term choice. I
> > >> > > > > > > > keep reading about stuff like this:
> > >> > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1277 .
> > >> > > > > > > > Obviously, that JIRA is about zxid, which increments
> > >> > > > > > > > much faster than we expect these leader epochs to, but
> > >> > > > > > > > it would still be good to see some rough calculations
> > >> > > > > > > > about how long 32 bits (or really, 31 bits) will last
> > >> > > > > > > > us in the cases where we're using it, and what the
> > >> > > > > > > > space savings we're really getting. It seems like in
> > >> > > > > > > > most cases the tradeoff may not be worth it?
> > >> > > > > > > >
> > >> > > > > > > > Another thing I've been thinking about is how we do
> > >> > > > > > > > bootstrapping. I would prefer to be in a world where
> > >> > > > > > > > formatting a new Kafka node was a first-class operation
> > >> > > > > > > > explicitly initiated by the admin, rather than
> > >> > > > > > > > something that happened implicitly when you started up
> > >> > > > > > > > the broker and things "looked blank."
> > >> > > > > > > >
> > >> > > > > > > > The first problem is that things can "look blank"
> > >> > > > > > > > accidentally if the storage system is having a bad day.
> > >> > > > > > > > Clearly, in the non-Raft world, this leads to data loss
> > >> > > > > > > > if the broker that is (re)started this way was the
> > >> > > > > > > > leader for some partitions.
> > >> > > > > > > >
> > >> > > > > > > > The second problem is that we have a bit of a
> > >> > > > > > > > chicken-and-egg problem with certain configuration
> > >> > > > > > > > keys. For example, maybe you want to configure some
> > >> > > > > > > > connection security settings in your cluster, but you
> > >> > > > > > > > don't want them to ever be stored in a plaintext config
> > >> > > > > > > > file. (For example, SCRAM passwords, etc.) You could
> > >> > > > > > > > use a broker API to set the configuration, but that
> > >> > > > > > > > brings up the chicken-and-egg problem. The broker needs
> > >> > > > > > > > to be configured to know how to talk to you, but you
> > >> > > > > > > > need to configure it before you can talk to it. Using
> > >> > > > > > > > an external secret manager like Vault is one way to
> > >> > > > > > > > solve this, but not everyone uses an external secret
> > >> > > > > > > > manager.
> > >> > > > > > > >
> > >> > > > > > > > quorum.voters seems like a similar configuration key.
> > >> > > > > > > > In the current KIP, this is only read if there is no
> > >> > > > > > > > other configuration specifying the quorum voter set. If
> > >> > > > > > > > we had a kafka.mkfs command, we wouldn't need this key
> > >> > > > > > > > because we could assume that there was always quorum
> > >> > > > > > > > information stored locally.
> > >> > > > > > > >
> > >> > > > > > > > best,
> > >> > > > > > > > Colin
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Apr 16, 2020, at 16:44, Jason Gustafson wrote:
> > >> > > > > > > > > Hi All,
> > >> > > > > > > > >
> > >> > > > > > > > > I'd like to start a discussion on KIP-595:
> > >> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > >> > > > > > > > > This proposal specifies a Raft protocol to
> > >> > > > > > > > > ultimately replace Zookeeper, as documented in
> > >> > > > > > > > > KIP-500. Please take a look and share your thoughts.
> > >> > > > > > > > >
> > >> > > > > > > > > A few minor notes to set the stage a little bit:
> > >> > > > > > > > >
> > >> > > > > > > > > - This KIP does not specify the structure of the
> > >> > > > > > > > > messages used to represent metadata in Kafka, nor
> > >> > > > > > > > > does it specify the internal API that will be used by
> > >> > > > > > > > > the controller. Expect these to come in later
> > >> > > > > > > > > proposals. Here we are primarily concerned with the
> > >> > > > > > > > > replication protocol and basic operational mechanics.
> > >> > > > > > > > > - We expect many details to change as we get closer
> > >> > > > > > > > > to integration with the controller. Any changes we
> > >> > > > > > > > > make will be made either as amendments to this KIP
> > >> > > > > > > > > or, in the case of larger changes, as new proposals.
> > >> > > > > > > > > - We have a prototype implementation, which I will
> > >> > > > > > > > > put online within the next week, that may help in
> > >> > > > > > > > > understanding some details. It has diverged a little
> > >> > > > > > > > > bit from our proposal, so I am taking a little time
> > >> > > > > > > > > to bring it in line. I'll post an update to this
> > >> > > > > > > > > thread when it is available for review.
> > >> > > > > > > > >
> > >> > > > > > > > > Finally, I want to mention that this proposal was
> > >> > > > > > > > > drafted by myself, Boyang Chen, and Guozhang Wang.
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks,
> > >> > > > > > > > > Jason
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > Leonard Ge
> > >> > > > > > > Software Engineer Intern - Confluent
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > -- Guozhang
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > -- Guozhang
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > -- Guozhang
> > >> >
> > >>
> > >
> >
>
>
> --
> -- Guozhang
>
