Re: [DISCUSS] KIP-595: A Raft Protocol for the Metadata Quorum

Tom Bentley Tue, 14 Jul 2020 04:11:19 -0700

Hi Jason, Ghouzang and Boyang,

First, thanks for the KIP. I've not a few minor suggestions and nits, but
on the whole it was pretty clear and understandable.


1. § Motivation

    "We have intentionally avoided any assumption about the representation
of the log and its semantics."

I found this a bit confusing because there's a whole section about the
assumed log structure later on. I think something like:

    "We make no assumption about the physical log format and minimal
assumptions about it's logical structure (see §Log Structure)."

might make things a little clearer, though that suggestion does omit any
claim about semantics.

Furthermore, in the "Log Structure" section it ends with "Kafka's current
v2 message format version supports everything we need, so we will assume
that."
All you need to say here is that "the v2 message format has the
assumed/required properties", saying you're assuming the v2 message format
negates the rest of the paragraph.


2. § Configurations

The format for voters in the `quorum.voters` is potentially a little
confusing because the separation of the broker id from the broker host name
with a dot means the whole thing is syntactically indistinguishable from a
host name. Using a different separator would avoid such ambiguity.


3. § Leader Election and Data Replication

    "The key functionalities of any consensus protocol are leader election
and data replication. The protocol for these two functionalities consists
of 5 core RPCs:"

This sentence is followed by a list of 4 (rather than 5) items. I don't
think you meant to imply DescribeQuorum is a core RPC.


4. § BeginQuorumEpoch

    "observers will discover the new leader through either
the DiscoverBrokers or Fetch APIs"

DiscoverBrokers isn't defined in this KIP (I think it was when I read an
earlier version), and I don't recall seeing another KIP which describes it.


5. § EndQuorumEpoch

    "If the node's priority is highest, it will become candidate
immediately instead of waiting for next poll."

This is clarified in the later section on EndquorumEpoch request handling,
but I found the wording " instead of waiting for next poll" to be
confusing. Maybe "instead of waiting for next election timeout" would be
clearer?


6. § Fetch

The diagram from the Raft dissertation is a broken link (I'm guessing for
anyone lacking access to https://confluentinc.atlassian.net).


7. § Fetch

    "This proposal also extends the Fetch response to include a separate
field for the receiver of the request to indicate the current leader."

I think "receiver of the request" should be "receiver of the response",
shouldn't it?


8. § Fetch § Response Schema

The `NextOffsetAndEpoch` and `CurrentLeader` field in the fetch response
lack `about` properties.


9. § FetchRequest Handling

Item 1.a. is covering only the case where the fetcher is a voter (thus in
receipt of BeginQuorumEpoch), it doesn't cover observer fetchers.

Also this algorithm talks about "the leader epoch" being larger or smaller,
but it's not immediately clear whether this is the leader's epoch or the
leader epoch in the fetch request and which is the epoch being compared
against. It makes the comparisons harder to understand if you have to guess
the order of the operands.


10. § Discussion: Replication Progress Timeout for Zombie Leader

    "Note that the node will remain a leader until it finds that it has
been supplanted by another voter."

I don't quite follow this: Why is a _voter_ involved here? Surely it's
simply a matter of the leader observing that a new leader exists which
determines whether it steps down.


11. § DescribeQuorum

This seems to be defined also in KIP-642.


12. § Tooling support

`kafka-metadata-quorum.sh`: Given that the protocol is essentially
independent of the partitions for which the quorum is established
it seems a little odd to name the tool so that it can't later be extended
to support quorums for other partitions.
For example why not `kafka-quorum.sh --metadata --describe` and later
support `kafka-quorum.sh --topic-partitions foo-0,foo-1 --describe`.
Obviously you don't need `--metadata` (it could be the default if
`--topic-partitions` was omitted), it's more a point about the name of the
tool.

Again in this section, on the one hand the list says `--describe` supports
two arguments (status and replication).
On the other hand the example shows a plain, argumentless `--describe`.


13. § Metrics

i. I wonder about including `raft` in the metric names, since it's not
really exposed elsewhere in the public API.

ii. Similarly to the tool/script, we end up with metrics which are
implicitly about a partition. If we use quorums for other partitions in the
future we'll need to replicate a lot of these metrics, which would make it
harder to abstract across collections of metrics.


Thanks again for the KIP,

Tom



On Mon, Jul 13, 2020 at 5:50 PM Jason Gustafson <[email protected]> wrote:

> Hi All,
>
> Just a quick update on the proposal. We have decided to move quorum
> reassignment to a separate KIP:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-642%3A+Dynamic+quorum+reassignment
> .
> The way this ties into cluster bootstrapping is complicated, so we felt we
> needed a bit more time for validation. That leaves the core of this
> proposal as quorum-based replication. If there are no further comments, we
> will plan to start a vote later this week.
>
> Thanks,
> Jason
>
> On Wed, Jun 24, 2020 at 10:43 AM Guozhang Wang <[email protected]> wrote:
>
> > @Jun Rao <[email protected]>
> >
> > Regarding your comment about log compaction. After some deep-diving into
> > this we've decided to propose a new snapshot-based log cleaning mechanism
> > which would be used to replace the current compaction mechanism for this
> > meta log. A new KIP will be proposed specifically for this idea.
> >
> > All,
> >
> > I've updated the KIP wiki a bit updating one config "
> > election.jitter.max.ms"
> > to "election.backoff.max.ms" to make it more clear about the usage: the
> > configured value will be the upper bound of the binary exponential
> backoff
> > time after a failed election, before starting a new one.
> >
> >
> >
> > Guozhang
> >
> >
> >
> > On Fri, Jun 12, 2020 at 9:26 AM Boyang Chen <[email protected]>
> > wrote:
> >
> > > Thanks for the suggestions Guozhang.
> > >
> > > On Thu, Jun 11, 2020 at 2:51 PM Guozhang Wang <[email protected]>
> > wrote:
> > >
> > > > Hello Boyang,
> > > >
> > > > Thanks for the updated information. A few questions here:
> > > >
> > > > 1) Should the quorum-file also update to support multi-raft?
> > > >
> > > > I'm neutral about this, as we don't know yet how the multi-raft
> modules
> > > would behave. If
> > > we have different threads operating different raft groups,
> consolidating
> > > the `checkpoint` files seems
> > > not reasonable. We could always add `multi-quorum-file` later if
> > possible.
> > >
> > > 2) In the previous proposal, there's fields in the FetchQuorumRecords
> > like
> > > > latestDirtyOffset, is that dropped intentionally?
> > > >
> > > > I dropped the latestDirtyOffset since it is associated with the log
> > > compaction discussion. This is beyond this KIP scope and we could
> > > potentially get a separate KIP to talk about it.
> > >
> > >
> > > > 3) I think we also need to elaborate a bit more details regarding
> when
> > to
> > > > send metadata request and discover-brokers; currently we only
> discussed
> > > > during bootstrap how these requests would be sent. I think the
> > following
> > > > scenarios would also need these requests
> > > >
> > > > 3.a) As long as a broker does not know the current quorum (including
> > the
> > > > leader and the voters), it should continue periodically ask other
> > brokers
> > > > via "metadata.
> > > > 3.b) As long as a broker does not know all the current quorum voter's
> > > > connections, it should continue periodically ask other brokers via
> > > > "discover-brokers".
> > > > 3.c) When the leader's fetch timeout elapsed, it should send metadata
> > > > request.
> > > >
> > > > Make sense, will add to the KIP.
> > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Wed, Jun 10, 2020 at 5:20 PM Boyang Chen <
> > [email protected]>
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > follow-up on the previous email, we made some more updates:
> > > > >
> > > > > 1. The Alter/DescribeQuorum RPCs are also re-structured to use
> > > > multi-raft.
> > > > >
> > > > > 2. We add observer status into the DescribeQuorumResponse as we see
> > it
> > > > is a
> > > > > low hanging fruit which is very useful for user debugging and
> > > > reassignment.
> > > > >
> > > > > 3. The FindQuorum RPC is replaced with DiscoverBrokers RPC, which
> is
> > > > purely
> > > > > in charge of discovering broker connections in a gossip manner. The
> > > > quorum
> > > > > leader discovery is piggy-back on the Metadata RPC for the topic
> > > > partition
> > > > > leader, which in our case is the single metadata partition for the
> > > > version
> > > > > one.
> > > > >
> > > > > Let me know if you have any questions.
> > > > >
> > > > > Boyang
> > > > >
> > > > > On Tue, Jun 9, 2020 at 11:01 PM Boyang Chen <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hey all,
> > > > > >
> > > > > > Thanks for the great discussions so far. I'm posting some KIP
> > updates
> > > > > from
> > > > > > our working group discussion:
> > > > > >
> > > > > > 1. We will be changing the core RPCs from single-raft API to
> > > > multi-raft.
> > > > > > This means all protocols will be "batch" in the first version,
> but
> > > the
> > > > > KIP
> > > > > > itself only illustrates the design for a single metadata topic
> > > > partition.
> > > > > > The reason is to "keep the door open" for future extensions of
> this
> > > > piece
> > > > > > of module such as a sharded controller or general quorum based
> > topic
> > > > > > replication, beyond the current Kafka replication protocol.
> > > > > >
> > > > > > 2. We will piggy-back on the current Kafka Fetch API instead of
> > > > inventing
> > > > > > a new FetchQuorumRecords RPC. The motivation is about the same as
> > #1
> > > as
> > > > > > well as making the integration work easier, instead of letting
> two
> > > > > similar
> > > > > > RPCs diverge.
> > > > > >
> > > > > > 3. In the EndQuorumEpoch protocol, instead of only sending the
> > > request
> > > > to
> > > > > > the most caught-up voter, we shall broadcast the information to
> all
> > > > > voters,
> > > > > > with a sorted voter list in descending order of their
> corresponding
> > > > > > replicated offset. In this way, the top voter will become a
> > candidate
> > > > > > immediately, while the other voters shall wait for an exponential
> > > > > back-off
> > > > > > to trigger elections, which helps ensure the top voter gets
> > elected,
> > > > and
> > > > > > the election eventually happens when the top voter is not
> > responsive.
> > > > > >
> > > > > > Please see the updated KIP and post any questions or concerns on
> > the
> > > > > > mailing thread.
> > > > > >
> > > > > > Boyang
> > > > > >
> > > > > > On Fri, May 8, 2020 at 5:26 PM Jun Rao <[email protected]> wrote:
> > > > > >
> > > > > >> Hi, Guozhang and Jason,
> > > > > >>
> > > > > >> Thanks for the reply. A couple of more replies.
> > > > > >>
> > > > > >> 102. Still not sure about this. How is the tombstone issue
> > addressed
> > > > in
> > > > > >> the
> > > > > >> non-voter and the observer.  They can die at any point and
> restart
> > > at
> > > > an
> > > > > >> arbitrary later time, and the advancing of the firstDirty offset
> > and
> > > > the
> > > > > >> removal of the tombstone can happen independently.
> > > > > >>
> > > > > >> 106. I agree that it would be less confusing if we used "epoch"
> > > > instead
> > > > > of
> > > > > >> "leader epoch" consistently.
> > > > > >>
> > > > > >> Jun
> > > > > >>
> > > > > >> On Thu, May 7, 2020 at 4:04 PM Guozhang Wang <
> [email protected]>
> > > > > wrote:
> > > > > >>
> > > > > >> > Thanks Jun. Further replies are in-lined.
> > > > > >> >
> > > > > >> > On Mon, May 4, 2020 at 11:58 AM Jun Rao <[email protected]>
> > wrote:
> > > > > >> >
> > > > > >> > > Hi, Guozhang,
> > > > > >> > >
> > > > > >> > > Thanks for the reply. A few more replies inlined below.
> > > > > >> > >
> > > > > >> > > On Sun, May 3, 2020 at 6:33 PM Guozhang Wang <
> > > [email protected]>
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > > > Hello Jun,
> > > > > >> > > >
> > > > > >> > > > Thanks for your comments! I'm replying inline below:
> > > > > >> > > >
> > > > > >> > > > On Fri, May 1, 2020 at 12:36 PM Jun Rao <[email protected]
> >
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > 101. Bootstrapping related issues.
> > > > > >> > > > > 101.1 Currently, we support auto broker id generation.
> Is
> > > this
> > > > > >> > > supported
> > > > > >> > > > > for bootstrap brokers?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > The vote ids would just be the broker ids.
> > "bootstrap.servers"
> > > > > >> would be
> > > > > >> > > > similar to what client configs have today, where
> > > "quorum.voters"
> > > > > >> would
> > > > > >> > be
> > > > > >> > > > pre-defined config values.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > My question was on the auto generated broker id. Currently,
> > the
> > > > > broker
> > > > > >> > can
> > > > > >> > > choose to have its broker Id auto generated. The generation
> is
> > > > done
> > > > > >> > through
> > > > > >> > > ZK to guarantee uniqueness. Without ZK, it's not clear how
> the
> > > > > broker
> > > > > >> id
> > > > > >> > is
> > > > > >> > > auto generated. "quorum.voters" also can't be set statically
> > if
> > > > > broker
> > > > > >> > ids
> > > > > >> > > are auto generated.
> > > > > >> > >
> > > > > >> > > Jason has explained some ideas that we've discussed so far,
> > the
> > > > > >> reason we
> > > > > >> > intentional did not include them so far is that we feel it is
> > > > out-side
> > > > > >> the
> > > > > >> > scope of KIP-595. Under the umbrella of KIP-500 we should
> > > definitely
> > > > > >> > address them though.
> > > > > >> >
> > > > > >> > On the high-level, our belief is that "joining a quorum" and
> > > > "joining
> > > > > >> (or
> > > > > >> > more specifically, registering brokers in) the cluster" would
> be
> > > > > >> > de-coupled a bit, where the former should be completed before
> we
> > > do
> > > > > the
> > > > > >> > latter. More specifically, assuming the quorum is already up
> and
> > > > > >> running,
> > > > > >> > after the newly started broker found the leader of the quorum
> it
> > > can
> > > > > >> send a
> > > > > >> > specific RegisterBroker request including its listener /
> > protocol
> > > /
> > > > > etc,
> > > > > >> > and upon handling it the leader can send back the uniquely
> > > generated
> > > > > >> broker
> > > > > >> > id to the new broker, while also executing the
> "startNewBroker"
> > > > > >> callback as
> > > > > >> > the controller.
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > > > 102. Log compaction. One weak spot of log compaction is
> > for
> > > > the
> > > > > >> > > consumer
> > > > > >> > > > to
> > > > > >> > > > > deal with deletes. When a key is deleted, it's retained
> > as a
> > > > > >> > tombstone
> > > > > >> > > > > first and then physically removed. If a client misses
> the
> > > > > >> tombstone
> > > > > >> > > > > (because it's physically removed), it may not be able to
> > > > update
> > > > > >> its
> > > > > >> > > > > metadata properly. The way we solve this in Kafka is
> based
> > > on
> > > > a
> > > > > >> > > > > configuration (log.cleaner.delete.retention.ms) and we
> > > > expect a
> > > > > >> > > consumer
> > > > > >> > > > > having seen an old key to finish reading the deletion
> > > > tombstone
> > > > > >> > within
> > > > > >> > > > that
> > > > > >> > > > > time. There is no strong guarantee for that since a
> broker
> > > > could
> > > > > >> be
> > > > > >> > > down
> > > > > >> > > > > for a long time. It would be better if we can have a
> more
> > > > > reliable
> > > > > >> > way
> > > > > >> > > of
> > > > > >> > > > > dealing with deletes.
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > We propose to capture this in the "FirstDirtyOffset" field
> > of
> > > > the
> > > > > >> > quorum
> > > > > >> > > > record fetch response: the offset is the maximum offset
> that
> > > log
> > > > > >> > > compaction
> > > > > >> > > > has reached up to. If the follower has fetched beyond this
> > > > offset
> > > > > it
> > > > > >> > > means
> > > > > >> > > > itself is safe hence it has seen all records up to that
> > > offset.
> > > > On
> > > > > >> > > getting
> > > > > >> > > > the response, the follower can then decide if its end
> offset
> > > > > >> actually
> > > > > >> > > below
> > > > > >> > > > that dirty offset (and hence may miss some tombstones). If
> > > > that's
> > > > > >> the
> > > > > >> > > case:
> > > > > >> > > >
> > > > > >> > > > 1) Naively, it could re-bootstrap metadata log from the
> very
> > > > > >> beginning
> > > > > >> > to
> > > > > >> > > > catch up.
> > > > > >> > > > 2) During that time, it would refrain itself from
> answering
> > > > > >> > > MetadataRequest
> > > > > >> > > > from any clients.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > I am not sure that the "FirstDirtyOffset" field fully
> > addresses
> > > > the
> > > > > >> > issue.
> > > > > >> > > Currently, the deletion tombstone is not removed immediately
> > > > after a
> > > > > >> > round
> > > > > >> > > of cleaning. It's removed after a delay in a subsequent
> round
> > of
> > > > > >> > cleaning.
> > > > > >> > > Consider an example where a key insertion is at offset 200
> > and a
> > > > > >> deletion
> > > > > >> > > tombstone of the key is at 400. Initially, FirstDirtyOffset
> is
> > > at
> > > > > >> 300. A
> > > > > >> > > follower/observer fetches from offset 0  and fetches the key
> > at
> > > > > offset
> > > > > >> > 200.
> > > > > >> > > A few rounds of cleaning happen. FirstDirtyOffset is at 500
> > and
> > > > the
> > > > > >> > > tombstone at 400 is physically removed. The
> follower/observer
> > > > > >> continues
> > > > > >> > the
> > > > > >> > > fetch, but misses offset 400. It catches all the way to
> > > > > >> FirstDirtyOffset
> > > > > >> > > and declares its metadata as ready. However, its metadata
> > could
> > > be
> > > > > >> stale
> > > > > >> > > since it actually misses the deletion of the key.
> > > > > >> > >
> > > > > >> > > Yeah good question, I should have put more details in my
> > > > explanation
> > > > > >> :)
> > > > > >> >
> > > > > >> > The idea is that we will adjust the log compaction for this
> raft
> > > > based
> > > > > >> > metadata log: before more details to be explained, since we
> have
> > > two
> > > > > >> types
> > > > > >> > of "watermarks" here, whereas in Kafka the watermark indicates
> > > where
> > > > > >> every
> > > > > >> > replica have replicated up to and in Raft the watermark
> > indicates
> > > > > where
> > > > > >> the
> > > > > >> > majority of replicas (here only indicating voters of the
> quorum,
> > > not
> > > > > >> > counting observers) have replicated up to, let's call them
> Kafka
> > > > > >> watermark
> > > > > >> > and Raft watermark. For this special log, we would maintain
> both
> > > > > >> > watermarks.
> > > > > >> >
> > > > > >> > When log compacting on the leader, we would only compact up to
> > the
> > > > > Kafka
> > > > > >> > watermark, i.e. if there is at least one voter who have not
> > > > replicated
> > > > > >> an
> > > > > >> > entry, it would not be compacted. The "dirty-offset" is the
> > offset
> > > > > that
> > > > > >> > we've compacted up to and is communicated to other voters, and
> > the
> > > > > other
> > > > > >> > voters would also compact up to this value --- i.e. the
> > difference
> > > > > here
> > > > > >> is
> > > > > >> > that instead of letting each replica doing log compaction
> > > > > independently,
> > > > > >> > we'll have the leader to decide upon which offset to compact
> to,
> > > and
> > > > > >> > propagate this value to others to follow, in a more
> coordinated
> > > > > manner.
> > > > > >> > Also note when there are new voters joining the quorum who has
> > not
> > > > > >> > replicated up to the dirty-offset, of because of other issues
> > they
> > > > > >> > truncated their logs to below the dirty-offset, they'd have to
> > > > > >> re-bootstrap
> > > > > >> > from the beginning, and during this period of time the leader
> > > > learned
> > > > > >> about
> > > > > >> > this lagging voter would not advance the watermark (also it
> > would
> > > > not
> > > > > >> > decrement it), and hence not compacting either, until the
> > voter(s)
> > > > has
> > > > > >> > caught up to that dirty-offset.
> > > > > >> >
> > > > > >> > So back to your example above, before the bootstrap voter gets
> > to
> > > > 300
> > > > > no
> > > > > >> > log compaction would happen on the leader; and until later
> when
> > > the
> > > > > >> voter
> > > > > >> > have got to beyond 400 and hence replicated that tombstone,
> the
> > > log
> > > > > >> > compaction would possibly get to that tombstone and remove it.
> > Say
> > > > > >> later it
> > > > > >> > the leader's log compaction reaches 500, it can send this back
> > to
> > > > the
> > > > > >> voter
> > > > > >> > who can then also compact locally up to 500.
> > > > > >> >
> > > > > >> >
> > > > > >> > > > > 105. Quorum State: In addition to VotedId, do we need
> the
> > > > epoch
> > > > > >> > > > > corresponding to VotedId? Over time, the same broker Id
> > > could
> > > > be
> > > > > >> > voted
> > > > > >> > > in
> > > > > >> > > > > different generations with different epoch.
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > Hmm, this is a good point. Originally I think the
> > > "LeaderEpoch"
> > > > > >> field
> > > > > >> > in
> > > > > >> > > > that file is corresponding to the "latest known leader
> > epoch",
> > > > not
> > > > > >> the
> > > > > >> > > > "current leader epoch". For example, if the current epoch
> is
> > > N,
> > > > > and
> > > > > >> > then
> > > > > >> > > a
> > > > > >> > > > vote-request with epoch N+1 is received and the voter
> > granted
> > > > the
> > > > > >> vote
> > > > > >> > > for
> > > > > >> > > > it, then it means for this voter it knows the "latest
> epoch"
> > > is
> > > > N
> > > > > +
> > > > > >> 1
> > > > > >> > > > although it is unknown if that sending candidate will
> indeed
> > > > > become
> > > > > >> the
> > > > > >> > > new
> > > > > >> > > > leader (which would only be notified via begin-quorum
> > > request).
> > > > > >> > However,
> > > > > >> > > > when persisting the quorum state, we would encode
> > leader-epoch
> > > > to
> > > > > >> N+1,
> > > > > >> > > > while the leaderId to be the older leader.
> > > > > >> > > >
> > > > > >> > > > But now thinking about this a bit more, I feel we should
> use
> > > two
> > > > > >> > separate
> > > > > >> > > > epochs, one for the "lates known" and one for the
> "current"
> > to
> > > > > pair
> > > > > >> > with
> > > > > >> > > > the leaderId. I will update the wiki page.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > Hmm, it's kind of weird to bump up the leader epoch before
> the
> > > new
> > > > > >> leader
> > > > > >> > > is actually elected, right.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > > 106. "OFFSET_OUT_OF_RANGE: Used in the
> FetchQuorumRecords
> > > API
> > > > to
> > > > > >> > > indicate
> > > > > >> > > > > that the follower has fetched from an invalid offset and
> > > > should
> > > > > >> > > truncate
> > > > > >> > > > to
> > > > > >> > > > > the offset/epoch indicated in the response." Observers
> > can't
> > > > > >> truncate
> > > > > >> > > > their
> > > > > >> > > > > logs. What should they do with OFFSET_OUT_OF_RANGE?
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > I'm not sure if I understand your question? Observers
> should
> > > > still
> > > > > >> be
> > > > > >> > > able
> > > > > >> > > > to truncate their logs as well.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > Hmm, I thought only the quorum nodes have local logs and
> > > observers
> > > > > >> don't?
> > > > > >> > >
> > > > > >> > > > 107. "The leader will continue sending BeginQuorumEpoch to
> > > each
> > > > > >> known
> > > > > >> > > > voter
> > > > > >> > > > > until it has received its endorsement." If a voter is
> down
> > > > for a
> > > > > >> long
> > > > > >> > > > time,
> > > > > >> > > > > sending BeginQuorumEpoch seems to add unnecessary
> > overhead.
> > > > > >> > Similarly,
> > > > > >> > > > if a
> > > > > >> > > > > follower stops sending FetchQuorumRecords, does the
> leader
> > > > keep
> > > > > >> > sending
> > > > > >> > > > > BeginQuorumEpoch?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > Regarding BeginQuorumEpoch: that is a good point. The
> > > > > >> > begin-quorum-epoch
> > > > > >> > > > request is for voters to quickly get the new leader
> > > information;
> > > > > >> > however
> > > > > >> > > > even if they do not get them they can still eventually
> learn
> > > > about
> > > > > >> that
> > > > > >> > > > from others via gossiping FindQuorum. I think we can
> adjust
> > > the
> > > > > >> logic
> > > > > >> > to
> > > > > >> > > > e.g. exponential back-off or with a limited num.retries.
> > > > > >> > > >
> > > > > >> > > > Regarding FetchQuorumRecords: if the follower sends
> > > > > >> FetchQuorumRecords
> > > > > >> > > > already, it means that follower already knows that the
> > broker
> > > is
> > > > > the
> > > > > >> > > > leader, and hence we can stop retrying BeginQuorumEpoch;
> > > however
> > > > > it
> > > > > >> is
> > > > > >> > > > possible that after a follower sends FetchQuorumRecords
> > > already,
> > > > > >> > suddenly
> > > > > >> > > > it stops send it (possibly because it learned about a
> higher
> > > > epoch
> > > > > >> > > leader),
> > > > > >> > > > and hence this broker may be a "zombie" leader and we
> > propose
> > > to
> > > > > use
> > > > > >> > the
> > > > > >> > > > fetch.timeout to let the leader to try to verify if it has
> > > > already
> > > > > >> been
> > > > > >> > > > stale.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > It just seems that we should handle these two cases in a
> > > > consistent
> > > > > >> way?
> > > > > >> > >
> > > > > >> > > Yes I agree, on the leader's side, the FetchQuorumRecords
> > from a
> > > > > >> follower
> > > > > >> > could mean that we no longer needs to send BeginQuorumEpoch
> > > anymore
> > > > > ---
> > > > > >> and
> > > > > >> > it is already part of our current implementations in
> > > > > >> > https://github.com/confluentinc/kafka/commits/kafka-raft
> > > > > >> >
> > > > > >> >
> > > > > >> > > Thanks,
> > > > > >> > >
> > > > > >> > > Jun
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Wed, Apr 29, 2020 at 8:56 PM Guozhang Wang <
> > > > > [email protected]
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hello Leonard,
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks for your comments, I'm relying in line below:
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed, Apr 29, 2020 at 1:58 AM Wang (Leonard) Ge <
> > > > > >> > [email protected]>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi Kafka developers,
> > > > > >> > > > > > >
> > > > > >> > > > > > > It's great to see this proposal and it took me some
> > time
> > > > to
> > > > > >> > finish
> > > > > >> > > > > > reading
> > > > > >> > > > > > > it.
> > > > > >> > > > > > >
> > > > > >> > > > > > > And I have the following questions about the
> Proposal:
> > > > > >> > > > > > >
> > > > > >> > > > > > >    - How do we plan to test this design to ensure
> its
> > > > > >> > correctness?
> > > > > >> > > Or
> > > > > >> > > > > > more
> > > > > >> > > > > > >    broadly, how do we ensure that our new ‘pull’
> based
> > > > model
> > > > > >> is
> > > > > >> > > > > > functional
> > > > > >> > > > > > > and
> > > > > >> > > > > > >    correct given that it is different from the
> > original
> > > > RAFT
> > > > > >> > > > > > implementation
> > > > > >> > > > > > >    which has formal proof of correctness?
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > We have two planned verifications on the correctness
> and
> > > > > >> liveness
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > > design. One is via model verification (TLA+)
> > > > > >> > > > > > https://github.com/guozhangwang/kafka-specification
> > > > > >> > > > > >
> > > > > >> > > > > > Another is via the concurrent simulation tests
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
> > > > > >> > > > > >
> > > > > >> > > > > >    - Have we considered any sensible defaults for the
> > > > > >> > configuration,
> > > > > >> > > > i.e.
> > > > > >> > > > > > >    all the election timeout, fetch time out, etc.?
> Or
> > we
> > > > > want
> > > > > >> to
> > > > > >> > > > leave
> > > > > >> > > > > > > this to
> > > > > >> > > > > > >    a later stage when we do the performance testing,
> > > etc.
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > This is a good question, the reason we did not set any
> > > > default
> > > > > >> > values
> > > > > >> > > > for
> > > > > >> > > > > > the timeout configurations is that we think it may
> take
> > > some
> > > > > >> > > > benchmarking
> > > > > >> > > > > > experiments to get these defaults right. Some
> high-level
> > > > > >> principles
> > > > > >> > > to
> > > > > >> > > > > > consider: 1) the fetch.timeout should be around the
> same
> > > > scale
> > > > > >> with
> > > > > >> > > zk
> > > > > >> > > > > > session timeout, which is now 18 seconds by default --
> > in
> > > > > >> practice
> > > > > >> > > > we've
> > > > > >> > > > > > seen unstable networks having more than 10 secs of
> > > transient
> > > > > >> > > > > connectivity,
> > > > > >> > > > > > 2) the election.timeout, however, should be smaller
> than
> > > the
> > > > > >> fetch
> > > > > >> > > > > timeout
> > > > > >> > > > > > as is also suggested as a practical optimization in
> > > > > literature:
> > > > > >> > > > > >
> > > > https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
> > > > > >> > > > > >
> > > > > >> > > > > > Some more discussions can be found here:
> > > > > >> > > > > >
> > > > > https://github.com/confluentinc/kafka/pull/301/files#r415420081
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > >    - Have we considered piggybacking
> > `BeginQuorumEpoch`
> > > > with
> > > > > >> the
> > > > > >> > `
> > > > > >> > > > > > >    FetchQuorumRecords`? I might be missing something
> > > > obvious
> > > > > >> but
> > > > > >> > I
> > > > > >> > > am
> > > > > >> > > > > > just
> > > > > >> > > > > > >    wondering why don’t we just use the `FindQuorum`
> > and
> > > > > >> > > > > > > `FetchQuorumRecords`
> > > > > >> > > > > > >    APIs and remove the `BeginQuorumEpoch` API?
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > Note that Begin/EndQuorumEpoch is sent from leader ->
> > > other
> > > > > >> voter
> > > > > >> > > > > > followers, while FindQuorum / Fetch are sent from
> > follower
> > > > to
> > > > > >> > leader.
> > > > > >> > > > > > Arguably one can eventually realize the new leader and
> > > epoch
> > > > > via
> > > > > >> > > > > gossiping
> > > > > >> > > > > > FindQuorum, but that could in practice require a long
> > > delay.
> > > > > >> > Having a
> > > > > >> > > > > > leader -> other voters request helps the new leader
> > epoch
> > > to
> > > > > be
> > > > > >> > > > > propagated
> > > > > >> > > > > > faster under a pull model.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > >    - And about the `FetchQuorumRecords` response
> > schema,
> > > > in
> > > > > >> the
> > > > > >> > > > > `Records`
> > > > > >> > > > > > >    field of the response, is it just one record or
> all
> > > the
> > > > > >> > records
> > > > > >> > > > > > starting
> > > > > >> > > > > > >    from the FetchOffset? It seems a lot more
> efficient
> > > if
> > > > we
> > > > > >> sent
> > > > > >> > > all
> > > > > >> > > > > the
> > > > > >> > > > > > >    records during the bootstrapping of the brokers.
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > Yes the fetching is batched: FetchOffset is just the
> > > > starting
> > > > > >> > offset
> > > > > >> > > of
> > > > > >> > > > > the
> > > > > >> > > > > > batch of records.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > >    - Regarding the disruptive broker issues, does
> our
> > > pull
> > > > > >> based
> > > > > >> > > > model
> > > > > >> > > > > > >    suffer from it? If so, have we considered the
> > > Pre-Vote
> > > > > >> stage?
> > > > > >> > If
> > > > > >> > > > > not,
> > > > > >> > > > > > > why?
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > The disruptive broker is stated in the original Raft
> > paper
> > > > > >> which is
> > > > > >> > > the
> > > > > >> > > > > > result of the push model design. Our analysis showed
> > that
> > > > with
> > > > > >> the
> > > > > >> > > pull
> > > > > >> > > > > > model it is no longer an issue.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > > Thanks a lot for putting this up, and I hope that my
> > > > > questions
> > > > > >> > can
> > > > > >> > > be
> > > > > >> > > > > of
> > > > > >> > > > > > > some value to make this KIP better.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Hope to hear from you soon!
> > > > > >> > > > > > >
> > > > > >> > > > > > > Best wishes,
> > > > > >> > > > > > > Leonard
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Wed, Apr 29, 2020 at 1:46 AM Colin McCabe <
> > > > > >> [email protected]
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Hi Jason,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > It's amazing to see this coming together :)
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > I haven't had a chance to read in detail, but I
> read
> > > the
> > > > > >> > outline
> > > > > >> > > > and
> > > > > >> > > > > a
> > > > > >> > > > > > > few
> > > > > >> > > > > > > > things jumped out at me.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > First, for every epoch that is 32 bits rather than
> > > 64, I
> > > > > >> sort
> > > > > >> > of
> > > > > >> > > > > wonder
> > > > > >> > > > > > > if
> > > > > >> > > > > > > > that's a good long-term choice.  I keep reading
> > about
> > > > > stuff
> > > > > >> > like
> > > > > >> > > > > this:
> > > > > >> > > > > > > >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> > > .
> > > > > >> > > Obviously,
> > > > > >> > > > > > that
> > > > > >> > > > > > > > JIRA is about zxid, which increments much faster
> > than
> > > we
> > > > > >> expect
> > > > > >> > > > these
> > > > > >> > > > > > > > leader epochs to, but it would still be good to
> see
> > > some
> > > > > >> rough
> > > > > >> > > > > > > calculations
> > > > > >> > > > > > > > about how long 32 bits (or really, 31 bits) will
> > last
> > > us
> > > > > in
> > > > > >> the
> > > > > >> > > > cases
> > > > > >> > > > > > > where
> > > > > >> > > > > > > > we're using it, and what the space savings we're
> > > getting
> > > > > >> really
> > > > > >> > > is.
> > > > > >> > > > > It
> > > > > >> > > > > > > > seems like in most cases the tradeoff may not be
> > worth
> > > > it?
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Another thing I've been thinking about is how we
> do
> > > > > >> > > > bootstrapping.  I
> > > > > >> > > > > > > > would prefer to be in a world where formatting a
> new
> > > > Kafka
> > > > > >> node
> > > > > >> > > > was a
> > > > > >> > > > > > > first
> > > > > >> > > > > > > > class operation explicitly initiated by the admin,
> > > > rather
> > > > > >> than
> > > > > >> > > > > > something
> > > > > >> > > > > > > > that happened implicitly when you started up the
> > > broker
> > > > > and
> > > > > >> > > things
> > > > > >> > > > > > > "looked
> > > > > >> > > > > > > > blank."
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > The first problem is that things can "look blank"
> > > > > >> accidentally
> > > > > >> > if
> > > > > >> > > > the
> > > > > >> > > > > > > > storage system is having a bad day.  Clearly in
> the
> > > > > non-Raft
> > > > > >> > > world,
> > > > > >> > > > > > this
> > > > > >> > > > > > > > leads to data loss if the broker that is
> (re)started
> > > > this
> > > > > >> way
> > > > > >> > was
> > > > > >> > > > the
> > > > > >> > > > > > > > leader for some partitions.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > The second problem is that we have a bit of a
> > chicken
> > > > and
> > > > > >> egg
> > > > > >> > > > problem
> > > > > >> > > > > > > with
> > > > > >> > > > > > > > certain configuration keys.  For example, maybe
> you
> > > want
> > > > > to
> > > > > >> > > > configure
> > > > > >> > > > > > > some
> > > > > >> > > > > > > > connection security settings in your cluster, but
> > you
> > > > > don't
> > > > > >> > want
> > > > > >> > > > them
> > > > > >> > > > > > to
> > > > > >> > > > > > > > ever be stored in a plaintext config file.  (For
> > > > example,
> > > > > >> SCRAM
> > > > > >> > > > > > > passwords,
> > > > > >> > > > > > > > etc.)  You could use a broker API to set the
> > > > > configuration,
> > > > > >> but
> > > > > >> > > > that
> > > > > >> > > > > > > brings
> > > > > >> > > > > > > > up the chicken and egg problem.  The broker needs
> to
> > > be
> > > > > >> > > configured
> > > > > >> > > > to
> > > > > >> > > > > > > know
> > > > > >> > > > > > > > how to talk to you, but you need to configure it
> > > before
> > > > > you
> > > > > >> can
> > > > > >> > > > talk
> > > > > >> > > > > to
> > > > > >> > > > > > > > it.  Using an external secret manager like Vault
> is
> > > one
> > > > > way
> > > > > >> to
> > > > > >> > > > solve
> > > > > >> > > > > > > this,
> > > > > >> > > > > > > > but not everyone uses an external secret manager.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > quorum.voters seems like a similar configuration
> > key.
> > > > In
> > > > > >> the
> > > > > >> > > > current
> > > > > >> > > > > > > KIP,
> > > > > >> > > > > > > > this is only read if there is no other
> configuration
> > > > > >> specifying
> > > > > >> > > the
> > > > > >> > > > > > > quorum
> > > > > >> > > > > > > > voter set.  If we had a kafka.mkfs command, we
> > > wouldn't
> > > > > need
> > > > > >> > this
> > > > > >> > > > key
> > > > > >> > > > > > > > because we could assume that there was always
> quorum
> > > > > >> > information
> > > > > >> > > > > stored
> > > > > >> > > > > > > > locally.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > best,
> > > > > >> > > > > > > > Colin
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Thu, Apr 16, 2020, at 16:44, Jason Gustafson
> > wrote:
> > > > > >> > > > > > > > > Hi All,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I'd like to start a discussion on KIP-595:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > > > >> > > > > > > > .
> > > > > >> > > > > > > > > This proposal specifies a Raft protocol to
> > > ultimately
> > > > > >> replace
> > > > > >> > > > > > Zookeeper
> > > > > >> > > > > > > > > as
> > > > > >> > > > > > > > > documented in KIP-500. Please take a look and
> > share
> > > > your
> > > > > >> > > > thoughts.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > A few minor notes to set the stage a little bit:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - This KIP does not specify the structure of the
> > > > > messages
> > > > > >> > used
> > > > > >> > > to
> > > > > >> > > > > > > > represent
> > > > > >> > > > > > > > > metadata in Kafka, nor does it specify the
> > internal
> > > > API
> > > > > >> that
> > > > > >> > > will
> > > > > >> > > > > be
> > > > > >> > > > > > > used
> > > > > >> > > > > > > > > by the controller. Expect these to come in later
> > > > > >> proposals.
> > > > > >> > > Here
> > > > > >> > > > we
> > > > > >> > > > > > are
> > > > > >> > > > > > > > > primarily concerned with the replication
> protocol
> > > and
> > > > > >> basic
> > > > > >> > > > > > operational
> > > > > >> > > > > > > > > mechanics.
> > > > > >> > > > > > > > > - We expect many details to change as we get
> > closer
> > > to
> > > > > >> > > > integration
> > > > > >> > > > > > with
> > > > > >> > > > > > > > > the controller. Any changes we make will be made
> > > > either
> > > > > as
> > > > > >> > > > > amendments
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > > this KIP or, in the case of larger changes, as
> new
> > > > > >> proposals.
> > > > > >> > > > > > > > > - We have a prototype implementation which I
> will
> > > put
> > > > > >> online
> > > > > >> > > > within
> > > > > >> > > > > > the
> > > > > >> > > > > > > > > next week which may help in understanding some
> > > > details.
> > > > > It
> > > > > >> > has
> > > > > >> > > > > > > diverged a
> > > > > >> > > > > > > > > little bit from our proposal, so I am taking a
> > > little
> > > > > >> time to
> > > > > >> > > > bring
> > > > > >> > > > > > it
> > > > > >> > > > > > > in
> > > > > >> > > > > > > > > line. I'll post an update to this thread when it
> > is
> > > > > >> available
> > > > > >> > > for
> > > > > >> > > > > > > review.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Finally, I want to mention that this proposal
> was
> > > > > drafted
> > > > > >> by
> > > > > >> > > > > myself,
> > > > > >> > > > > > > > Boyang
> > > > > >> > > > > > > > > Chen, and Guozhang Wang.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks,
> > > > > >> > > > > > > > > Jason
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > --
> > > > > >> > > > > > > Leonard Ge
> > > > > >> > > > > > > Software Engineer Intern - Confluent
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > -- Guozhang
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > -- Guozhang
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > -- Guozhang
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-595: A Raft Protocol for the Metadata Quorum

Reply via email to