Re: [DISCUSS] KIP-595: A Raft Protocol for the Metadata Quorum

Jason Gustafson Wed, 29 Jul 2020 12:20:43 -0700

Hey Jun,

I added a section on "Cluster Bootstrapping" which discusses clusterId
generation and the process through which brokers find the current leader.
The quick summary is that the first controller will be responsible for
generating the clusterId and persisting it in the metadata log. Before the
first leader has been elected, quorum APIs will skip clusterId validation.
This seems reasonable since this is primarily intended to prevent the
damage from misconfiguration after a cluster has been running for some
time. Upon startup, brokers begin by sending Fetch requests to find the
current leader. This will include the cluster.id from meta.properties if it
is present. The broker will shutdown immediately if it receives
INVALID_CLUSTER_ID from the Fetch response.


I also added some details about our testing strategy, which you asked about
previously.

Thanks,
Jason

On Mon, Jul 27, 2020 at 10:46 PM Boyang Chen <reluctanthero...@gmail.com>
wrote:

> On Mon, Jul 27, 2020 at 4:58 AM Unmesh Joshi <unmeshjo...@gmail.com>
> wrote:
>
> > Just checked etcd and zookeeper code, and both support leader to step
> down
> > as a follower to make sure there are no two leaders if the leader has
> been
> > disconnected from the majority of the followers
> > For etcd this is https://github.com/etcd-io/etcd/issues/3866
> > For Zookeeper its https://issues.apache.org/jira/browse/ZOOKEEPER-1699
> > I was just thinking if it would be difficult to implement in the Pull
> based
> > model, but I guess not. It is possibly the same way ISR list is managed
> > currently, if leader of the controller quorum loses majority of the
> > followers, it should step down and become follower, that way, telling
> > client in time that it was disconnected from the quorum, and not keep on
> > sending state metadata to clients.
> >
> > Thanks,
> > Unmesh
> >
> >
> > On Mon, Jul 27, 2020 at 9:31 AM Unmesh Joshi <unmeshjo...@gmail.com>
> > wrote:
> >
> > > >>Could you clarify on this question? Which part of the raft group
> > doesn't
> > > >>know about leader dis-connection?
> > > The leader of the controller quorum is partitioned from the controller
> > > cluster, and a different leader is elected for the remaining controller
> > > cluster.
> >
> I see your concern. For KIP-595 implementation, since there is no regular
> heartbeats sent
> from the leader to the followers, we decided to piggy-back on the fetch
> timeout so that if the leader did not receive Fetch
> requests from a majority of the quorum for that amount of time, it would
> begin a new election and
> start sending VoteRequest to voter nodes in the cluster to understand the
> latest quorum. You could
> find more details in this section
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote
> >
> .
>
>
> > > I think there are two things here,
> > > 1.  The old leader will not know if it's disconnected from the rest of
> > the
> > > controller quorum cluster unless it receives BeginQuorumEpoch from the
> > new
> > > leader. So it will keep on serving stale metadata to the clients
> > (Brokers,
> > > Producers and Consumers)
> > > 2. I assume, the Broker Leases will be managed on the controller quorum
> > > leader. This partitioned leader will keep on tracking broker leases it
> > has,
> > > while the new leader of the quorum will also start managing broker
> > leases.
> > > So while the quorum leader is partitioned, there will be two membership
> > > views of the kafka brokers managed on two leaders.
> > > Unless broker heartbeats are also replicated as part of the Raft log,
> > > there is no way to solve this?
> > > I know LogCabin implementation does replicate client heartbeats. I
> > suspect
> > > that the same issue is there in Zookeeper, which does not replicate
> > client
> > > Ping requests..
> > >
> > > Thanks,
> > > Unmesh
> > >
> > >
> > >
> > > On Mon, Jul 27, 2020 at 6:23 AM Boyang Chen <
> reluctanthero...@gmail.com>
> > > wrote:
> > >
> > >> Thanks for the questions Unmesh!
> > >>
> > >> On Sun, Jul 26, 2020 at 6:18 AM Unmesh Joshi <unmeshjo...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > In the FetchRequest Handling, how to make sure we handle scenarios
> > where
> > >> > the leader might have been disconnected from the cluster, but
> doesn't
> > >> know
> > >> > yet?
> > >> >
> > >> Could you clarify on this question? Which part of the raft group
> doesn't
> > >> know about leader
> > >> dis-connection?
> > >>
> > >>
> > >> > As discussed in the Raft Thesis section 6.4, the linearizable
> > semantics
> > >> of
> > >> > read requests is implemented in LogCabin by sending heartbeat to
> > >> followers
> > >> > and waiting till the heartbeats are successful to make sure that the
> > >> leader
> > >> > is still the leader.
> > >> > I think for the controller quorum to make sure none of the consumers
> > get
> > >> > stale data, it's important to have linearizable semantics? In the
> pull
> > >> > based model, the leader will need to wait for heartbeats from the
> > >> followers
> > >> > before returning each fetch request from the consumer then? Or do we
> > >> need
> > >> > to introduce some other request?
> > >> > (Zookeeper does not have linearizable semantics for read requests,
> but
> > >> as
> > >> > of now all the kafka interactions are through writes and watches).
> > >> >
> > >> > This is a very good question. For our v1 implementation we are not
> > >> aiming
> > >> to guarantee linearizable read, which
> > >> would be considered as a follow-up effort. Note that today in Kafka
> > there
> > >> is no guarantee on the metadata freshness either,
> > >> so no regression is introduced.
> > >>
> > >>
> > >> > Thanks,
> > >> > Unmesh
> > >> >
> > >> > On Fri, Jul 24, 2020 at 11:36 PM Jun Rao <j...@confluent.io> wrote:
> > >> >
> > >> > > Hi, Jason,
> > >> > >
> > >> > > Thanks for the reply.
> > >> > >
> > >> > > 101. Sounds good. Regarding clusterId, I am not sure storing it in
> > the
> > >> > > metadata log is enough. For example, the vote request includes
> > >> clusterId.
> > >> > > So, no one can vote until they know the clusterId. Also, it would
> be
> > >> > useful
> > >> > > to support the case when a voter completely loses its disk and
> needs
> > >> to
> > >> > > recover.
> > >> > >
> > >> > > 210. There is no longer a FindQuorum request. When a follower
> > >> restarts,
> > >> > how
> > >> > > does it discover the leader? Is that based on DescribeQuorum? It
> > >> would be
> > >> > > useful to document this.
> > >> > >
> > >> > > Jun
> > >> > >
> > >> > > On Fri, Jul 17, 2020 at 2:15 PM Jason Gustafson <
> ja...@confluent.io
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Jun,
> > >> > > >
> > >> > > > Thanks for the questions.
> > >> > > >
> > >> > > > 101. I am treating some of the bootstrapping problems as out of
> > the
> > >> > scope
> > >> > > > of this KIP. I am working on a separate proposal which addresses
> > >> > > > bootstrapping security credentials specifically. Here is a rough
> > >> sketch
> > >> > > of
> > >> > > > how I am seeing it:
> > >> > > >
> > >> > > > 1. Dynamic broker configurations including encrypted passwords
> > will
> > >> be
> > >> > > > persisted in the metadata log and cached in the broker's
> > >> > > `meta.properties`
> > >> > > > file.
> > >> > > > 2. We will provide a tool which allows users to directly
> override
> > >> the
> > >> > > > values in `meta.properties` without requiring access to the
> > quorum.
> > >> > This
> > >> > > > can be used to bootstrap the credentials of the voter set itself
> > >> before
> > >> > > the
> > >> > > > cluster has been started.
> > >> > > > 3. Some dynamic config changes will only be allowed when a
> broker
> > is
> > >> > > > online. For example, changing a truststore password dynamically
> > >> would
> > >> > > > prevent that broker from being able to start if it were offline
> > when
> > >> > the
> > >> > > > change was made.
> > >> > > > 4. I am still thinking a little bit about SCRAM credentials, but
> > >> most
> > >> > > > likely they will be handled with an approach similar to
> > >> > > `meta.properties`.
> > >> > > >
> > >> > > > 101.3 As for the question about `clusterId`, I think the way we
> > >> would
> > >> > do
> > >> > > > this is to have the first elected leader generate a UUID and
> write
> > >> it
> > >> > to
> > >> > > > the metadata log. Let me add some detail to the proposal about
> > this.
> > >> > > >
> > >> > > > A few additional answers below:
> > >> > > >
> > >> > > > 203. Yes, that is correct.
> > >> > > >
> > >> > > > 204. That is a good question. What happens in this case is that
> > all
> > >> > > voters
> > >> > > > advance their epoch to the one designated by the candidate even
> if
> > >> they
> > >> > > > reject its vote request. Assuming the candidate fails to be
> > elected,
> > >> > the
> > >> > > > election will be retried until a leader emerges.
> > >> > > >
> > >> > > > 205. I had some discussion with Colin offline about this
> problem.
> > I
> > >> > think
> > >> > > > the answer should be "yes," but it probably needs a little more
> > >> > thought.
> > >> > > > Handling JBOD failures is tricky. For an observer, we can
> > replicate
> > >> the
> > >> > > > metadata log from scratch safely in a new log dir. But if the
> log
> > >> dir
> > >> > of
> > >> > > a
> > >> > > > voter fails, I do not think it is generally safe to start from
> an
> > >> empty
> > >> > > > state.
> > >> > > >
> > >> > > > 206. Yes, that is discussed in KIP-631 I believe.
> > >> > > >
> > >> > > > 207. Good suggestion. I will work on this.
> > >> > > >
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Jason
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Thu, Jul 16, 2020 at 3:44 PM Jun Rao <j...@confluent.io>
> wrote:
> > >> > > >
> > >> > > > > Hi, Jason,
> > >> > > > >
> > >> > > > > Thanks for the updated KIP. Looks good overall. A few more
> > >> comments
> > >> > > > below.
> > >> > > > >
> > >> > > > > 101. I still don't see a section on bootstrapping related
> > issues.
> > >> It
> > >> > > > would
> > >> > > > > be useful to document if/how the following is supported.
> > >> > > > > 101.1 Currently, we support auto broker id generation. Is this
> > >> > > supported
> > >> > > > > for bootstrap brokers?
> > >> > > > > 101.2 As Colin mentioned, sometimes we may need to load the
> > >> security
> > >> > > > > credentials to be broker before it can be connected to. Could
> > you
> > >> > > > provide a
> > >> > > > > bit more detail on how this will work?
> > >> > > > > 101.3 Currently, we use ZK to generate clusterId on a new
> > cluster.
> > >> > With
> > >> > > > > Raft, how does every broker generate the same clusterId in a
> > >> > > distributed
> > >> > > > > way?
> > >> > > > >
> > >> > > > > 200. It would be useful to document if the various special
> > offsets
> > >> > (log
> > >> > > > > start offset, recovery point, HWM, etc) for the Raft log are
> > >> stored
> > >> > in
> > >> > > > the
> > >> > > > > same existing checkpoint files or not.
> > >> > > > > 200.1 Since the Raft log flushes every append, does that allow
> > us
> > >> to
> > >> > > > > recover from a recovery point within the active segment or do
> we
> > >> > still
> > >> > > > need
> > >> > > > > to scan the full segment including the recovery point? The
> > former
> > >> can
> > >> > > be
> > >> > > > > tricky since multiple records can fall into the same disk page
> > >> and a
> > >> > > > > subsequent flush may corrupt a page with previously flushed
> > >> records.
> > >> > > > >
> > >> > > > > 201. Configurations.
> > >> > > > > 201.1 How do the Raft brokers get security related configs for
> > >> inter
> > >> > > > broker
> > >> > > > > communication? Is that based on the existing
> > >> > > > > inter.broker.security.protocol?
> > >> > > > > 201.2 We have quorum.retry.backoff.max.ms and
> > >> > quorum.retry.backoff.ms,
> > >> > > > but
> > >> > > > > only quorum.election.backoff.max.ms. This seems a bit
> > >> inconsistent.
> > >> > > > >
> > >> > > > > 202. Metrics:
> > >> > > > > 202.1 TotalTimeMs, InboundQueueTimeMs, HandleTimeMs,
> > >> > > OutboundQueueTimeMs:
> > >> > > > > Are those the same as existing totalTime, requestQueueTime,
> > >> > localTime,
> > >> > > > > responseQueueTime? Could we reuse the existing ones with the
> tag
> > >> > > > > request=[request-type]?
> > >> > > > > 202.2. Could you explain what InboundChannelSize and
> > >> > > OutboundChannelSize
> > >> > > > > are?
> > >> > > > > 202.3 ElectionLatencyMax/Avg: It seems that both should be
> > >> windowed?
> > >> > > > >
> > >> > > > > 203. Quorum State: I assume that LeaderId will be kept
> > >> consistently
> > >> > > with
> > >> > > > > LeaderEpoch. For example, if a follower transitions to
> candidate
> > >> and
> > >> > > > bumps
> > >> > > > > up LeaderEpoch, it will set leaderId to -1 and persist both in
> > the
> > >> > > Quorum
> > >> > > > > state file. Is that correct?
> > >> > > > >
> > >> > > > > 204. I was thinking about a corner case when a Raft broker is
> > >> > > partitioned
> > >> > > > > off. This broker will then be in a continuous loop of bumping
> up
> > >> the
> > >> > > > leader
> > >> > > > > epoch, but failing to get enough votes. When the partitioning
> is
> > >> > > removed,
> > >> > > > > this broker's high leader epoch will force a leader election.
> I
> > >> > assume
> > >> > > > > other Raft brokers can immediately advance their leader epoch
> > >> passing
> > >> > > the
> > >> > > > > already bumped epoch such that leader election won't be
> delayed.
> > >> Is
> > >> > > that
> > >> > > > > right?
> > >> > > > >
> > >> > > > > 205. In a JBOD setting, could we use the existing tool to move
> > the
> > >> > Raft
> > >> > > > log
> > >> > > > > from one disk to another?
> > >> > > > >
> > >> > > > > 206. The KIP doesn't mention the local metadata store derived
> > from
> > >> > the
> > >> > > > Raft
> > >> > > > > log. Will that be covered in a separate KIP?
> > >> > > > >
> > >> > > > > 207. Since this is a critical component. Could we add a
> section
> > on
> > >> > the
> > >> > > > > testing plan for correctness?
> > >> > > > >
> > >> > > > > 208. Performance. Do we plan to do group commit (e.g. buffer
> > >> pending
> > >> > > > > appends during a flush and then flush all accumulated pending
> > >> records
> > >> > > > > together in the next flush) for better throughput?
> > >> > > > >
> > >> > > > > 209. "the leader can actually defer fsync until it knows
> > >> > "quorum.size -
> > >> > > > 1"
> > >> > > > > has get to a certain entry offset." Why is that "quorum.size -
> > 1"
> > >> > > instead
> > >> > > > > of the majority of the quorum?
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Jun
> > >> > > > >
> > >> > > > > On Mon, Jul 13, 2020 at 9:43 AM Jason Gustafson <
> > >> ja...@confluent.io>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi All,
> > >> > > > > >
> > >> > > > > > Just a quick update on the proposal. We have decided to move
> > >> quorum
> > >> > > > > > reassignment to a separate KIP:
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-642%3A+Dynamic+quorum+reassignment
> > >> > > > > > .
> > >> > > > > > The way this ties into cluster bootstrapping is complicated,
> > so
> > >> we
> > >> > > felt
> > >> > > > > we
> > >> > > > > > needed a bit more time for validation. That leaves the core
> of
> > >> this
> > >> > > > > > proposal as quorum-based replication. If there are no
> further
> > >> > > comments,
> > >> > > > > we
> > >> > > > > > will plan to start a vote later this week.
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Jason
> > >> > > > > >
> > >> > > > > > On Wed, Jun 24, 2020 at 10:43 AM Guozhang Wang <
> > >> wangg...@gmail.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > > @Jun Rao <jun...@gmail.com>
> > >> > > > > > >
> > >> > > > > > > Regarding your comment about log compaction. After some
> > >> > deep-diving
> > >> > > > > into
> > >> > > > > > > this we've decided to propose a new snapshot-based log
> > >> cleaning
> > >> > > > > mechanism
> > >> > > > > > > which would be used to replace the current compaction
> > >> mechanism
> > >> > for
> > >> > > > > this
> > >> > > > > > > meta log. A new KIP will be proposed specifically for this
> > >> idea.
> > >> > > > > > >
> > >> > > > > > > All,
> > >> > > > > > >
> > >> > > > > > > I've updated the KIP wiki a bit updating one config "
> > >> > > > > > > election.jitter.max.ms"
> > >> > > > > > > to "election.backoff.max.ms" to make it more clear about
> > the
> > >> > > usage:
> > >> > > > > the
> > >> > > > > > > configured value will be the upper bound of the binary
> > >> > exponential
> > >> > > > > > backoff
> > >> > > > > > > time after a failed election, before starting a new one.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Guozhang
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Fri, Jun 12, 2020 at 9:26 AM Boyang Chen <
> > >> > > > > reluctanthero...@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Thanks for the suggestions Guozhang.
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Jun 11, 2020 at 2:51 PM Guozhang Wang <
> > >> > > wangg...@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hello Boyang,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for the updated information. A few questions
> > here:
> > >> > > > > > > > >
> > >> > > > > > > > > 1) Should the quorum-file also update to support
> > >> multi-raft?
> > >> > > > > > > > >
> > >> > > > > > > > > I'm neutral about this, as we don't know yet how the
> > >> > multi-raft
> > >> > > > > > modules
> > >> > > > > > > > would behave. If
> > >> > > > > > > > we have different threads operating different raft
> groups,
> > >> > > > > > consolidating
> > >> > > > > > > > the `checkpoint` files seems
> > >> > > > > > > > not reasonable. We could always add `multi-quorum-file`
> > >> later
> > >> > if
> > >> > > > > > > possible.
> > >> > > > > > > >
> > >> > > > > > > > 2) In the previous proposal, there's fields in the
> > >> > > > FetchQuorumRecords
> > >> > > > > > > like
> > >> > > > > > > > > latestDirtyOffset, is that dropped intentionally?
> > >> > > > > > > > >
> > >> > > > > > > > > I dropped the latestDirtyOffset since it is associated
> > >> with
> > >> > the
> > >> > > > log
> > >> > > > > > > > compaction discussion. This is beyond this KIP scope and
> > we
> > >> > could
> > >> > > > > > > > potentially get a separate KIP to talk about it.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > > 3) I think we also need to elaborate a bit more
> details
> > >> > > regarding
> > >> > > > > > when
> > >> > > > > > > to
> > >> > > > > > > > > send metadata request and discover-brokers; currently
> we
> > >> only
> > >> > > > > > discussed
> > >> > > > > > > > > during bootstrap how these requests would be sent. I
> > think
> > >> > the
> > >> > > > > > > following
> > >> > > > > > > > > scenarios would also need these requests
> > >> > > > > > > > >
> > >> > > > > > > > > 3.a) As long as a broker does not know the current
> > quorum
> > >> > > > > (including
> > >> > > > > > > the
> > >> > > > > > > > > leader and the voters), it should continue
> periodically
> > >> ask
> > >> > > other
> > >> > > > > > > brokers
> > >> > > > > > > > > via "metadata.
> > >> > > > > > > > > 3.b) As long as a broker does not know all the current
> > >> quorum
> > >> > > > > voter's
> > >> > > > > > > > > connections, it should continue periodically ask other
> > >> > brokers
> > >> > > > via
> > >> > > > > > > > > "discover-brokers".
> > >> > > > > > > > > 3.c) When the leader's fetch timeout elapsed, it
> should
> > >> send
> > >> > > > > metadata
> > >> > > > > > > > > request.
> > >> > > > > > > > >
> > >> > > > > > > > > Make sense, will add to the KIP.
> > >> > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Guozhang
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > On Wed, Jun 10, 2020 at 5:20 PM Boyang Chen <
> > >> > > > > > > reluctanthero...@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hey all,
> > >> > > > > > > > > >
> > >> > > > > > > > > > follow-up on the previous email, we made some more
> > >> updates:
> > >> > > > > > > > > >
> > >> > > > > > > > > > 1. The Alter/DescribeQuorum RPCs are also
> > re-structured
> > >> to
> > >> > > use
> > >> > > > > > > > > multi-raft.
> > >> > > > > > > > > >
> > >> > > > > > > > > > 2. We add observer status into the
> > >> DescribeQuorumResponse
> > >> > as
> > >> > > we
> > >> > > > > see
> > >> > > > > > > it
> > >> > > > > > > > > is a
> > >> > > > > > > > > > low hanging fruit which is very useful for user
> > >> debugging
> > >> > and
> > >> > > > > > > > > reassignment.
> > >> > > > > > > > > >
> > >> > > > > > > > > > 3. The FindQuorum RPC is replaced with
> DiscoverBrokers
> > >> RPC,
> > >> > > > which
> > >> > > > > > is
> > >> > > > > > > > > purely
> > >> > > > > > > > > > in charge of discovering broker connections in a
> > gossip
> > >> > > manner.
> > >> > > > > The
> > >> > > > > > > > > quorum
> > >> > > > > > > > > > leader discovery is piggy-back on the Metadata RPC
> for
> > >> the
> > >> > > > topic
> > >> > > > > > > > > partition
> > >> > > > > > > > > > leader, which in our case is the single metadata
> > >> partition
> > >> > > for
> > >> > > > > the
> > >> > > > > > > > > version
> > >> > > > > > > > > > one.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Let me know if you have any questions.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Boyang
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Tue, Jun 9, 2020 at 11:01 PM Boyang Chen <
> > >> > > > > > > > reluctanthero...@gmail.com>
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hey all,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Thanks for the great discussions so far. I'm
> posting
> > >> some
> > >> > > KIP
> > >> > > > > > > updates
> > >> > > > > > > > > > from
> > >> > > > > > > > > > > our working group discussion:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 1. We will be changing the core RPCs from
> > single-raft
> > >> API
> > >> > > to
> > >> > > > > > > > > multi-raft.
> > >> > > > > > > > > > > This means all protocols will be "batch" in the
> > first
> > >> > > > version,
> > >> > > > > > but
> > >> > > > > > > > the
> > >> > > > > > > > > > KIP
> > >> > > > > > > > > > > itself only illustrates the design for a single
> > >> metadata
> > >> > > > topic
> > >> > > > > > > > > partition.
> > >> > > > > > > > > > > The reason is to "keep the door open" for future
> > >> > extensions
> > >> > > > of
> > >> > > > > > this
> > >> > > > > > > > > piece
> > >> > > > > > > > > > > of module such as a sharded controller or general
> > >> quorum
> > >> > > > based
> > >> > > > > > > topic
> > >> > > > > > > > > > > replication, beyond the current Kafka replication
> > >> > protocol.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 2. We will piggy-back on the current Kafka Fetch
> API
> > >> > > instead
> > >> > > > of
> > >> > > > > > > > > inventing
> > >> > > > > > > > > > > a new FetchQuorumRecords RPC. The motivation is
> > about
> > >> the
> > >> > > > same
> > >> > > > > as
> > >> > > > > > > #1
> > >> > > > > > > > as
> > >> > > > > > > > > > > well as making the integration work easier,
> instead
> > of
> > >> > > > letting
> > >> > > > > > two
> > >> > > > > > > > > > similar
> > >> > > > > > > > > > > RPCs diverge.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > 3. In the EndQuorumEpoch protocol, instead of only
> > >> > sending
> > >> > > > the
> > >> > > > > > > > request
> > >> > > > > > > > > to
> > >> > > > > > > > > > > the most caught-up voter, we shall broadcast the
> > >> > > information
> > >> > > > to
> > >> > > > > > all
> > >> > > > > > > > > > voters,
> > >> > > > > > > > > > > with a sorted voter list in descending order of
> > their
> > >> > > > > > corresponding
> > >> > > > > > > > > > > replicated offset. In this way, the top voter will
> > >> > become a
> > >> > > > > > > candidate
> > >> > > > > > > > > > > immediately, while the other voters shall wait for
> > an
> > >> > > > > exponential
> > >> > > > > > > > > > back-off
> > >> > > > > > > > > > > to trigger elections, which helps ensure the top
> > voter
> > >> > gets
> > >> > > > > > > elected,
> > >> > > > > > > > > and
> > >> > > > > > > > > > > the election eventually happens when the top voter
> > is
> > >> not
> > >> > > > > > > responsive.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Please see the updated KIP and post any questions
> or
> > >> > > concerns
> > >> > > > > on
> > >> > > > > > > the
> > >> > > > > > > > > > > mailing thread.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Boyang
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Fri, May 8, 2020 at 5:26 PM Jun Rao <
> > >> j...@confluent.io
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > >> Hi, Guozhang and Jason,
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> Thanks for the reply. A couple of more replies.
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> 102. Still not sure about this. How is the
> > tombstone
> > >> > issue
> > >> > > > > > > addressed
> > >> > > > > > > > > in
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> non-voter and the observer.  They can die at any
> > >> point
> > >> > and
> > >> > > > > > restart
> > >> > > > > > > > at
> > >> > > > > > > > > an
> > >> > > > > > > > > > >> arbitrary later time, and the advancing of the
> > >> > firstDirty
> > >> > > > > offset
> > >> > > > > > > and
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> removal of the tombstone can happen
> independently.
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> 106. I agree that it would be less confusing if
> we
> > >> used
> > >> > > > > "epoch"
> > >> > > > > > > > > instead
> > >> > > > > > > > > > of
> > >> > > > > > > > > > >> "leader epoch" consistently.
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> Jun
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> On Thu, May 7, 2020 at 4:04 PM Guozhang Wang <
> > >> > > > > > wangg...@gmail.com>
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >> > Thanks Jun. Further replies are in-lined.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > On Mon, May 4, 2020 at 11:58 AM Jun Rao <
> > >> > > j...@confluent.io
> > >> > > > >
> > >> > > > > > > wrote:
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > > Hi, Guozhang,
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > Thanks for the reply. A few more replies
> > inlined
> > >> > > below.
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > On Sun, May 3, 2020 at 6:33 PM Guozhang Wang
> <
> > >> > > > > > > > wangg...@gmail.com>
> > >> > > > > > > > > > >> wrote:
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > > Hello Jun,
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > Thanks for your comments! I'm replying
> inline
> > >> > below:
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > On Fri, May 1, 2020 at 12:36 PM Jun Rao <
> > >> > > > > j...@confluent.io
> > >> > > > > > >
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > > 101. Bootstrapping related issues.
> > >> > > > > > > > > > >> > > > > 101.1 Currently, we support auto broker
> id
> > >> > > > generation.
> > >> > > > > > Is
> > >> > > > > > > > this
> > >> > > > > > > > > > >> > > supported
> > >> > > > > > > > > > >> > > > > for bootstrap brokers?
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > The vote ids would just be the broker ids.
> > >> > > > > > > "bootstrap.servers"
> > >> > > > > > > > > > >> would be
> > >> > > > > > > > > > >> > > > similar to what client configs have today,
> > >> where
> > >> > > > > > > > "quorum.voters"
> > >> > > > > > > > > > >> would
> > >> > > > > > > > > > >> > be
> > >> > > > > > > > > > >> > > > pre-defined config values.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > My question was on the auto generated broker
> > id.
> > >> > > > > Currently,
> > >> > > > > > > the
> > >> > > > > > > > > > broker
> > >> > > > > > > > > > >> > can
> > >> > > > > > > > > > >> > > choose to have its broker Id auto generated.
> > The
> > >> > > > > generation
> > >> > > > > > is
> > >> > > > > > > > > done
> > >> > > > > > > > > > >> > through
> > >> > > > > > > > > > >> > > ZK to guarantee uniqueness. Without ZK, it's
> > not
> > >> > clear
> > >> > > > how
> > >> > > > > > the
> > >> > > > > > > > > > broker
> > >> > > > > > > > > > >> id
> > >> > > > > > > > > > >> > is
> > >> > > > > > > > > > >> > > auto generated. "quorum.voters" also can't be
> > set
> > >> > > > > statically
> > >> > > > > > > if
> > >> > > > > > > > > > broker
> > >> > > > > > > > > > >> > ids
> > >> > > > > > > > > > >> > > are auto generated.
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > Jason has explained some ideas that we've
> > >> discussed
> > >> > so
> > >> > > > > far,
> > >> > > > > > > the
> > >> > > > > > > > > > >> reason we
> > >> > > > > > > > > > >> > intentional did not include them so far is that
> > we
> > >> > feel
> > >> > > it
> > >> > > > > is
> > >> > > > > > > > > out-side
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > scope of KIP-595. Under the umbrella of KIP-500
> > we
> > >> > > should
> > >> > > > > > > > definitely
> > >> > > > > > > > > > >> > address them though.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > On the high-level, our belief is that "joining
> a
> > >> > quorum"
> > >> > > > and
> > >> > > > > > > > > "joining
> > >> > > > > > > > > > >> (or
> > >> > > > > > > > > > >> > more specifically, registering brokers in) the
> > >> > cluster"
> > >> > > > > would
> > >> > > > > > be
> > >> > > > > > > > > > >> > de-coupled a bit, where the former should be
> > >> completed
> > >> > > > > before
> > >> > > > > > we
> > >> > > > > > > > do
> > >> > > > > > > > > > the
> > >> > > > > > > > > > >> > latter. More specifically, assuming the quorum
> is
> > >> > > already
> > >> > > > up
> > >> > > > > > and
> > >> > > > > > > > > > >> running,
> > >> > > > > > > > > > >> > after the newly started broker found the leader
> > of
> > >> the
> > >> > > > > quorum
> > >> > > > > > it
> > >> > > > > > > > can
> > >> > > > > > > > > > >> send a
> > >> > > > > > > > > > >> > specific RegisterBroker request including its
> > >> > listener /
> > >> > > > > > > protocol
> > >> > > > > > > > /
> > >> > > > > > > > > > etc,
> > >> > > > > > > > > > >> > and upon handling it the leader can send back
> the
> > >> > > uniquely
> > >> > > > > > > > generated
> > >> > > > > > > > > > >> broker
> > >> > > > > > > > > > >> > id to the new broker, while also executing the
> > >> > > > > > "startNewBroker"
> > >> > > > > > > > > > >> callback as
> > >> > > > > > > > > > >> > the controller.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > > > 102. Log compaction. One weak spot of log
> > >> > > compaction
> > >> > > > > is
> > >> > > > > > > for
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> > > consumer
> > >> > > > > > > > > > >> > > > to
> > >> > > > > > > > > > >> > > > > deal with deletes. When a key is deleted,
> > >> it's
> > >> > > > > retained
> > >> > > > > > > as a
> > >> > > > > > > > > > >> > tombstone
> > >> > > > > > > > > > >> > > > > first and then physically removed. If a
> > >> client
> > >> > > > misses
> > >> > > > > > the
> > >> > > > > > > > > > >> tombstone
> > >> > > > > > > > > > >> > > > > (because it's physically removed), it may
> > >> not be
> > >> > > > able
> > >> > > > > to
> > >> > > > > > > > > update
> > >> > > > > > > > > > >> its
> > >> > > > > > > > > > >> > > > > metadata properly. The way we solve this
> in
> > >> > Kafka
> > >> > > is
> > >> > > > > > based
> > >> > > > > > > > on
> > >> > > > > > > > > a
> > >> > > > > > > > > > >> > > > > configuration (
> > >> log.cleaner.delete.retention.ms)
> > >> > > and
> > >> > > > > we
> > >> > > > > > > > > expect a
> > >> > > > > > > > > > >> > > consumer
> > >> > > > > > > > > > >> > > > > having seen an old key to finish reading
> > the
> > >> > > > deletion
> > >> > > > > > > > > tombstone
> > >> > > > > > > > > > >> > within
> > >> > > > > > > > > > >> > > > that
> > >> > > > > > > > > > >> > > > > time. There is no strong guarantee for
> that
> > >> > since
> > >> > > a
> > >> > > > > > broker
> > >> > > > > > > > > could
> > >> > > > > > > > > > >> be
> > >> > > > > > > > > > >> > > down
> > >> > > > > > > > > > >> > > > > for a long time. It would be better if we
> > can
> > >> > > have a
> > >> > > > > > more
> > >> > > > > > > > > > reliable
> > >> > > > > > > > > > >> > way
> > >> > > > > > > > > > >> > > of
> > >> > > > > > > > > > >> > > > > dealing with deletes.
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > We propose to capture this in the
> > >> > "FirstDirtyOffset"
> > >> > > > > field
> > >> > > > > > > of
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> > quorum
> > >> > > > > > > > > > >> > > > record fetch response: the offset is the
> > >> maximum
> > >> > > > offset
> > >> > > > > > that
> > >> > > > > > > > log
> > >> > > > > > > > > > >> > > compaction
> > >> > > > > > > > > > >> > > > has reached up to. If the follower has
> > fetched
> > >> > > beyond
> > >> > > > > this
> > >> > > > > > > > > offset
> > >> > > > > > > > > > it
> > >> > > > > > > > > > >> > > means
> > >> > > > > > > > > > >> > > > itself is safe hence it has seen all
> records
> > >> up to
> > >> > > > that
> > >> > > > > > > > offset.
> > >> > > > > > > > > On
> > >> > > > > > > > > > >> > > getting
> > >> > > > > > > > > > >> > > > the response, the follower can then decide
> if
> > >> its
> > >> > > end
> > >> > > > > > offset
> > >> > > > > > > > > > >> actually
> > >> > > > > > > > > > >> > > below
> > >> > > > > > > > > > >> > > > that dirty offset (and hence may miss some
> > >> > > > tombstones).
> > >> > > > > If
> > >> > > > > > > > > that's
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > case:
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > 1) Naively, it could re-bootstrap metadata
> > log
> > >> > from
> > >> > > > the
> > >> > > > > > very
> > >> > > > > > > > > > >> beginning
> > >> > > > > > > > > > >> > to
> > >> > > > > > > > > > >> > > > catch up.
> > >> > > > > > > > > > >> > > > 2) During that time, it would refrain
> itself
> > >> from
> > >> > > > > > answering
> > >> > > > > > > > > > >> > > MetadataRequest
> > >> > > > > > > > > > >> > > > from any clients.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > I am not sure that the "FirstDirtyOffset"
> field
> > >> > fully
> > >> > > > > > > addresses
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> > issue.
> > >> > > > > > > > > > >> > > Currently, the deletion tombstone is not
> > removed
> > >> > > > > immediately
> > >> > > > > > > > > after a
> > >> > > > > > > > > > >> > round
> > >> > > > > > > > > > >> > > of cleaning. It's removed after a delay in a
> > >> > > subsequent
> > >> > > > > > round
> > >> > > > > > > of
> > >> > > > > > > > > > >> > cleaning.
> > >> > > > > > > > > > >> > > Consider an example where a key insertion is
> at
> > >> > offset
> > >> > > > 200
> > >> > > > > > > and a
> > >> > > > > > > > > > >> deletion
> > >> > > > > > > > > > >> > > tombstone of the key is at 400. Initially,
> > >> > > > > FirstDirtyOffset
> > >> > > > > > is
> > >> > > > > > > > at
> > >> > > > > > > > > > >> 300. A
> > >> > > > > > > > > > >> > > follower/observer fetches from offset 0  and
> > >> fetches
> > >> > > the
> > >> > > > > key
> > >> > > > > > > at
> > >> > > > > > > > > > offset
> > >> > > > > > > > > > >> > 200.
> > >> > > > > > > > > > >> > > A few rounds of cleaning happen.
> > >> FirstDirtyOffset is
> > >> > > at
> > >> > > > > 500
> > >> > > > > > > and
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> > > tombstone at 400 is physically removed. The
> > >> > > > > > follower/observer
> > >> > > > > > > > > > >> continues
> > >> > > > > > > > > > >> > the
> > >> > > > > > > > > > >> > > fetch, but misses offset 400. It catches all
> > the
> > >> way
> > >> > > to
> > >> > > > > > > > > > >> FirstDirtyOffset
> > >> > > > > > > > > > >> > > and declares its metadata as ready. However,
> > its
> > >> > > > metadata
> > >> > > > > > > could
> > >> > > > > > > > be
> > >> > > > > > > > > > >> stale
> > >> > > > > > > > > > >> > > since it actually misses the deletion of the
> > key.
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > Yeah good question, I should have put more
> > >> details
> > >> > in
> > >> > > my
> > >> > > > > > > > > explanation
> > >> > > > > > > > > > >> :)
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > The idea is that we will adjust the log
> > compaction
> > >> for
> > >> > > > this
> > >> > > > > > raft
> > >> > > > > > > > > based
> > >> > > > > > > > > > >> > metadata log: before more details to be
> > explained,
> > >> > since
> > >> > > > we
> > >> > > > > > have
> > >> > > > > > > > two
> > >> > > > > > > > > > >> types
> > >> > > > > > > > > > >> > of "watermarks" here, whereas in Kafka the
> > >> watermark
> > >> > > > > indicates
> > >> > > > > > > > where
> > >> > > > > > > > > > >> every
> > >> > > > > > > > > > >> > replica have replicated up to and in Raft the
> > >> > watermark
> > >> > > > > > > indicates
> > >> > > > > > > > > > where
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > majority of replicas (here only indicating
> voters
> > >> of
> > >> > the
> > >> > > > > > quorum,
> > >> > > > > > > > not
> > >> > > > > > > > > > >> > counting observers) have replicated up to,
> let's
> > >> call
> > >> > > them
> > >> > > > > > Kafka
> > >> > > > > > > > > > >> watermark
> > >> > > > > > > > > > >> > and Raft watermark. For this special log, we
> > would
> > >> > > > maintain
> > >> > > > > > both
> > >> > > > > > > > > > >> > watermarks.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > When log compacting on the leader, we would
> only
> > >> > compact
> > >> > > > up
> > >> > > > > to
> > >> > > > > > > the
> > >> > > > > > > > > > Kafka
> > >> > > > > > > > > > >> > watermark, i.e. if there is at least one voter
> > who
> > >> > have
> > >> > > > not
> > >> > > > > > > > > replicated
> > >> > > > > > > > > > >> an
> > >> > > > > > > > > > >> > entry, it would not be compacted. The
> > >> "dirty-offset"
> > >> > is
> > >> > > > the
> > >> > > > > > > offset
> > >> > > > > > > > > > that
> > >> > > > > > > > > > >> > we've compacted up to and is communicated to
> > other
> > >> > > voters,
> > >> > > > > and
> > >> > > > > > > the
> > >> > > > > > > > > > other
> > >> > > > > > > > > > >> > voters would also compact up to this value ---
> > i.e.
> > >> > the
> > >> > > > > > > difference
> > >> > > > > > > > > > here
> > >> > > > > > > > > > >> is
> > >> > > > > > > > > > >> > that instead of letting each replica doing log
> > >> > > compaction
> > >> > > > > > > > > > independently,
> > >> > > > > > > > > > >> > we'll have the leader to decide upon which
> offset
> > >> to
> > >> > > > compact
> > >> > > > > > to,
> > >> > > > > > > > and
> > >> > > > > > > > > > >> > propagate this value to others to follow, in a
> > more
> > >> > > > > > coordinated
> > >> > > > > > > > > > manner.
> > >> > > > > > > > > > >> > Also note when there are new voters joining the
> > >> quorum
> > >> > > who
> > >> > > > > has
> > >> > > > > > > not
> > >> > > > > > > > > > >> > replicated up to the dirty-offset, of because
> of
> > >> other
> > >> > > > > issues
> > >> > > > > > > they
> > >> > > > > > > > > > >> > truncated their logs to below the dirty-offset,
> > >> they'd
> > >> > > > have
> > >> > > > > to
> > >> > > > > > > > > > >> re-bootstrap
> > >> > > > > > > > > > >> > from the beginning, and during this period of
> > time
> > >> the
> > >> > > > > leader
> > >> > > > > > > > > learned
> > >> > > > > > > > > > >> about
> > >> > > > > > > > > > >> > this lagging voter would not advance the
> > watermark
> > >> > (also
> > >> > > > it
> > >> > > > > > > would
> > >> > > > > > > > > not
> > >> > > > > > > > > > >> > decrement it), and hence not compacting either,
> > >> until
> > >> > > the
> > >> > > > > > > voter(s)
> > >> > > > > > > > > has
> > >> > > > > > > > > > >> > caught up to that dirty-offset.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > So back to your example above, before the
> > bootstrap
> > >> > > voter
> > >> > > > > gets
> > >> > > > > > > to
> > >> > > > > > > > > 300
> > >> > > > > > > > > > no
> > >> > > > > > > > > > >> > log compaction would happen on the leader; and
> > >> until
> > >> > > later
> > >> > > > > > when
> > >> > > > > > > > the
> > >> > > > > > > > > > >> voter
> > >> > > > > > > > > > >> > have got to beyond 400 and hence replicated
> that
> > >> > > > tombstone,
> > >> > > > > > the
> > >> > > > > > > > log
> > >> > > > > > > > > > >> > compaction would possibly get to that tombstone
> > and
> > >> > > remove
> > >> > > > > it.
> > >> > > > > > > Say
> > >> > > > > > > > > > >> later it
> > >> > > > > > > > > > >> > the leader's log compaction reaches 500, it can
> > >> send
> > >> > > this
> > >> > > > > back
> > >> > > > > > > to
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> voter
> > >> > > > > > > > > > >> > who can then also compact locally up to 500.
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > > > > 105. Quorum State: In addition to
> VotedId,
> > >> do we
> > >> > > > need
> > >> > > > > > the
> > >> > > > > > > > > epoch
> > >> > > > > > > > > > >> > > > > corresponding to VotedId? Over time, the
> > same
> > >> > > broker
> > >> > > > > Id
> > >> > > > > > > > could
> > >> > > > > > > > > be
> > >> > > > > > > > > > >> > voted
> > >> > > > > > > > > > >> > > in
> > >> > > > > > > > > > >> > > > > different generations with different
> epoch.
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > Hmm, this is a good point. Originally I
> think
> > >> the
> > >> > > > > > > > "LeaderEpoch"
> > >> > > > > > > > > > >> field
> > >> > > > > > > > > > >> > in
> > >> > > > > > > > > > >> > > > that file is corresponding to the "latest
> > known
> > >> > > leader
> > >> > > > > > > epoch",
> > >> > > > > > > > > not
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > > "current leader epoch". For example, if the
> > >> > current
> > >> > > > > epoch
> > >> > > > > > is
> > >> > > > > > > > N,
> > >> > > > > > > > > > and
> > >> > > > > > > > > > >> > then
> > >> > > > > > > > > > >> > > a
> > >> > > > > > > > > > >> > > > vote-request with epoch N+1 is received and
> > the
> > >> > > voter
> > >> > > > > > > granted
> > >> > > > > > > > > the
> > >> > > > > > > > > > >> vote
> > >> > > > > > > > > > >> > > for
> > >> > > > > > > > > > >> > > > it, then it means for this voter it knows
> the
> > >> > > "latest
> > >> > > > > > epoch"
> > >> > > > > > > > is
> > >> > > > > > > > > N
> > >> > > > > > > > > > +
> > >> > > > > > > > > > >> 1
> > >> > > > > > > > > > >> > > > although it is unknown if that sending
> > >> candidate
> > >> > > will
> > >> > > > > > indeed
> > >> > > > > > > > > > become
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > new
> > >> > > > > > > > > > >> > > > leader (which would only be notified via
> > >> > > begin-quorum
> > >> > > > > > > > request).
> > >> > > > > > > > > > >> > However,
> > >> > > > > > > > > > >> > > > when persisting the quorum state, we would
> > >> encode
> > >> > > > > > > leader-epoch
> > >> > > > > > > > > to
> > >> > > > > > > > > > >> N+1,
> > >> > > > > > > > > > >> > > > while the leaderId to be the older leader.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > But now thinking about this a bit more, I
> > feel
> > >> we
> > >> > > > should
> > >> > > > > > use
> > >> > > > > > > > two
> > >> > > > > > > > > > >> > separate
> > >> > > > > > > > > > >> > > > epochs, one for the "lates known" and one
> for
> > >> the
> > >> > > > > > "current"
> > >> > > > > > > to
> > >> > > > > > > > > > pair
> > >> > > > > > > > > > >> > with
> > >> > > > > > > > > > >> > > > the leaderId. I will update the wiki page.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > Hmm, it's kind of weird to bump up the leader
> > >> epoch
> > >> > > > before
> > >> > > > > > the
> > >> > > > > > > > new
> > >> > > > > > > > > > >> leader
> > >> > > > > > > > > > >> > > is actually elected, right.
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > > > 106. "OFFSET_OUT_OF_RANGE: Used in the
> > >> > > > > > FetchQuorumRecords
> > >> > > > > > > > API
> > >> > > > > > > > > to
> > >> > > > > > > > > > >> > > indicate
> > >> > > > > > > > > > >> > > > > that the follower has fetched from an
> > invalid
> > >> > > offset
> > >> > > > > and
> > >> > > > > > > > > should
> > >> > > > > > > > > > >> > > truncate
> > >> > > > > > > > > > >> > > > to
> > >> > > > > > > > > > >> > > > > the offset/epoch indicated in the
> > response."
> > >> > > > Observers
> > >> > > > > > > can't
> > >> > > > > > > > > > >> truncate
> > >> > > > > > > > > > >> > > > their
> > >> > > > > > > > > > >> > > > > logs. What should they do with
> > >> > > OFFSET_OUT_OF_RANGE?
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > I'm not sure if I understand your question?
> > >> > > Observers
> > >> > > > > > should
> > >> > > > > > > > > still
> > >> > > > > > > > > > >> be
> > >> > > > > > > > > > >> > > able
> > >> > > > > > > > > > >> > > > to truncate their logs as well.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > Hmm, I thought only the quorum nodes have
> local
> > >> logs
> > >> > > and
> > >> > > > > > > > observers
> > >> > > > > > > > > > >> don't?
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > > 107. "The leader will continue sending
> > >> > > > BeginQuorumEpoch
> > >> > > > > to
> > >> > > > > > > > each
> > >> > > > > > > > > > >> known
> > >> > > > > > > > > > >> > > > voter
> > >> > > > > > > > > > >> > > > > until it has received its endorsement."
> If
> > a
> > >> > voter
> > >> > > > is
> > >> > > > > > down
> > >> > > > > > > > > for a
> > >> > > > > > > > > > >> long
> > >> > > > > > > > > > >> > > > time,
> > >> > > > > > > > > > >> > > > > sending BeginQuorumEpoch seems to add
> > >> > unnecessary
> > >> > > > > > > overhead.
> > >> > > > > > > > > > >> > Similarly,
> > >> > > > > > > > > > >> > > > if a
> > >> > > > > > > > > > >> > > > > follower stops sending
> FetchQuorumRecords,
> > >> does
> > >> > > the
> > >> > > > > > leader
> > >> > > > > > > > > keep
> > >> > > > > > > > > > >> > sending
> > >> > > > > > > > > > >> > > > > BeginQuorumEpoch?
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > Regarding BeginQuorumEpoch: that is a good
> > >> point.
> > >> > > The
> > >> > > > > > > > > > >> > begin-quorum-epoch
> > >> > > > > > > > > > >> > > > request is for voters to quickly get the
> new
> > >> > leader
> > >> > > > > > > > information;
> > >> > > > > > > > > > >> > however
> > >> > > > > > > > > > >> > > > even if they do not get them they can still
> > >> > > eventually
> > >> > > > > > learn
> > >> > > > > > > > > about
> > >> > > > > > > > > > >> that
> > >> > > > > > > > > > >> > > > from others via gossiping FindQuorum. I
> think
> > >> we
> > >> > can
> > >> > > > > > adjust
> > >> > > > > > > > the
> > >> > > > > > > > > > >> logic
> > >> > > > > > > > > > >> > to
> > >> > > > > > > > > > >> > > > e.g. exponential back-off or with a limited
> > >> > > > num.retries.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > Regarding FetchQuorumRecords: if the
> follower
> > >> > sends
> > >> > > > > > > > > > >> FetchQuorumRecords
> > >> > > > > > > > > > >> > > > already, it means that follower already
> knows
> > >> that
> > >> > > the
> > >> > > > > > > broker
> > >> > > > > > > > is
> > >> > > > > > > > > > the
> > >> > > > > > > > > > >> > > > leader, and hence we can stop retrying
> > >> > > > BeginQuorumEpoch;
> > >> > > > > > > > however
> > >> > > > > > > > > > it
> > >> > > > > > > > > > >> is
> > >> > > > > > > > > > >> > > > possible that after a follower sends
> > >> > > > FetchQuorumRecords
> > >> > > > > > > > already,
> > >> > > > > > > > > > >> > suddenly
> > >> > > > > > > > > > >> > > > it stops send it (possibly because it
> learned
> > >> > about
> > >> > > a
> > >> > > > > > higher
> > >> > > > > > > > > epoch
> > >> > > > > > > > > > >> > > leader),
> > >> > > > > > > > > > >> > > > and hence this broker may be a "zombie"
> > leader
> > >> and
> > >> > > we
> > >> > > > > > > propose
> > >> > > > > > > > to
> > >> > > > > > > > > > use
> > >> > > > > > > > > > >> > the
> > >> > > > > > > > > > >> > > > fetch.timeout to let the leader to try to
> > >> verify
> > >> > if
> > >> > > it
> > >> > > > > has
> > >> > > > > > > > > already
> > >> > > > > > > > > > >> been
> > >> > > > > > > > > > >> > > > stale.
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > It just seems that we should handle these two
> > >> cases
> > >> > > in a
> > >> > > > > > > > > consistent
> > >> > > > > > > > > > >> way?
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > Yes I agree, on the leader's side, the
> > >> > > > FetchQuorumRecords
> > >> > > > > > > from a
> > >> > > > > > > > > > >> follower
> > >> > > > > > > > > > >> > could mean that we no longer needs to send
> > >> > > > BeginQuorumEpoch
> > >> > > > > > > > anymore
> > >> > > > > > > > > > ---
> > >> > > > > > > > > > >> and
> > >> > > > > > > > > > >> > it is already part of our current
> implementations
> > >> in
> > >> > > > > > > > > > >> >
> > >> > > https://github.com/confluentinc/kafka/commits/kafka-raft
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > > Thanks,
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > Jun
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > > Jun
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > > On Wed, Apr 29, 2020 at 8:56 PM Guozhang
> > >> Wang <
> > >> > > > > > > > > > wangg...@gmail.com
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > > > wrote:
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > > > > Hello Leonard,
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > Thanks for your comments, I'm relying
> in
> > >> line
> > >> > > > below:
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > On Wed, Apr 29, 2020 at 1:58 AM Wang
> > >> (Leonard)
> > >> > > Ge
> > >> > > > <
> > >> > > > > > > > > > >> > w...@confluent.io>
> > >> > > > > > > > > > >> > > > > > wrote:
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > > Hi Kafka developers,
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > It's great to see this proposal and
> it
> > >> took
> > >> > me
> > >> > > > > some
> > >> > > > > > > time
> > >> > > > > > > > > to
> > >> > > > > > > > > > >> > finish
> > >> > > > > > > > > > >> > > > > > reading
> > >> > > > > > > > > > >> > > > > > > it.
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > And I have the following questions
> > about
> > >> the
> > >> > > > > > Proposal:
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > >    - How do we plan to test this
> design
> > >> to
> > >> > > > ensure
> > >> > > > > > its
> > >> > > > > > > > > > >> > correctness?
> > >> > > > > > > > > > >> > > Or
> > >> > > > > > > > > > >> > > > > > more
> > >> > > > > > > > > > >> > > > > > >    broadly, how do we ensure that our
> > new
> > >> > > ‘pull’
> > >> > > > > > based
> > >> > > > > > > > > model
> > >> > > > > > > > > > >> is
> > >> > > > > > > > > > >> > > > > > functional
> > >> > > > > > > > > > >> > > > > > > and
> > >> > > > > > > > > > >> > > > > > >    correct given that it is different
> > >> from
> > >> > the
> > >> > > > > > > original
> > >> > > > > > > > > RAFT
> > >> > > > > > > > > > >> > > > > > implementation
> > >> > > > > > > > > > >> > > > > > >    which has formal proof of
> > correctness?
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > We have two planned verifications on
> the
> > >> > > > correctness
> > >> > > > > > and
> > >> > > > > > > > > > >> liveness
> > >> > > > > > > > > > >> > of
> > >> > > > > > > > > > >> > > > the
> > >> > > > > > > > > > >> > > > > > design. One is via model verification
> > >> (TLA+)
> > >> > > > > > > > > > >> > > > > >
> > >> > > > https://github.com/guozhangwang/kafka-specification
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > Another is via the concurrent
> simulation
> > >> tests
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >>
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >    - Have we considered any sensible
> > >> defaults
> > >> > > for
> > >> > > > > the
> > >> > > > > > > > > > >> > configuration,
> > >> > > > > > > > > > >> > > > i.e.
> > >> > > > > > > > > > >> > > > > > >    all the election timeout, fetch
> time
> > >> out,
> > >> > > > etc.?
> > >> > > > > > Or
> > >> > > > > > > we
> > >> > > > > > > > > > want
> > >> > > > > > > > > > >> to
> > >> > > > > > > > > > >> > > > leave
> > >> > > > > > > > > > >> > > > > > > this to
> > >> > > > > > > > > > >> > > > > > >    a later stage when we do the
> > >> performance
> > >> > > > > testing,
> > >> > > > > > > > etc.
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > This is a good question, the reason we
> > did
> > >> not
> > >> > > set
> > >> > > > > any
> > >> > > > > > > > > default
> > >> > > > > > > > > > >> > values
> > >> > > > > > > > > > >> > > > for
> > >> > > > > > > > > > >> > > > > > the timeout configurations is that we
> > >> think it
> > >> > > may
> > >> > > > > > take
> > >> > > > > > > > some
> > >> > > > > > > > > > >> > > > benchmarking
> > >> > > > > > > > > > >> > > > > > experiments to get these defaults
> right.
> > >> Some
> > >> > > > > > high-level
> > >> > > > > > > > > > >> principles
> > >> > > > > > > > > > >> > > to
> > >> > > > > > > > > > >> > > > > > consider: 1) the fetch.timeout should
> be
> > >> > around
> > >> > > > the
> > >> > > > > > same
> > >> > > > > > > > > scale
> > >> > > > > > > > > > >> with
> > >> > > > > > > > > > >> > > zk
> > >> > > > > > > > > > >> > > > > > session timeout, which is now 18
> seconds
> > by
> > >> > > > default
> > >> > > > > --
> > >> > > > > > > in
> > >> > > > > > > > > > >> practice
> > >> > > > > > > > > > >> > > > we've
> > >> > > > > > > > > > >> > > > > > seen unstable networks having more than
> > 10
> > >> > secs
> > >> > > of
> > >> > > > > > > > transient
> > >> > > > > > > > > > >> > > > > connectivity,
> > >> > > > > > > > > > >> > > > > > 2) the election.timeout, however,
> should
> > be
> > >> > > > smaller
> > >> > > > > > than
> > >> > > > > > > > the
> > >> > > > > > > > > > >> fetch
> > >> > > > > > > > > > >> > > > > timeout
> > >> > > > > > > > > > >> > > > > > as is also suggested as a practical
> > >> > optimization
> > >> > > > in
> > >> > > > > > > > > > literature:
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > >
> > >> https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > Some more discussions can be found
> here:
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > >
> > >> > > > https://github.com/confluentinc/kafka/pull/301/files#r415420081
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > >    - Have we considered piggybacking
> > >> > > > > > > `BeginQuorumEpoch`
> > >> > > > > > > > > with
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > `
> > >> > > > > > > > > > >> > > > > > >    FetchQuorumRecords`? I might be
> > >> missing
> > >> > > > > something
> > >> > > > > > > > > obvious
> > >> > > > > > > > > > >> but
> > >> > > > > > > > > > >> > I
> > >> > > > > > > > > > >> > > am
> > >> > > > > > > > > > >> > > > > > just
> > >> > > > > > > > > > >> > > > > > >    wondering why don’t we just use
> the
> > >> > > > > `FindQuorum`
> > >> > > > > > > and
> > >> > > > > > > > > > >> > > > > > > `FetchQuorumRecords`
> > >> > > > > > > > > > >> > > > > > >    APIs and remove the
> > `BeginQuorumEpoch`
> > >> > API?
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > Note that Begin/EndQuorumEpoch is sent
> > from
> > >> > > leader
> > >> > > > > ->
> > >> > > > > > > > other
> > >> > > > > > > > > > >> voter
> > >> > > > > > > > > > >> > > > > > followers, while FindQuorum / Fetch are
> > >> sent
> > >> > > from
> > >> > > > > > > follower
> > >> > > > > > > > > to
> > >> > > > > > > > > > >> > leader.
> > >> > > > > > > > > > >> > > > > > Arguably one can eventually realize the
> > new
> > >> > > leader
> > >> > > > > and
> > >> > > > > > > > epoch
> > >> > > > > > > > > > via
> > >> > > > > > > > > > >> > > > > gossiping
> > >> > > > > > > > > > >> > > > > > FindQuorum, but that could in practice
> > >> > require a
> > >> > > > > long
> > >> > > > > > > > delay.
> > >> > > > > > > > > > >> > Having a
> > >> > > > > > > > > > >> > > > > > leader -> other voters request helps
> the
> > >> new
> > >> > > > leader
> > >> > > > > > > epoch
> > >> > > > > > > > to
> > >> > > > > > > > > > be
> > >> > > > > > > > > > >> > > > > propagated
> > >> > > > > > > > > > >> > > > > > faster under a pull model.
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > >    - And about the
> `FetchQuorumRecords`
> > >> > > response
> > >> > > > > > > schema,
> > >> > > > > > > > > in
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > > > `Records`
> > >> > > > > > > > > > >> > > > > > >    field of the response, is it just
> > one
> > >> > > record
> > >> > > > or
> > >> > > > > > all
> > >> > > > > > > > the
> > >> > > > > > > > > > >> > records
> > >> > > > > > > > > > >> > > > > > starting
> > >> > > > > > > > > > >> > > > > > >    from the FetchOffset? It seems a
> lot
> > >> more
> > >> > > > > > efficient
> > >> > > > > > > > if
> > >> > > > > > > > > we
> > >> > > > > > > > > > >> sent
> > >> > > > > > > > > > >> > > all
> > >> > > > > > > > > > >> > > > > the
> > >> > > > > > > > > > >> > > > > > >    records during the bootstrapping
> of
> > >> the
> > >> > > > > brokers.
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > Yes the fetching is batched:
> FetchOffset
> > is
> > >> > just
> > >> > > > the
> > >> > > > > > > > > starting
> > >> > > > > > > > > > >> > offset
> > >> > > > > > > > > > >> > > of
> > >> > > > > > > > > > >> > > > > the
> > >> > > > > > > > > > >> > > > > > batch of records.
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > >    - Regarding the disruptive broker
> > >> issues,
> > >> > > > does
> > >> > > > > > our
> > >> > > > > > > > pull
> > >> > > > > > > > > > >> based
> > >> > > > > > > > > > >> > > > model
> > >> > > > > > > > > > >> > > > > > >    suffer from it? If so, have we
> > >> considered
> > >> > > the
> > >> > > > > > > > Pre-Vote
> > >> > > > > > > > > > >> stage?
> > >> > > > > > > > > > >> > If
> > >> > > > > > > > > > >> > > > > not,
> > >> > > > > > > > > > >> > > > > > > why?
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > The disruptive broker is stated in the
> > >> > original
> > >> > > > Raft
> > >> > > > > > > paper
> > >> > > > > > > > > > >> which is
> > >> > > > > > > > > > >> > > the
> > >> > > > > > > > > > >> > > > > > result of the push model design. Our
> > >> analysis
> > >> > > > showed
> > >> > > > > > > that
> > >> > > > > > > > > with
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > pull
> > >> > > > > > > > > > >> > > > > > model it is no longer an issue.
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > > Thanks a lot for putting this up,
> and I
> > >> hope
> > >> > > > that
> > >> > > > > my
> > >> > > > > > > > > > questions
> > >> > > > > > > > > > >> > can
> > >> > > > > > > > > > >> > > be
> > >> > > > > > > > > > >> > > > > of
> > >> > > > > > > > > > >> > > > > > > some value to make this KIP better.
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > Hope to hear from you soon!
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > Best wishes,
> > >> > > > > > > > > > >> > > > > > > Leonard
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > On Wed, Apr 29, 2020 at 1:46 AM Colin
> > >> > McCabe <
> > >> > > > > > > > > > >> cmcc...@apache.org
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> > > > > wrote:
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > > Hi Jason,
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > It's amazing to see this coming
> > >> together
> > >> > :)
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > I haven't had a chance to read in
> > >> detail,
> > >> > > but
> > >> > > > I
> > >> > > > > > read
> > >> > > > > > > > the
> > >> > > > > > > > > > >> > outline
> > >> > > > > > > > > > >> > > > and
> > >> > > > > > > > > > >> > > > > a
> > >> > > > > > > > > > >> > > > > > > few
> > >> > > > > > > > > > >> > > > > > > > things jumped out at me.
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > First, for every epoch that is 32
> > bits
> > >> > > rather
> > >> > > > > than
> > >> > > > > > > > 64, I
> > >> > > > > > > > > > >> sort
> > >> > > > > > > > > > >> > of
> > >> > > > > > > > > > >> > > > > wonder
> > >> > > > > > > > > > >> > > > > > > if
> > >> > > > > > > > > > >> > > > > > > > that's a good long-term choice.  I
> > keep
> > >> > > > reading
> > >> > > > > > > about
> > >> > > > > > > > > > stuff
> > >> > > > > > > > > > >> > like
> > >> > > > > > > > > > >> > > > > this:
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> > >> > > > > > > > .
> > >> > > > > > > > > > >> > > Obviously,
> > >> > > > > > > > > > >> > > > > > that
> > >> > > > > > > > > > >> > > > > > > > JIRA is about zxid, which
> increments
> > >> much
> > >> > > > faster
> > >> > > > > > > than
> > >> > > > > > > > we
> > >> > > > > > > > > > >> expect
> > >> > > > > > > > > > >> > > > these
> > >> > > > > > > > > > >> > > > > > > > leader epochs to, but it would
> still
> > be
> > >> > good
> > >> > > > to
> > >> > > > > > see
> > >> > > > > > > > some
> > >> > > > > > > > > > >> rough
> > >> > > > > > > > > > >> > > > > > > calculations
> > >> > > > > > > > > > >> > > > > > > > about how long 32 bits (or really,
> 31
> > >> > bits)
> > >> > > > will
> > >> > > > > > > last
> > >> > > > > > > > us
> > >> > > > > > > > > > in
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > > cases
> > >> > > > > > > > > > >> > > > > > > where
> > >> > > > > > > > > > >> > > > > > > > we're using it, and what the space
> > >> savings
> > >> > > > we're
> > >> > > > > > > > getting
> > >> > > > > > > > > > >> really
> > >> > > > > > > > > > >> > > is.
> > >> > > > > > > > > > >> > > > > It
> > >> > > > > > > > > > >> > > > > > > > seems like in most cases the
> tradeoff
> > >> may
> > >> > > not
> > >> > > > be
> > >> > > > > > > worth
> > >> > > > > > > > > it?
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > Another thing I've been thinking
> > about
> > >> is
> > >> > > how
> > >> > > > we
> > >> > > > > > do
> > >> > > > > > > > > > >> > > > bootstrapping.  I
> > >> > > > > > > > > > >> > > > > > > > would prefer to be in a world where
> > >> > > > formatting a
> > >> > > > > > new
> > >> > > > > > > > > Kafka
> > >> > > > > > > > > > >> node
> > >> > > > > > > > > > >> > > > was a
> > >> > > > > > > > > > >> > > > > > > first
> > >> > > > > > > > > > >> > > > > > > > class operation explicitly
> initiated
> > by
> > >> > the
> > >> > > > > admin,
> > >> > > > > > > > > rather
> > >> > > > > > > > > > >> than
> > >> > > > > > > > > > >> > > > > > something
> > >> > > > > > > > > > >> > > > > > > > that happened implicitly when you
> > >> started
> > >> > up
> > >> > > > the
> > >> > > > > > > > broker
> > >> > > > > > > > > > and
> > >> > > > > > > > > > >> > > things
> > >> > > > > > > > > > >> > > > > > > "looked
> > >> > > > > > > > > > >> > > > > > > > blank."
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > The first problem is that things
> can
> > >> "look
> > >> > > > > blank"
> > >> > > > > > > > > > >> accidentally
> > >> > > > > > > > > > >> > if
> > >> > > > > > > > > > >> > > > the
> > >> > > > > > > > > > >> > > > > > > > storage system is having a bad day.
> > >> > Clearly
> > >> > > > in
> > >> > > > > > the
> > >> > > > > > > > > > non-Raft
> > >> > > > > > > > > > >> > > world,
> > >> > > > > > > > > > >> > > > > > this
> > >> > > > > > > > > > >> > > > > > > > leads to data loss if the broker
> that
> > >> is
> > >> > > > > > (re)started
> > >> > > > > > > > > this
> > >> > > > > > > > > > >> way
> > >> > > > > > > > > > >> > was
> > >> > > > > > > > > > >> > > > the
> > >> > > > > > > > > > >> > > > > > > > leader for some partitions.
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > The second problem is that we have
> a
> > >> bit
> > >> > of
> > >> > > a
> > >> > > > > > > chicken
> > >> > > > > > > > > and
> > >> > > > > > > > > > >> egg
> > >> > > > > > > > > > >> > > > problem
> > >> > > > > > > > > > >> > > > > > > with
> > >> > > > > > > > > > >> > > > > > > > certain configuration keys.  For
> > >> example,
> > >> > > > maybe
> > >> > > > > > you
> > >> > > > > > > > want
> > >> > > > > > > > > > to
> > >> > > > > > > > > > >> > > > configure
> > >> > > > > > > > > > >> > > > > > > some
> > >> > > > > > > > > > >> > > > > > > > connection security settings in
> your
> > >> > > cluster,
> > >> > > > > but
> > >> > > > > > > you
> > >> > > > > > > > > > don't
> > >> > > > > > > > > > >> > want
> > >> > > > > > > > > > >> > > > them
> > >> > > > > > > > > > >> > > > > > to
> > >> > > > > > > > > > >> > > > > > > > ever be stored in a plaintext
> config
> > >> file.
> > >> > > > (For
> > >> > > > > > > > > example,
> > >> > > > > > > > > > >> SCRAM
> > >> > > > > > > > > > >> > > > > > > passwords,
> > >> > > > > > > > > > >> > > > > > > > etc.)  You could use a broker API
> to
> > >> set
> > >> > the
> > >> > > > > > > > > > configuration,
> > >> > > > > > > > > > >> but
> > >> > > > > > > > > > >> > > > that
> > >> > > > > > > > > > >> > > > > > > brings
> > >> > > > > > > > > > >> > > > > > > > up the chicken and egg problem.
> The
> > >> > broker
> > >> > > > > needs
> > >> > > > > > to
> > >> > > > > > > > be
> > >> > > > > > > > > > >> > > configured
> > >> > > > > > > > > > >> > > > to
> > >> > > > > > > > > > >> > > > > > > know
> > >> > > > > > > > > > >> > > > > > > > how to talk to you, but you need to
> > >> > > configure
> > >> > > > it
> > >> > > > > > > > before
> > >> > > > > > > > > > you
> > >> > > > > > > > > > >> can
> > >> > > > > > > > > > >> > > > talk
> > >> > > > > > > > > > >> > > > > to
> > >> > > > > > > > > > >> > > > > > > > it.  Using an external secret
> manager
> > >> like
> > >> > > > Vault
> > >> > > > > > is
> > >> > > > > > > > one
> > >> > > > > > > > > > way
> > >> > > > > > > > > > >> to
> > >> > > > > > > > > > >> > > > solve
> > >> > > > > > > > > > >> > > > > > > this,
> > >> > > > > > > > > > >> > > > > > > > but not everyone uses an external
> > >> secret
> > >> > > > > manager.
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > quorum.voters seems like a similar
> > >> > > > configuration
> > >> > > > > > > key.
> > >> > > > > > > > > In
> > >> > > > > > > > > > >> the
> > >> > > > > > > > > > >> > > > current
> > >> > > > > > > > > > >> > > > > > > KIP,
> > >> > > > > > > > > > >> > > > > > > > this is only read if there is no
> > other
> > >> > > > > > configuration
> > >> > > > > > > > > > >> specifying
> > >> > > > > > > > > > >> > > the
> > >> > > > > > > > > > >> > > > > > > quorum
> > >> > > > > > > > > > >> > > > > > > > voter set.  If we had a kafka.mkfs
> > >> > command,
> > >> > > we
> > >> > > > > > > > wouldn't
> > >> > > > > > > > > > need
> > >> > > > > > > > > > >> > this
> > >> > > > > > > > > > >> > > > key
> > >> > > > > > > > > > >> > > > > > > > because we could assume that there
> > was
> > >> > > always
> > >> > > > > > quorum
> > >> > > > > > > > > > >> > information
> > >> > > > > > > > > > >> > > > > stored
> > >> > > > > > > > > > >> > > > > > > > locally.
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > best,
> > >> > > > > > > > > > >> > > > > > > > Colin
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > On Thu, Apr 16, 2020, at 16:44,
> Jason
> > >> > > > Gustafson
> > >> > > > > > > wrote:
> > >> > > > > > > > > > >> > > > > > > > > Hi All,
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > > I'd like to start a discussion on
> > >> > KIP-595:
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >>
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > >> > > > > > > > > > >> > > > > > > > .
> > >> > > > > > > > > > >> > > > > > > > > This proposal specifies a Raft
> > >> protocol
> > >> > to
> > >> > > > > > > > ultimately
> > >> > > > > > > > > > >> replace
> > >> > > > > > > > > > >> > > > > > Zookeeper
> > >> > > > > > > > > > >> > > > > > > > > as
> > >> > > > > > > > > > >> > > > > > > > > documented in KIP-500. Please
> take
> > a
> > >> > look
> > >> > > > and
> > >> > > > > > > share
> > >> > > > > > > > > your
> > >> > > > > > > > > > >> > > > thoughts.
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > > A few minor notes to set the
> stage
> > a
> > >> > > little
> > >> > > > > bit:
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > > - This KIP does not specify the
> > >> > structure
> > >> > > of
> > >> > > > > the
> > >> > > > > > > > > > messages
> > >> > > > > > > > > > >> > used
> > >> > > > > > > > > > >> > > to
> > >> > > > > > > > > > >> > > > > > > > represent
> > >> > > > > > > > > > >> > > > > > > > > metadata in Kafka, nor does it
> > >> specify
> > >> > the
> > >> > > > > > > internal
> > >> > > > > > > > > API
> > >> > > > > > > > > > >> that
> > >> > > > > > > > > > >> > > will
> > >> > > > > > > > > > >> > > > > be
> > >> > > > > > > > > > >> > > > > > > used
> > >> > > > > > > > > > >> > > > > > > > > by the controller. Expect these
> to
> > >> come
> > >> > in
> > >> > > > > later
> > >> > > > > > > > > > >> proposals.
> > >> > > > > > > > > > >> > > Here
> > >> > > > > > > > > > >> > > > we
> > >> > > > > > > > > > >> > > > > > are
> > >> > > > > > > > > > >> > > > > > > > > primarily concerned with the
> > >> replication
> > >> > > > > > protocol
> > >> > > > > > > > and
> > >> > > > > > > > > > >> basic
> > >> > > > > > > > > > >> > > > > > operational
> > >> > > > > > > > > > >> > > > > > > > > mechanics.
> > >> > > > > > > > > > >> > > > > > > > > - We expect many details to
> change
> > >> as we
> > >> > > get
> > >> > > > > > > closer
> > >> > > > > > > > to
> > >> > > > > > > > > > >> > > > integration
> > >> > > > > > > > > > >> > > > > > with
> > >> > > > > > > > > > >> > > > > > > > > the controller. Any changes we
> make
> > >> will
> > >> > > be
> > >> > > > > made
> > >> > > > > > > > > either
> > >> > > > > > > > > > as
> > >> > > > > > > > > > >> > > > > amendments
> > >> > > > > > > > > > >> > > > > > > to
> > >> > > > > > > > > > >> > > > > > > > > this KIP or, in the case of
> larger
> > >> > > changes,
> > >> > > > as
> > >> > > > > > new
> > >> > > > > > > > > > >> proposals.
> > >> > > > > > > > > > >> > > > > > > > > - We have a prototype
> > implementation
> > >> > > which I
> > >> > > > > > will
> > >> > > > > > > > put
> > >> > > > > > > > > > >> online
> > >> > > > > > > > > > >> > > > within
> > >> > > > > > > > > > >> > > > > > the
> > >> > > > > > > > > > >> > > > > > > > > next week which may help in
> > >> > understanding
> > >> > > > some
> > >> > > > > > > > > details.
> > >> > > > > > > > > > It
> > >> > > > > > > > > > >> > has
> > >> > > > > > > > > > >> > > > > > > diverged a
> > >> > > > > > > > > > >> > > > > > > > > little bit from our proposal, so
> I
> > am
> > >> > > > taking a
> > >> > > > > > > > little
> > >> > > > > > > > > > >> time to
> > >> > > > > > > > > > >> > > > bring
> > >> > > > > > > > > > >> > > > > > it
> > >> > > > > > > > > > >> > > > > > > in
> > >> > > > > > > > > > >> > > > > > > > > line. I'll post an update to this
> > >> thread
> > >> > > > when
> > >> > > > > it
> > >> > > > > > > is
> > >> > > > > > > > > > >> available
> > >> > > > > > > > > > >> > > for
> > >> > > > > > > > > > >> > > > > > > review.
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > > Finally, I want to mention that
> > this
> > >> > > > proposal
> > >> > > > > > was
> > >> > > > > > > > > > drafted
> > >> > > > > > > > > > >> by
> > >> > > > > > > > > > >> > > > > myself,
> > >> > > > > > > > > > >> > > > > > > > Boyang
> > >> > > > > > > > > > >> > > > > > > > > Chen, and Guozhang Wang.
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > > > Thanks,
> > >> > > > > > > > > > >> > > > > > > > > Jason
> > >> > > > > > > > > > >> > > > > > > > >
> > >> > > > > > > > > > >> > > > > > > >
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > > > --
> > >> > > > > > > > > > >> > > > > > > Leonard Ge
> > >> > > > > > > > > > >> > > > > > > Software Engineer Intern - Confluent
> > >> > > > > > > > > > >> > > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > > > --
> > >> > > > > > > > > > >> > > > > > -- Guozhang
> > >> > > > > > > > > > >> > > > > >
> > >> > > > > > > > > > >> > > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > > > --
> > >> > > > > > > > > > >> > > > -- Guozhang
> > >> > > > > > > > > > >> > > >
> > >> > > > > > > > > > >> > >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >> > --
> > >> > > > > > > > > > >> > -- Guozhang
> > >> > > > > > > > > > >> >
> > >> > > > > > > > > > >>
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > -- Guozhang
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > -- Guozhang
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-595: A Raft Protocol for the Metadata Quorum

Reply via email to