Hi Calvin,

Thanks for the explanations. I like the idea of using none, balanced, and
aggressive. We also had an offline discussion about why it is good to use a new
config key (basically, so that we can deprecate the old one, which had only
false/true values, in 4.0). With these changes, I am +1.

best,
Colin

On Mon, Sep 18, 2023, at 15:54, Calvin Liu wrote:
> Hi Colin,
> Also, can we deprecate unclean.leader.election.enable in 4.0? Before that,
> we can have both configs, unclean.recovery.strategy and
> unclean.leader.election.enable,
> and use unclean.recovery.enabled to determine which config to use
> during the unclean leader election.
>
> On Mon, Sep 18, 2023 at 3:51 PM Calvin Liu <ca...@confluent.io> wrote:
>
>> Hi Colin,
>> For the unclean.recovery.strategy config name, how about we use the
>> following?
>> None. It basically means no unclean recovery will be performed.
>> Aggressive. It means availability goes first. Whenever the partition can't
>> elect a durable replica, the controller will try the unclean recovery.
>> Balanced. It is the balance point between availability first (Aggressive)
>> and availability last (None). The controller performs unclean recovery when
>> both ISR and ELR are empty.
>>
>>
>> On Fri, Sep 15, 2023 at 11:42 AM Calvin Liu <ca...@confluent.io> wrote:
>>
>>> Hi Colin,
>>>
>>> > So, the proposal is that if someone sets "unclean.leader.election.enable
>>> = true"...
>>>
>>>
>>> The idea is to use one of unclean.leader.election.enable and
>>> unclean.recovery.strategy, based on unclean.recovery.enabled. A possible
>>> version could be:
>>>
>>> if unclean.recovery.enabled {
>>>
>>>     check unclean.recovery.strategy; if set, use it. Otherwise, translate
>>>     unclean.leader.election.enable into an unclean.recovery.strategy
>>>     value.
>>>
>>> } else {
>>>
>>>     use unclean.leader.election.enable
>>>
>>> }
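A rough sketch of that branching in Python (illustrative names only, not actual Kafka code; the translation of the legacy boolean to a strategy value, true to Aggressive and false to Balanced, is an assumption drawn from the discussion, not a committed mapping):

```python
# Illustrative sketch of the config-resolution logic described above.
# All names are hypothetical; the boolean-to-strategy translation is assumed.

def resolve_unclean_config(recovery_enabled, recovery_strategy,
                           leader_election_enable):
    """Return (mechanism, value) the controller would use."""
    if recovery_enabled:
        if recovery_strategy is not None:
            return ("recovery", recovery_strategy)
        # Translate the legacy unclean.leader.election.enable boolean
        # into an unclean.recovery.strategy value.
        return ("recovery",
                "Aggressive" if leader_election_enable else "Balanced")
    # Unclean recovery disabled: fall back to the legacy config as-is.
    return ("election", leader_election_enable)
```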
>>>
>>>
>>> —--------
>>>
>>> >The configuration key should be "unclean.recovery.manager.enabled",
>>> right?
>>>
>>>
>>> I think we have two ways of choosing a leader uncleanly: unclean leader
>>> election and unclean recovery (log inspection), and we try to switch
>>> between them.
>>>
>>> Do you mean we want to develop two ways of performing the unclean
>>> recovery and one of them is using “unclean recovery manager”? I guess we
>>> haven’t discussed the second way.
>>>
>>>
>>> —-------
>>>
>>> >How do these 4 levels of overrides interact with your new
>>> configurations?
>>>
>>>
>>> I do notice that in the KRaft controller code, the method that checks
>>> whether to perform unclean leader election has been hard-coded to false
>>> since 2021 (uncleanLeaderElectionEnabledForTopic). Isn't this a good chance
>>> to completely deprecate unclean.leader.election.enable? We wouldn't even
>>> have to worry about the config conversion.
>>>
>>> On the other hand, whatever the overrides are, as long as the controller
>>> has the final effective unclean.leader.election.enable, the topic-level
>>> config unclean.recovery.strategy, and the cluster-level config
>>> unclean.recovery.enabled, it can determine the correct method to use,
>>> right?
>>>
>>>
>>> On Fri, Sep 15, 2023 at 10:02 AM Colin McCabe <cmcc...@apache.org> wrote:
>>>
>>>> On Thu, Sep 14, 2023, at 22:23, Calvin Liu wrote:
>>>> > Hi Colin
>>>> > 1. I think using the new config name is clearer.
>>>> >        a. The unclean leader election is actually removed if unclean
>>>> > recovery is in use.
>>>> >        b. Using multiple values in unclean.leader.election.enable is
>>>> > confusing, and it will be even more confusing after people forget about
>>>> > this discussion.
>>>>
>>>> Hi Calvin,
>>>>
>>>> So, the proposal is that if someone sets "unclean.leader.election.enable
>>>> = true" but then sets one of your new configurations, the value of
>>>> unclean.leader.election.enable is ignored? That seems less clear to me, not
>>>> more. Just in general, having multiple configuration keys to control the
>>>> same thing confuses users. Basically, they are sitting at a giant control
>>>> panel, and some of the levers do nothing.
>>>>
>>>> > 2. Sorry, I forgot to mention in the response that I did add the
>>>> > unclean.recovery.enabled flag.
>>>>
>>>> The configuration key should be "unclean.recovery.manager.enabled",
>>>> right? Because we can do "unclean recovery" without the manager. Disabling
>>>> the manager just means we use a different mechanism for recovery.
>>>>
>>>> >        c. Maybe I underestimated the challenge of replacing the
>>>> config. Any
>>>> > implementation problems ahead?
>>>>
>>>> There are four levels of overrides for unclean.leader.election.enable.
>>>>
>>>> 1. static configuration for node.
>>>>     This goes in the configuration file, typically named
>>>> server.properties
>>>>
>>>> 2. dynamic configuration for node default
>>>>   ConfigResource(type=BROKER, name="")
>>>>
>>>> 3. dynamic configuration for node
>>>>   ConfigResource(type=BROKER, name=<controller id>)
>>>>
>>>> 4. dynamic configuration for topic
>>>>   ConfigResource(type=TOPIC, name=<topic-name>)
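A sketch of how these four override levels typically resolve, most specific layer winning (hypothetical Python, not the actual controller implementation):

```python
# Illustrative precedence resolution for the four override levels above:
# topic override > per-node dynamic > node-default dynamic > static file.
# Each layer is modeled as a plain dict of config key -> value.

def effective_config(key, static_cfg, node_default_cfg, node_cfg, topic_cfg):
    """Return the effective value for `key`, or None if unset everywhere."""
    for layer in (topic_cfg, node_cfg, node_default_cfg, static_cfg):
        if key in layer:
            return layer[key]
    return None
```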
>>>>
>>>> How do these 4 levels of overrides interact with your new
>>>> configurations? If the new configurations dominate over the old ones, it
>>>> seems like this will get a lot more confusing to implement (and also to
>>>> use.)
>>>>
>>>> Again, I'd recommend just adding some new values to
>>>> unclean.leader.election.enable. It's simple and will prevent user confusion
>>>> (as well as developer confusion.)
>>>>
>>>> best,
>>>> Colin
>>>>
>>>>
>>>> > 3. About the admin client, I mentioned 3 changes in the client.
>>>> > Anything else I missed in the KIP?
>>>> >       a. The client will switch to using the new RPC instead of
>>>> > MetadataRequest for the topics.
>>>> >       b. The TopicPartitionInfo used in TopicDescription needs to add
>>>> > new fields related to the ELR.
>>>> >       c. The outputs will add the ELR-related fields.
>>>> >
>>>> > On Thu, Sep 14, 2023 at 9:19 PM Colin McCabe <cmcc...@apache.org>
>>>> wrote:
>>>> >
>>>> >> Hi Calvin,
>>>> >>
>>>> >> Thanks for the changes.
>>>> >>
>>>> >> 1. Earlier I commented that creating "unclean.recovery.strategy" is not
>>>> >> necessary, and we can just reuse the existing
>>>> >> "unclean.leader.election.enable" configuration key. Let's discuss that.
>>>> >>
>>>> >> 2. I also don't understand why you didn't add a configuration to
>>>> >> enable or disable the Unclean Recovery Manager. This seems like a very
>>>> >> simple way to handle the staging issue which we discussed. The URM can
>>>> >> just be turned off until it is production ready. Let's discuss this.
>>>> >>
>>>> >> 3. You still need to describe the changes to AdminClient that are
>>>> needed
>>>> >> to use DescribeTopicRequest.
>>>> >>
>>>> >> Keep at it. It's looking better. :)
>>>> >>
>>>> >> best,
>>>> >> Colin
>>>> >>
>>>> >>
>>>> >> On Thu, Sep 14, 2023, at 11:03, Calvin Liu wrote:
>>>> >> > Hi Colin
>>>> >> > Thanks for the comments!
>>>> >> >
>>>> >> > I did the following changes
>>>> >> >
>>>> >> >    1. Simplified the API spec section to only include the diff.
>>>> >> >    2. Reordered the HWM requirement section.
>>>> >> >    3. Removed the URM implementation details, keeping only the
>>>> >> >       necessary characteristics to perform the unclean recovery:
>>>> >> >       a. When to perform the unclean recovery.
>>>> >> >       b. Under different configs, how the unclean recovery finds
>>>> >> >          the leader.
>>>> >> >       c. How the configs unclean.leader.election.enable and
>>>> >> >          unclean.recovery.strategy are converted when users
>>>> >> >          enable/disable the unclean recovery.
>>>> >> >    4. More details about how we change the admin client.
>>>> >> >    5. API limits on the GetReplicaLogInfoRequest and
>>>> >> >       DescribeTopicRequest.
>>>> >> >    6. Two metrics added:
>>>> >> >       a. kafka.controller.global_under_min_isr_partition_count
>>>> >> >       b. kafka.controller.unclean_recovery_finished_count
>>>> >> >
>>>> >> >
>>>> >> > On Wed, Sep 13, 2023 at 10:46 AM Colin McCabe <cmcc...@apache.org>
>>>> >> wrote:
>>>> >> >
>>>> >> >> On Tue, Sep 12, 2023, at 17:21, Calvin Liu wrote:
>>>> >> >> > Hi Colin
>>>> >> >> > Thanks for the comments!
>>>> >> >> >
>>>> >> >>
>>>> >> >> Hi Calvin,
>>>> >> >>
>>>> >> >> Thanks again for the KIP.
>>>> >> >>
>>>> >> >> One meta-comment: it's usually better to just do a diff on a
>>>> message
>>>> >> spec
>>>> >> >> file or java file if you're including changes to it in the KIP.
>>>> This is
>>>> >> >> easier to read than looking for "new fields begin" etc. in the
>>>> text, and
>>>> >> >> gracefully handles the case where existing fields were changed.
>>>> >> >>
>>>> >> >> > Rewrite the Additional High Watermark advancement requirement
>>>> >> >> > There was feedback on this section that some readers may not be
>>>> >> familiar
>>>> >> >> > with HWM and Ack=0,1,all requests. This can help them understand
>>>> the
>>>> >> >> > proposal. I will rewrite this part for more readability.
>>>> >> >> >
>>>> >> >>
>>>> >> >> To be clear, I wasn't suggesting dropping either section. I agree
>>>> that
>>>> >> >> they add useful background. I was just suggesting that we should
>>>> discuss
>>>> >> >> the "acks" setting AFTER discussing the new high watermark
>>>> advancement
>>>> >> >> conditions. We also should discuss acks=0. While it isn't
>>>> conceptually
>>>> >> much
>>>> >> >> different than acks=1 here, its omission from this section is
>>>> confusing.
>>>> >> >>
>>>> >> >> > Unclean recovery
>>>> >> >> >
>>>> >> >> > The plan is to replace unclean.leader.election.enable with
>>>> >> >> > unclean.recovery.strategy. If Unclean Recovery is enabled, then it
>>>> >> >> > handles the three options in unclean.recovery.strategy.
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Let’s refine the Unclean Recovery. We have already taken a lot of
>>>> >> >> > suggestions and I hope to enhance the durability of Kafka to the
>>>> next
>>>> >> >> level
>>>> >> >> > with this KIP.
>>>> >> >>
>>>> >> >> I am OK with doing the unclean leader recovery improvements in
>>>> this KIP.
>>>> >> >> However, I think we need to really work on the configuration
>>>> settings.
>>>> >> >>
>>>> >> >> Configuration overrides are often quite messy. For example, in the
>>>> >> >> case where we have log.roll.hours and log.roll.ms, the user has to
>>>> >> >> remember which one takes precedence, and it is not obvious. So,
>>>> >> >> rather than creating a new configuration, why not add additional
>>>> >> >> values to "unclean.leader.election.enable"? I think this will be
>>>> >> >> simpler for people to understand, and simpler in the code as well.
>>>> >> >>
>>>> >> >> What if we continued to use "unclean.leader.election.enable" but
>>>> >> extended
>>>> >> >> it so that it took a string? Then the string could have these
>>>> values:
>>>> >> >>
>>>> >> >> never
>>>> >> >>     never automatically do an unclean leader election under any
>>>> >> conditions
>>>> >> >>
>>>> >> >> false / default
>>>> >> >>     only do an unclean leader election if there may be possible
>>>> data
>>>> >> loss
>>>> >> >>
>>>> >> >> true / always
>>>> >> >>     always do an unclean leader election if we can't immediately
>>>> elect a
>>>> >> >> leader
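A sketch of the value normalization this proposal would imply (hypothetical Python, not actual Kafka code; note that "false" deliberately aliases "default", not "never"):

```python
# Proposed string values for unclean.leader.election.enable, with the
# legacy booleans kept as aliases, as described in the email above.

ALIASES = {"false": "default", "true": "always"}
VALID = {"never", "default", "always"}

def normalize_election_value(value):
    """Map a user-supplied config string to its canonical value."""
    v = value.strip().lower()
    v = ALIASES.get(v, v)   # legacy booleans map onto the new values
    if v not in VALID:
        raise ValueError(f"invalid value: {value!r}")
    return v
```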
>>>> >> >>
>>>> >> >> It's a bit awkward that false maps to default rather than to
>>>> never. But
>>>> >> >> this awkwardness exists if we use two different configuration keys
>>>> as
>>>> >> well.
>>>> >> >> The reason for the awkwardness is that we simply don't want most
>>>> of the
>>>> >> >> people currently setting unclean.leader.election.enable=false to
>>>> get the
>>>> >> >> "never" behavior. We have to bite that bullet. Better to be clear
>>>> and
>>>> >> >> explicit than hide it.
>>>> >> >>
>>>> >> >> Another thing that's a bit awkward is having two different ways to
>>>> >> >> do unclean leader election specified in the KIP. You describe two
>>>> >> >> methods: the simple "choose the last leader" method, and the
>>>> >> >> "unclean recovery manager" method. I understand why you did it this
>>>> >> >> way -- "choose the last leader" is simple, and will help us deliver
>>>> >> >> an implementation quickly, while the URM is preferable in the long
>>>> >> >> term. My suggestion here is to separate the decision of HOW to do
>>>> >> >> unclean leader election from the decision of WHEN to do it.
>>>> >> >>
>>>> >> >> So in other words, have "unclean.leader.election.enable" specify
>>>> when we
>>>> >> >> do unclean leader election, and have a new configuration like
>>>> >> >> "unclean.recovery.manager.enable" to determine if we use the URM.
>>>> >> >> Presumably the URM will take some time to get fully stable, so
>>>> this can
>>>> >> >> default to false for a while, and we can flip the default to true
>>>> when
>>>> >> we
>>>> >> >> feel ready.
>>>> >> >>
>>>> >> >> The URM is somewhat under-described here. I think we need a few
>>>> >> >> configurations here for it. For example, we need a configuration to
>>>> >> specify
>>>> >> >> how long it should wait for a broker to respond to its RPCs before
>>>> >> moving
>>>> >> >> on. We also need to understand how the URM interacts with
>>>> >> >> unclean.leader.election.enable=always. I assume that with "always"
>>>> we
>>>> >> will
>>>> >> >> just unconditionally use the URM rather than choosing randomly.
>>>> But this
>>>> >> >> should be spelled out in the KIP.
>>>> >> >>
>>>> >> >> >
>>>> >> >> > DescribeTopicRequest
>>>> >> >> >
>>>> >> >> >    1.
>>>> >> >> >    Yes, the plan is to replace the MetadataRequest with the
>>>> >> >> >    DescribeTopicRequest for the admin clients. Will check the
>>>> details.
>>>> >> >>
>>>> >> >> Sounds good. But as I said, you need to specify how AdminClient
>>>> >> interacts
>>>> >> >> with the new request. This will involve adding some fields to
>>>> >> >> TopicDescription.java. And you need to specify the changes to the
>>>> >> >> kafka-topics.sh command line tool. Otherwise we cannot use the
>>>> tool to
>>>> >> see
>>>> >> >> the new information.
>>>> >> >>
>>>> >> >> The new requests, DescribeTopicRequest and
>>>> GetReplicaLogInfoRequest,
>>>> >> need
>>>> >> >> to have limits placed on them so that their size can't be
>>>> infinite. We
>>>> >> >> don't want to propagate the current problems of MetadataRequest,
>>>> where
>>>> >> >> clients can request massive responses that can mess up the JVM when
>>>> >> handled.
>>>> >> >>
>>>> >> >> Adding limits is simple for GetReplicaLogInfoRequest -- we can
>>>> just say
>>>> >> >> that only 2000 partitions at a time can be requested. For
>>>> >> >> DescribeTopicRequest we can probably just limit to 20 topics or
>>>> >> something
>>>> >> >> like that, to avoid the complexity of doing pagination in this KIP.
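A sketch of such limit checks (illustrative Python; the limit values 2000 partitions and 20 topics come from the email, the code itself is hypothetical):

```python
# Illustrative request-size validation for the two new RPCs discussed above.
# The limits are the ones proposed in the email; the code is not Kafka's.

MAX_PARTITIONS_PER_REQUEST = 2000   # GetReplicaLogInfoRequest
MAX_TOPICS_PER_REQUEST = 20         # DescribeTopicRequest

def validate_get_replica_log_info(partitions):
    if len(partitions) > MAX_PARTITIONS_PER_REQUEST:
        raise ValueError(
            f"too many partitions: {len(partitions)} > "
            f"{MAX_PARTITIONS_PER_REQUEST}")

def validate_describe_topic(topics):
    if len(topics) > MAX_TOPICS_PER_REQUEST:
        raise ValueError(
            f"too many topics: {len(topics)} > {MAX_TOPICS_PER_REQUEST}")
```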
>>>> >> >>
>>>> >> >> >    2.
>>>> >> >> >    I can let the broker load the ELR info so that they can serve
>>>> the
>>>> >> >> >    DescribeTopicRequest as well.
>>>> >> >>
>>>> >> >> Yes, it's fine to add to MetadataCache. In fact, you'll be loading
>>>> it
>>>> >> >> anyway once it's added to PartitionImage.
>>>> >> >>
>>>> >> >> >    3.
>>>> >> >> >    Yeah, it does not make sense to have the topic id if
>>>> >> >> >    DescribeTopicRequest is only used by the admin client.
>>>> >> >>
>>>> >> >> OK. That makes things simpler. We can always create a new API later
>>>> >> >> (hopefully not in this KIP!) to query by topic ID.
>>>> >> >>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Metrics
>>>> >> >> >
>>>> >> >> > As for overall cluster health metrics, I think under-min-ISR is
>>>> still
>>>> >> a
>>>> >> >> > useful one. ELR is more like a safety belt. When the ELR is
>>>> used, the
>>>> >> >> > cluster availability has already been impacted.
>>>> >> >> >
>>>> >> >> > Maybe we can have a metric to count the partitions that sum(ISR,
>>>> ELR)
>>>> >> <
>>>> >> >> min
>>>> >> >> > ISR. What do you think?
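The proposed metric could be computed roughly like this (illustrative Python only, not controller code):

```python
# Count partitions where the combined ISR + ELR size has fallen below
# min.insync.replicas -- i.e., partitions that can no longer tolerate
# another unclean failure. Hypothetical sketch.

def count_below_min_isr(partitions, min_isr):
    """partitions: iterable of (isr, elr) replica-id lists."""
    return sum(1 for isr, elr in partitions
               if len(isr) + len(elr) < min_isr)
```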
>>>> >> >>
>>>> >> >> How about:
>>>> >> >>
>>>> >> >> A. A metric for the total number of under-min-ISR partitions? We
>>>> >> >> don't have that in Apache Kafka at the moment.
>>>> >> >>
>>>> >> >> B. a metric for the number of unclean leader elections we did (for
>>>> >> >> simplicity, it can reset to 0 on controller restart: we expect
>>>> people to
>>>> >> >> monitor the change over time anyway)
>>>> >> >>
>>>> >> >> best,
>>>> >> >> Colin
>>>> >> >>
>>>> >> >>
>>>> >> >> >
>>>> >> >> > Yeah, for the ongoing unclean recoveries, the controller can
>>>> keep an
>>>> >> >> > accurate count through failover because partition registration
>>>> can
>>>> >> >> indicate
>>>> >> >> > whether a recovery is needed. However, for the happened ones,
>>>> unless
>>>> >> we
>>>> >> >> > want to persist the number somewhere, we can only figure it out
>>>> from
>>>> >> the
>>>> >> >> > log.
>>>> >> >> >
>>>> >> >> > On Tue, Sep 12, 2023 at 3:16 PM Colin McCabe <cmcc...@apache.org
>>>> >
>>>> >> wrote:
>>>> >> >> >
>>>> >> >> >> Also, we should have metrics that show what is going on with
>>>> regard
>>>> >> to
>>>> >> >> the
>>>> >> >> >> eligible replica set. I'm not sure exactly what to suggest, but
>>>> >> >> >> something that could identify when things are going wrong in the
>>>> >> >> >> cluster.
>>>> >> >> >>
>>>> >> >> >> For example, maybe a metric for partitions containing replicas
>>>> that
>>>> >> are
>>>> >> >> >> ineligible to be leader? That would show a spike when a broker
>>>> had an
>>>> >> >> >> unclean restart.
>>>> >> >> >>
>>>> >> >> >> Ideally, we'd also have a metric that indicates when an unclean
>>>> >> >> >> leader election or a recovery happened. It's a bit tricky because
>>>> >> >> >> the simple thing, of tracking it per controller, may be a bit
>>>> >> >> >> confusing during failovers.
>>>> >> >> >>
>>>> >> >> >> best,
>>>> >> >> >> Colin
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> On Tue, Sep 12, 2023, at 14:25, Colin McCabe wrote:
>>>> >> >> >> > Hi Calvin,
>>>> >> >> >> >
>>>> >> >> >> > Thanks for the KIP. I think this is a great improvement.
>>>> >> >> >> >
>>>> >> >> >> >> Additional High Watermark advance requirement
>>>> >> >> >> >
>>>> >> >> >> > Typo: change "advance" to "advancement"
>>>> >> >> >> >
>>>> >> >> >> >> A bit recap of some key concepts.
>>>> >> >> >> >
>>>> >> >> >> > Typo: change "bit" to "quick"
>>>> >> >> >> >
>>>> >> >> >> >> Ack=1/all produce request. It defines when the Kafka server
>>>> should
>>>> >> >> >> respond to the produce request
>>>> >> >> >> >
>>>> >> >> >> > I think this section would be clearer if we talked about the
>>>> new
>>>> >> high
>>>> >> >> >> > watermark advancement requirement first, and THEN talked
>>>> about its
>>>> >> >> >> > impact on acks=0, acks=1, and     acks=all.  acks=all is of
>>>> course
>>>> >> the
>>>> >> >> >> > main case we care about here, so it would be good to lead with
>>>> >> that,
>>>> >> >> >> > rather than delving into the technicalities of acks=0/1 first.
>>>> >> >> >> >
>>>> >> >> >> >> Unclean recovery
>>>> >> >> >> >
>>>> >> >> >> > So, here you are introducing a new configuration,
>>>> >> >> >> > unclean.recovery.strategy. The difficult thing here is that
>>>> there
>>>> >> is a
>>>> >> >> >> > lot of overlap with unclean.leader.election.enable. So we
>>>> have 3
>>>> >> >> >> > different settings for unclean.recovery.strategy, plus 2
>>>> different
>>>> >> >> >> > settings for unclean.leader.election.enable, giving a cross
>>>> >> product of
>>>> >> >> >> > 6 different options. The following "unclean recovery manager"
>>>> >> section
>>>> >> >> >> > only applies to one of those 6 different possibilities (I
>>>> >> >> >> > think?)
>>>> >> >> >> >
>>>> >> >> >> > I simply don't think we need so many different election types.
>>>> >> Really
>>>> >> >> >> > the use-cases we need are people who want NO unclean
>>>> elections,
>>>> >> people
>>>> >> >> >> > who want "the reasonable thing", and people who want
>>>> >> >> >> > availability at all costs.
>>>> >> >> >> >
>>>> >> >> >> > Overall, I feel like the first half of the KIP is about the
>>>> ELR,
>>>> >> and
>>>> >> >> >> > the second half is about reworking unclean leader election. It
>>>> >> might
>>>> >> >> be
>>>> >> >> >> > better to move that second half to a separate KIP so that we
>>>> can
>>>> >> >> figure
>>>> >> >> >> > it out fully. It should be fine to punt this until later and
>>>> just
>>>> >> have
>>>> >> >> >> > the current behavior on empty ELR be waiting for the last
>>>> known
>>>> >> leader
>>>> >> >> >> > to return. After all, that's what we do today.
>>>> >> >> >> >
>>>> >> >> >> >> DescribeTopicRequest
>>>> >> >> >> >
>>>> >> >> >> > Is the intention for AdminClient to use this RPC for
>>>> >> >> >> > Admin#describeTopics ? If so, we need to describe all of the
>>>> >> changes
>>>> >> >> to
>>>> >> >> >> > the admin client API, as well as changes to command-line
>>>> tools like
>>>> >> >> >> > kafka-topics.sh (if there are any). For example, you will
>>>> probably
>>>> >> >> need
>>>> >> >> >> > changes to TopicDescription.java. You will also need to
>>>> provide
>>>> >> all of
>>>> >> >> >> > the things that admin client needs -- for example,
>>>> >> >> >> > TopicAuthorizedOperations.
>>>> >> >> >> >
>>>> >> >> >> > I also don't think the controller should serve this request.
>>>> We
>>>> >> want
>>>> >> >> to
>>>> >> >> >> > minimize load on the controller. Just like with the other
>>>> metadata
>>>> >> >> >> > requests like MetadataRequest, this should be served by
>>>> brokers.
>>>> >> >> >> >
>>>> >> >> >> > It's a bit confusing why both topic ID and topic name are
>>>> provided
>>>> >> to
>>>> >> >> >> > this API. Is the intention that callers should set one but
>>>> not the
>>>> >> >> >> > other? Or both? This needs to be clarified. Also, if we do
>>>> want to
>>>> >> >> >> > support lookups by UUID, that is another thing that needs to
>>>> be
>>>> >> added
>>>> >> >> >> > to adminclient.
>>>> >> >> >> >
>>>> >> >> >> > In general, I feel like this should also probably be its own
>>>> KIP
>>>> >> since
>>>> >> >> >> > it's fairly complex
>>>> >> >> >> >
>>>> >> >> >> > best,
>>>> >> >> >> > Colin
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> > On Thu, Aug 10, 2023, at 15:46, Calvin Liu wrote:
>>>> >> >> >> >> Hi everyone,
>>>> >> >> >> >> I'd like to discuss a series of enhancements to the
>>>> >> >> >> >> replication protocol.
>>>> >> >> >> >>
>>>> >> >> >> >> A partition replica can experience local data loss in unclean
>>>> >> >> shutdown
>>>> >> >> >> >> scenarios where unflushed data in the OS page cache is lost
>>>> - such
>>>> >> >> as an
>>>> >> >> >> >> availability zone power outage or a server error. The Kafka
>>>> >> >> replication
>>>> >> >> >> >> protocol is designed to handle these situations by removing
>>>> such
>>>> >> >> >> replicas
>>>> >> >> >> >> from the ISR and only re-adding them once they have caught
>>>> up and
>>>> >> >> >> therefore
>>>> >> >> >> >> recovered any lost data. This prevents replicas that lost an
>>>> >> >> arbitrary
>>>> >> >> >> log
>>>> >> >> >> >> suffix, which included committed data, from being elected
>>>> leader.
>>>> >> >> >> >> However, there is a "last replica standing" state which when
>>>> >> combined
>>>> >> >> >> with
>>>> >> >> >> >> a data loss unclean shutdown event can turn a local data loss
>>>> >> >> scenario
>>>> >> >> >> into
>>>> >> >> >> >> a global data loss scenario, i.e., committed data can be
>>>> removed
>>>> >> from
>>>> >> >> >> all
>>>> >> >> >> >> replicas. When the last replica in the ISR experiences an
>>>> unclean
>>>> >> >> >> shutdown
>>>> >> >> >> >> and loses committed data, it will be reelected leader after
>>>> >> starting
>>>> >> >> up
>>>> >> >> >> >> again, causing rejoining followers to truncate their logs and
>>>> >> thereby
>>>> >> >> >> >> removing the last copies of the committed records which the
>>>> leader
>>>> >> >> lost
>>>> >> >> >> >> initially.
>>>> >> >> >> >>
>>>> >> >> >> >> The new KIP will maximize the protection and provide MinISR-1
>>>> >> >> >> >> tolerance to data-loss unclean shutdown events.
>>>> >> >> >> >>
>>>> >> >> >> >>
>>>> >> >> >>
>>>> >> >>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
>>>> >> >> >>
>>>> >> >>
>>>> >>
>>>>
>>>
