Catching up with this discussion. I was just curious -- have we had other instances where downgrading MV is not supported? I think Kafka typically tries to support downgrades, and I couldn't think of other examples.
Thanks, Justine On Wed, Oct 4, 2023 at 9:40 AM Calvin Liu <ca...@confluent.io.invalid> wrote: > Hi Jun, > 54. Marked the software downgrading is not supported. As the old controller > will not understand the new PartitionRecord and PartitionChangeRecord. > Thanks! > > On Wed, Oct 4, 2023 at 9:12 AM Jun Rao <j...@confluent.io.invalid> wrote: > > > Hi, Calvin, > > > > Thanks for the reply. Just one more comment. > > > > 54. It seems that downgrading MV is supported. Is downgrading the > software > > version supported? It would be useful to document that. > > > > Thanks, > > > > Jun > > > > On Tue, Oct 3, 2023 at 4:55 PM Artem Livshits > > <alivsh...@confluent.io.invalid> wrote: > > > > > Hi Colin, > > > > > > I think in your example "do_unclean_recovery" would need to do > different > > > things depending on the strategy. > > > > > > do_unclean_recovery() { > > > if (unclean.recovery.manager.enabled) { > > > if (strategy == Aggressive) > > > use UncleanRecoveryManager(waitLastKnownERL=false) // just > inspect > > > logs from whoever is available > > > else > > > use UncleanRecoveryManager(waitLastKnownERL=true) // must wait > > for > > > at least last known ELR > > > } else { > > > if (strategy == Aggressive) > > > choose the last known leader if that is available, or a random > > leader > > > if not) > > > else > > > wait for last known leader to get back > > > } > > > } > > > > > > The idea is that the Aggressive strategy would kick in as soon as we > lost > > > the leader and would pick a leader from whoever is available; but the > > > Balanced will only kick in when ELR is empty and will wait for the > > brokers > > > that likely have most data to be available. > > > > > > On Tue, Oct 3, 2023 at 3:04 PM Colin McCabe <cmcc...@apache.org> > wrote: > > > > > > > On Tue, Oct 3, 2023, at 10:49, Jun Rao wrote: > > > > > Hi, Calvin, > > > > > > > > > > Thanks for the update KIP. A few more comments. > > > > > > > > > > 41. Why would a user choose the option to select a random replica > as > > > the > > > > > leader instead of using unclean.recovery.strateg=Aggressive? It > seems > > > > that > > > > > the latter is strictly better? If that's not the case, could we > fold > > > this > > > > > option under unclean.recovery.strategy instead of introducing a > > > separate > > > > > config? > > > > > > > > Hi Jun, > > > > > > > > I thought the flow of control was: > > > > > > > > If there is no leader for the partition { > > > > If (there are unfenced ELR members) { > > > > choose_an_unfenced_ELR_member > > > > } else if (there are fenced ELR members AND strategy=Aggressive) { > > > > do_unclean_recovery > > > > } else if (there are no ELR members AND strategy != None) { > > > > do_unclean_recovery > > > > } else { > > > > do nothing about the missing leader > > > > } > > > > } > > > > > > > > do_unclean_recovery() { > > > > if (unclean.recovery.manager.enabled) { > > > > use UncleanRecoveryManager > > > > } else { > > > > choose the last known leader if that is available, or a random > > leader > > > > if not) > > > > } > > > > } > > > > > > > > However, I think this could be clarified, especially the behavior > when > > > > unclean.recovery.manager.enabled=false. Inuitively the goal for > > > > unclean.recovery.manager.enabled=false is to be "the same as now, > > mostly" > > > > but it's very underspecified in the KIP, I agree. > > > > > > > > > > > > > > 50. ElectLeadersRequest: "If more than 20 topics are included, only > > the > > > > > first 20 will be served. Others will be returned with > > DesiredLeaders." > > > > Hmm, > > > > > not sure that I understand this. ElectLeadersResponse doesn't have > a > > > > > DesiredLeaders field. > > > > > > > > > > 51. GetReplicaLogInfo: "If more than 2000 partitions are included, > > only > > > > the > > > > > first 2000 will be served" Do we return an error for the remaining > > > > > partitions? Actually, should we include an errorCode field at the > > > > partition > > > > > level in GetReplicaLogInfoResponse to cover non-existing partitions > > and > > > > no > > > > > authorization, etc? > > > > > > > > > > 52. The entry should matches => The entry should match > > > > > > > > > > 53. ElectLeadersRequest.DesiredLeaders: Should it be nullable > since a > > > > user > > > > > may not specify DesiredLeaders? > > > > > > > > > > 54. Downgrade: Is that indeed possible? I thought earlier you said > > that > > > > > once the new version of the records are in the metadata log, one > > can't > > > > > downgrade since the old broker doesn't know how to parse the new > > > version > > > > of > > > > > the metadata records? > > > > > > > > > > > > > MetadataVersion downgrade is currently broken but we have fixing it > on > > > our > > > > plate for Kafka 3.7. > > > > > > > > The way downgrade works is that "new features" are dropped, leaving > > only > > > > the old ones. > > > > > > > > > 55. CleanShutdownFile: Should we add a version field for future > > > > extension? > > > > > > > > > > 56. Config changes are public facing. Could we have a separate > > section > > > to > > > > > document all the config changes? > > > > > > > > +1. A separate section for this would be good. > > > > > > > > best, > > > > Colin > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Mon, Sep 25, 2023 at 4:29 PM Calvin Liu > > <ca...@confluent.io.invalid > > > > > > > > > wrote: > > > > > > > > > >> Hi Jun > > > > >> Thanks for the comments. > > > > >> > > > > >> 40. If we change to None, it is not guaranteed for no data loss. > For > > > > users > > > > >> who are not able to validate the data with external resources, > > manual > > > > >> intervention does not give a better result but a loss of > > availability. > > > > So > > > > >> practically speaking, the Balance mode would be a better default > > > value. > > > > >> > > > > >> 41. No, it represents how we want to do the unclean leader > election. > > > If > > > > it > > > > >> is false, the unclean leader election will be the old random way. > > > > >> Otherwise, the unclean recovery will be used. > > > > >> > > > > >> 42. Good catch. Updated. > > > > >> > > > > >> 43. Only the first 20 topics will be served. Others will be > returned > > > > with > > > > >> InvalidRequestError > > > > >> > > > > >> 44. The order matters. The desired leader entries match with the > > topic > > > > >> partition list by the index. > > > > >> > > > > >> 45. Thanks! Updated. > > > > >> > > > > >> 46. Good advice! Updated. > > > > >> > > > > >> 47.1, updated the comment. Basically it will elect the replica in > > the > > > > >> desiredLeader field to be the leader > > > > >> > > > > >> 47.2 We can let the admin client do the conversion. Using the > > > > desiredLeader > > > > >> field in the json format seems easier for users. > > > > >> > > > > >> 48. Once the MV version is downgraded, all the ELR related fields > > will > > > > be > > > > >> removed on the next partition change. The controller will also > > ignore > > > > the > > > > >> ELR fields. Updated the KIP. > > > > >> > > > > >> 49. Yes, it would be deprecated/removed. > > > > >> > > > > >> > > > > >> On Mon, Sep 25, 2023 at 3:49 PM Jun Rao <j...@confluent.io.invalid > > > > > > wrote: > > > > >> > > > > >> > Hi, Calvin, > > > > >> > > > > > >> > Thanks for the updated KIP. Made another pass. A few more > comments > > > > below. > > > > >> > > > > > >> > 40. unclean.leader.election.enable.false -> > > > > >> > unclean.recovery.strategy.Balanced: The Balanced mode could > still > > > > lead to > > > > >> > data loss. So, I am wondering if > > > unclean.leader.election.enable.false > > > > >> > should map to None? > > > > >> > > > > > >> > 41. unclean.recovery.manager.enabled: I am not sure why we > > introduce > > > > this > > > > >> > additional config. Is it the same as > > unclean.recovery.strategy=None? > > > > >> > > > > > >> > 42. DescribeTopicResponse.TopicAuthorizedOperations: Should this > > be > > > at > > > > >> the > > > > >> > topic level? > > > > >> > > > > > >> > 43. "Limit: 20 topics max per request": Could we describe what > > > > happens if > > > > >> > the request includes more than 20 topics? > > > > >> > > > > > >> > 44. ElectLeadersRequest.DesiredLeaders: Could we describe > whether > > > the > > > > >> > ordering matters? > > > > >> > > > > > >> > 45. GetReplicaLogInfo.TopicPartitions: "about": "The topic > > > partitions > > > > to > > > > >> > elect leaders.": The description in "about" is incorrect. > > > > >> > > > > > >> > 46. GetReplicaLogInfoResponse: Should we nest partitions under > > > > topicId to > > > > >> > be consistent with other types of responses? > > > > >> > > > > > >> > 47. kafka-leader-election.sh: > > > > >> > 47.1 Could we explain DESIGNATION? > > > > >> > 47.2 desiredLeader: Should it be a list to match the field in > > > > >> > ElectLeadersRequest? > > > > >> > > > > > >> > 48. We could add a section on downgrade? > > > > >> > > > > > >> > 49. LastKnownLeader: This seems only needed in the first phase > of > > > > >> > delivering ELR. Will it be removed when the complete KIP is > > > delivered? > > > > >> > > > > > >> > Thanks, > > > > >> > > > > > >> > Jun > > > > >> > > > > > >> > On Tue, Sep 19, 2023 at 1:30 PM Colin McCabe < > cmcc...@apache.org> > > > > wrote: > > > > >> > > > > > >> > > Hi Calvin, > > > > >> > > > > > > >> > > Thanks for the explanations. I like the idea of using none, > > > > balanced, > > > > >> > > aggressive. We also had an offline discussion about why it is > > good > > > > to > > > > >> > use a > > > > >> > > new config key (basically, so that we can deprecate the old > one > > > > which > > > > >> had > > > > >> > > only false/true values in 4.0) With these changes, I am +1. > > > > >> > > > > > > >> > > best, > > > > >> > > Colin > > > > >> > > > > > > >> > > On Mon, Sep 18, 2023, at 15:54, Calvin Liu wrote: > > > > >> > > > Hi Colin, > > > > >> > > > Also, can we deprecate unclean.leader.election.enable in > 4.0? > > > > Before > > > > >> > > that, > > > > >> > > > we can have both the config unclean.recovery.strategy and > > > > >> > > > unclean.leader.election.enable > > > > >> > > > and using the unclean.recovery.Enabled to determine which > > config > > > > to > > > > >> use > > > > >> > > > during the unclean leader election. > > > > >> > > > > > > > >> > > > On Mon, Sep 18, 2023 at 3:51 PM Calvin Liu < > > ca...@confluent.io> > > > > >> wrote: > > > > >> > > > > > > > >> > > >> Hi Colin, > > > > >> > > >> For the unclean.recovery.strategy config name, how about we > > use > > > > the > > > > >> > > >> following > > > > >> > > >> None. It basically means no unclean recovery will be > > performed. > > > > >> > > >> Aggressive. It means availability goes first. Whenever the > > > > partition > > > > >> > > can't > > > > >> > > >> elect a durable replica, the controller will try the > unclean > > > > >> recovery. > > > > >> > > >> Balanced. It is the balance point of the availability > > > > >> > first(Aggressive) > > > > >> > > >> and least availability(None). The controller performs > unclean > > > > >> recovery > > > > >> > > when > > > > >> > > >> both ISR and ELR are empty. > > > > >> > > >> > > > > >> > > >> > > > > >> > > >> On Fri, Sep 15, 2023 at 11:42 AM Calvin Liu < > > > ca...@confluent.io> > > > > >> > wrote: > > > > >> > > >> > > > > >> > > >>> Hi Colin, > > > > >> > > >>> > > > > >> > > >>> > So, the proposal is that if someone sets > > > > >> > > "unclean.leader.election.enable > > > > >> > > >>> = true"... > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> The idea is to use one of the > unclean.leader.election.enable > > > and > > > > >> > > >>> unclean.recovery.strategy based on the > > > > unclean.recovery.Enabled. A > > > > >> > > possible > > > > >> > > >>> version can be > > > > >> > > >>> > > > > >> > > >>> If unclean.recovery.Enabled: > > > > >> > > >>> > > > > >> > > >>> { > > > > >> > > >>> > > > > >> > > >>> Check unclean.recovery.strategy. If set, use it. > Otherwise, > > > > check > > > > >> > > >>> unclean.leader.election.enable and translate it to > > > > >> > > >>> unclean.recovery.strategy. > > > > >> > > >>> > > > > >> > > >>> } else { > > > > >> > > >>> > > > > >> > > >>> Use unclean.leader.election.enable > > > > >> > > >>> > > > > >> > > >>> } > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> —-------- > > > > >> > > >>> > > > > >> > > >>> >The configuration key should be > > > > >> "unclean.recovery.manager.enabled", > > > > >> > > >>> right? > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> I think we have two ways of choosing a leader uncleanly, > > > unclean > > > > >> > leader > > > > >> > > >>> election and unclean recovery(log inspection) and we try > to > > > > switch > > > > >> > > between > > > > >> > > >>> them. > > > > >> > > >>> > > > > >> > > >>> Do you mean we want to develop two ways of performing the > > > > unclean > > > > >> > > >>> recovery and one of them is using “unclean recovery > > manager”? > > > I > > > > >> guess > > > > >> > > we > > > > >> > > >>> haven’t discussed the second way. > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> —------- > > > > >> > > >>> > > > > >> > > >>> >How do these 4 levels of overrides interact with your new > > > > >> > > >>> configurations? > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> I do notice in the Kraft controller code, the method to > > check > > > > >> whether > > > > >> > > >>> perform unclean leader election is hard coded to false > since > > > > >> > > >>> 2021(uncleanLeaderElectionEnabledForTopic). Isn’t it a > good > > > > chance > > > > >> to > > > > >> > > >>> completely deprecate the unclean.leader.election.enable? > We > > > > don’t > > > > >> > even > > > > >> > > have > > > > >> > > >>> to worry about the config conversion. > > > > >> > > >>> > > > > >> > > >>> On the other hand, whatever the override is, as long as > the > > > > >> > controller > > > > >> > > >>> can have the final effective > unclean.leader.election.enable, > > > the > > > > >> > topic > > > > >> > > >>> level config unclean.recovery.strategy, the cluster level > > > config > > > > >> > > >>> unclean.recovery.Enabled, the controller can calculate the > > > > correct > > > > >> > > methods > > > > >> > > >>> to use right? > > > > >> > > >>> > > > > >> > > >>> > > > > >> > > >>> On Fri, Sep 15, 2023 at 10:02 AM Colin McCabe < > > > > cmcc...@apache.org> > > > > >> > > wrote: > > > > >> > > >>> > > > > >> > > >>>> On Thu, Sep 14, 2023, at 22:23, Calvin Liu wrote: > > > > >> > > >>>> > Hi Colin > > > > >> > > >>>> > 1. I think using the new config name is more clear. > > > > >> > > >>>> > a. The unclean leader election is actually > removed > > > if > > > > >> > unclean > > > > >> > > >>>> > recovery is in use. > > > > >> > > >>>> > b. Using multiple values in > > > > >> unclean.leader.election.enable > > > > >> > is > > > > >> > > >>>> > confusing and it will be more confusing after people > > forget > > > > >> about > > > > >> > > this > > > > >> > > >>>> > discussion. > > > > >> > > >>>> > > > > >> > > >>>> Hi Calvin, > > > > >> > > >>>> > > > > >> > > >>>> So, the proposal is that if someone sets > > > > >> > > "unclean.leader.election.enable > > > > >> > > >>>> = true" but then sets one of your new configurations, the > > > > value of > > > > >> > > >>>> unclean.leader.election.enable is ignored? That seems > less > > > > clear > > > > >> to > > > > >> > > me, not > > > > >> > > >>>> more. Just in general, having multiple configuration keys > > to > > > > >> control > > > > >> > > the > > > > >> > > >>>> same thing confuses users. Basically, they are sitting > at a > > > > giant > > > > >> > > control > > > > >> > > >>>> panel, and some of the levers do nothing. > > > > >> > > >>>> > > > > >> > > >>>> > 2. Sorry I forgot to mention in the response that I did > > add > > > > the > > > > >> > > >>>> > unclean.recovery.Enabled flag. > > > > >> > > >>>> > > > > >> > > >>>> The configuration key should be > > > > >> "unclean.recovery.manager.enabled", > > > > >> > > >>>> right? Becuase we can do "unclean recovery" without the > > > > manager. > > > > >> > > Disabling > > > > >> > > >>>> the manager just means we use a different mechanism for > > > > recovery. > > > > >> > > >>>> > > > > >> > > >>>> > c. Maybe I underestimated the challenge of > > replacing > > > > the > > > > >> > > >>>> config. Any > > > > >> > > >>>> > implementation problems ahead? > > > > >> > > >>>> > > > > >> > > >>>> There are four levels of overrides for > > > > >> > unclean.leader.election.enable. > > > > >> > > >>>> > > > > >> > > >>>> 1. static configuration for node. > > > > >> > > >>>> This goes in the configuration file, typically named > > > > >> > > >>>> server.properties > > > > >> > > >>>> > > > > >> > > >>>> 2. dynamic configuration for node default > > > > >> > > >>>> ConfigResource(type=BROKER, name="") > > > > >> > > >>>> > > > > >> > > >>>> 3. dynamic configuration for node > > > > >> > > >>>> ConfigResource(type=BROKER, name=<controller id>) > > > > >> > > >>>> > > > > >> > > >>>> 4. dynamic configuration for topic > > > > >> > > >>>> ConfigResource(type=TOPIC, name=<topic-name>) > > > > >> > > >>>> > > > > >> > > >>>> How do these 4 levels of overrides interact with your new > > > > >> > > >>>> configurations? If the new configurations dominate over > the > > > old > > > > >> > ones, > > > > >> > > it > > > > >> > > >>>> seems like this will get a lot more confusing to > implement > > > (and > > > > >> also > > > > >> > > to > > > > >> > > >>>> use.) > > > > >> > > >>>> > > > > >> > > >>>> Again, I'd recommend just adding some new values to > > > > >> > > >>>> unclean.leader.election.enable. It's simple and will > > prevent > > > > user > > > > >> > > confusion > > > > >> > > >>>> (as well as developer confusion.) > > > > >> > > >>>> > > > > >> > > >>>> best, > > > > >> > > >>>> Colin > > > > >> > > >>>> > > > > >> > > >>>> > > > > >> > > >>>> > 3. About the admin client, I mentioned 3 changes in the > > > > client. > > > > >> > > >>>> Anything > > > > >> > > >>>> > else I missed in the KIP? > > > > >> > > >>>> > a. The client will switch to using the new RPC > > > instead > > > > of > > > > >> > > >>>> > MetadataRequest for the topics. > > > > >> > > >>>> > b. The TopicPartitionInfo used in > TopicDescription > > > > needs > > > > >> to > > > > >> > > add > > > > >> > > >>>> new > > > > >> > > >>>> > fields related to the ELR. > > > > >> > > >>>> > c. The outputs will add the ELR related fields. > > > > >> > > >>>> > > > > > >> > > >>>> > On Thu, Sep 14, 2023 at 9:19 PM Colin McCabe < > > > > >> cmcc...@apache.org> > > > > >> > > >>>> wrote: > > > > >> > > >>>> > > > > > >> > > >>>> >> Hi Calvin, > > > > >> > > >>>> >> > > > > >> > > >>>> >> Thanks for the changes. > > > > >> > > >>>> >> > > > > >> > > >>>> >> 1. Earlier I commented that creating > > > > >> "unclean.recovery.strategy " > > > > >> > > is > > > > >> > > >>>> not > > > > >> > > >>>> >> necessary, and we can just reuse the existing > > > > >> > > >>>> >> "unclean.leader.election.enable" configuration key. > > Let's > > > > >> discuss > > > > >> > > >>>> that. > > > > >> > > >>>> >> > > > > >> > > >>>> >> 2.I also don't understand why you didn't add a > > > > configuration to > > > > >> > > >>>> enable or > > > > >> > > >>>> >> disable the Unclean Recovery Manager. This seems like > a > > > very > > > > >> > simple > > > > >> > > >>>> way to > > > > >> > > >>>> >> handle the staging issue which we discussed. The URM > can > > > > just > > > > >> be > > > > >> > > >>>> turned off > > > > >> > > >>>> >> until it is production ready. Let's discuss this. > > > > >> > > >>>> >> > > > > >> > > >>>> >> 3. You still need to describe the changes to > AdminClient > > > > that > > > > >> are > > > > >> > > >>>> needed > > > > >> > > >>>> >> to use DescribeTopicRequest. > > > > >> > > >>>> >> > > > > >> > > >>>> >> Keep at it. It's looking better. :) > > > > >> > > >>>> >> > > > > >> > > >>>> >> best, > > > > >> > > >>>> >> Colin > > > > >> > > >>>> >> > > > > >> > > >>>> >> > > > > >> > > >>>> >> On Thu, Sep 14, 2023, at 11:03, Calvin Liu wrote: > > > > >> > > >>>> >> > Hi Colin > > > > >> > > >>>> >> > Thanks for the comments! > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > I did the following changes > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > 1. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > Simplified the API spec section to only include > the > > > > diff. > > > > >> > > >>>> >> > 2. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > Reordered the HWM requirement section. > > > > >> > > >>>> >> > 3. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > Removed the URM implementation details to keep > the > > > > >> necessary > > > > >> > > >>>> >> > characteristics to perform the unclean recovery. > > > > >> > > >>>> >> > 1. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > When to perform the unclean recovery > > > > >> > > >>>> >> > 2. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > Under different config, how the unclean > recovery > > > > finds > > > > >> > the > > > > >> > > >>>> leader. > > > > >> > > >>>> >> > 3. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > How the config unclean.leader.election.enable > > and > > > > >> > > >>>> >> > unclean.recovery.strategy are converted when > > users > > > > >> > > >>>> enable/disable > > > > >> > > >>>> >> the > > > > >> > > >>>> >> > unclean recovery. > > > > >> > > >>>> >> > 4. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > More details about how we change admin client. > > > > >> > > >>>> >> > 5. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > API limits on the GetReplicaLogInfoRequest and > > > > >> > > >>>> DescribeTopicRequest. > > > > >> > > >>>> >> > 6. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > Two metrics added > > > > >> > > >>>> >> > 1. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > > > > > Kafka.controller.global_under_min_isr_partition_count > > > > >> > > >>>> >> > 2. > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > > kafka.controller.unclean_recovery_finished_count > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > > > > > >> > > >>>> >> > On Wed, Sep 13, 2023 at 10:46 AM Colin McCabe < > > > > >> > > cmcc...@apache.org> > > > > >> > > >>>> >> wrote: > > > > >> > > >>>> >> > > > > > >> > > >>>> >> >> On Tue, Sep 12, 2023, at 17:21, Calvin Liu wrote: > > > > >> > > >>>> >> >> > Hi Colin > > > > >> > > >>>> >> >> > Thanks for the comments! > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Hi Calvin, > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Thanks again for the KIP. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> One meta-comment: it's usually better to just do a > > diff > > > > on a > > > > >> > > >>>> message > > > > >> > > >>>> >> spec > > > > >> > > >>>> >> >> file or java file if you're including changes to it > > in > > > > the > > > > >> > KIP. > > > > >> > > >>>> This is > > > > >> > > >>>> >> >> easier to read than looking for "new fields begin" > > etc. > > > > in > > > > >> the > > > > >> > > >>>> text, and > > > > >> > > >>>> >> >> gracefully handles the case where existing fields > > were > > > > >> > changed. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > Rewrite the Additional High Watermark advancement > > > > >> > requirement > > > > >> > > >>>> >> >> > There was feedback on this section that some > > readers > > > > may > > > > >> not > > > > >> > > be > > > > >> > > >>>> >> familiar > > > > >> > > >>>> >> >> > with HWM and Ack=0,1,all requests. This can help > > them > > > > >> > > understand > > > > >> > > >>>> the > > > > >> > > >>>> >> >> > proposal. I will rewrite this part for more > > > > readability. > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> To be clear, I wasn't suggesting dropping either > > > > section. I > > > > >> > > agree > > > > >> > > >>>> that > > > > >> > > >>>> >> >> they add useful background. I was just suggesting > > that > > > we > > > > >> > should > > > > >> > > >>>> discuss > > > > >> > > >>>> >> >> the "acks" setting AFTER discussing the new high > > > > watermark > > > > >> > > >>>> advancement > > > > >> > > >>>> >> >> conditions. We also should discuss acks=0. While it > > > isn't > > > > >> > > >>>> conceptually > > > > >> > > >>>> >> much > > > > >> > > >>>> >> >> different than acks=1 here, its omission from this > > > > section > > > > >> is > > > > >> > > >>>> confusing. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > Unclean recovery > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > The plan is to replace the > > > > unclean.leader.election.enable > > > > >> > with > > > > >> > > >>>> >> >> > unclean.recovery.strategy. If the Unclean > Recovery > > is > > > > >> > enabled > > > > >> > > >>>> then it > > > > >> > > >>>> >> >> deals > > > > >> > > >>>> >> >> > with the three options in the > > > > unclean.recovery.strategy. > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > Let’s refine the Unclean Recovery. We have > already > > > > taken a > > > > >> > > lot of > > > > >> > > >>>> >> >> > suggestions and I hope to enhance the durability > of > > > > Kafka > > > > >> to > > > > >> > > the > > > > >> > > >>>> next > > > > >> > > >>>> >> >> level > > > > >> > > >>>> >> >> > with this KIP. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> I am OK with doing the unclean leader recovery > > > > improvements > > > > >> in > > > > >> > > >>>> this KIP. > > > > >> > > >>>> >> >> However, I think we need to really work on the > > > > configuration > > > > >> > > >>>> settings. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Configuration overrides are often quite messy. For > > > > example, > > > > >> > the > > > > >> > > >>>> cases > > > > >> > > >>>> >> >> where we have log.roll.hours and > log.roll.segment.ms > > , > > > > the > > > > >> > user > > > > >> > > >>>> has to > > > > >> > > >>>> >> >> remember which one takes precedence, and it is not > > > > obvious. > > > > >> > So, > > > > >> > > >>>> rather > > > > >> > > >>>> >> than > > > > >> > > >>>> >> >> creating a new configuration, why not add > additional > > > > values > > > > >> to > > > > >> > > >>>> >> >> "unclean.leader.election.enable"? I think this will > > be > > > > >> simpler > > > > >> > > for > > > > >> > > >>>> >> people > > > > >> > > >>>> >> >> to understand, and simpler in the code as well. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> What if we continued to use > > > > "unclean.leader.election.enable" > > > > >> > but > > > > >> > > >>>> >> extended > > > > >> > > >>>> >> >> it so that it took a string? Then the string could > > have > > > > >> these > > > > >> > > >>>> values: > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> never > > > > >> > > >>>> >> >> never automatically do an unclean leader > election > > > > under > > > > >> > any > > > > >> > > >>>> >> conditions > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> false / default > > > > >> > > >>>> >> >> only do an unclean leader election if there may > > be > > > > >> > possible > > > > >> > > >>>> data > > > > >> > > >>>> >> loss > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> true / always > > > > >> > > >>>> >> >> always do an unclean leader election if we > can't > > > > >> > immediately > > > > >> > > >>>> elect a > > > > >> > > >>>> >> >> leader > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> It's a bit awkward that false maps to default > rather > > > > than to > > > > >> > > >>>> never. But > > > > >> > > >>>> >> >> this awkwardness exists if we use two different > > > > >> configuration > > > > >> > > keys > > > > >> > > >>>> as > > > > >> > > >>>> >> well. > > > > >> > > >>>> >> >> The reason for the awkwardness is that we simply > > don't > > > > want > > > > >> > most > > > > >> > > >>>> of the > > > > >> > > >>>> >> >> people currently setting > > > > >> unclean.leader.election.enable=false > > > > >> > to > > > > >> > > >>>> get the > > > > >> > > >>>> >> >> "never" behavior. We have to bite that bullet. > Better > > > to > > > > be > > > > >> > > clear > > > > >> > > >>>> and > > > > >> > > >>>> >> >> explicit than hide it. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Another thing that's a bit awkward is having two > > > > different > > > > >> > ways > > > > >> > > to > > > > >> > > >>>> do > > > > >> > > >>>> >> >> unclean leader election specified in the KIP. You > > > > descirbe > > > > >> two > > > > >> > > >>>> methods: > > > > >> > > >>>> >> the > > > > >> > > >>>> >> >> simple "choose the last leader" method, and the > > > "unclean > > > > >> > > recovery > > > > >> > > >>>> >> manager" > > > > >> > > >>>> >> >> method. I understand why you did it this way -- > > "choose > > > > the > > > > >> > last > > > > >> > > >>>> >> leader" is > > > > >> > > >>>> >> >> simple, and will help us deliver an implementation > > > > quickly, > > > > >> > > while > > > > >> > > >>>> the > > > > >> > > >>>> >> URM > > > > >> > > >>>> >> >> is preferable in the long term. My suggestion here > is > > > to > > > > >> > > separate > > > > >> > > >>>> the > > > > >> > > >>>> >> >> decision of HOW to do unclean leader election from > > the > > > > >> > decision > > > > >> > > of > > > > >> > > >>>> WHEN > > > > >> > > >>>> >> to > > > > >> > > >>>> >> >> do it. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> So in other words, have > > > "unclean.leader.election.enable" > > > > >> > specify > > > > >> > > >>>> when we > > > > >> > > >>>> >> >> do unclean leader election, and have a new > > > configuration > > > > >> like > > > > >> > > >>>> >> >> "unclean.recovery.manager.enable" to determine if > we > > > use > > > > the > > > > >> > > URM. > > > > >> > > >>>> >> >> Presumably the URM will take some time to get fully > > > > stable, > > > > >> so > > > > >> > > >>>> this can > > > > >> > > >>>> >> >> default to false for a while, and we can flip the > > > > default to > > > > >> > > true > > > > >> > > >>>> when > > > > >> > > >>>> >> we > > > > >> > > >>>> >> >> feel ready. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> The URM is somewhat under-described here. I think > we > > > > need a > > > > >> > few > > > > >> > > >>>> >> >> configurations here for it. For example, we need a > > > > >> > > configuration to > > > > >> > > >>>> >> specify > > > > >> > > >>>> >> >> how long it should wait for a broker to respond to > > its > > > > RPCs > > > > >> > > before > > > > >> > > >>>> >> moving > > > > >> > > >>>> >> >> on. We also need to understand how the URM > interacts > > > with > > > > >> > > >>>> >> >> unclean.leader.election.enable=always. I assume > that > > > with > > > > >> > > "always" > > > > >> > > >>>> we > > > > >> > > >>>> >> will > > > > >> > > >>>> >> >> just unconditionally use the URM rather than > choosing > > > > >> > randomly. > > > > >> > > >>>> But this > > > > >> > > >>>> >> >> should be spelled out in the KIP. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > DescribeTopicRequest > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > 1. > > > > >> > > >>>> >> >> > Yes, the plan is to replace the > MetadataRequest > > > with > > > > >> the > > > > >> > > >>>> >> >> > DescribeTopicRequest for the admin clients. > Will > > > > check > > > > >> > the > > > > >> > > >>>> details. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Sounds good. But as I said, you need to specify how > > > > >> > AdminClient > > > > >> > > >>>> >> interacts > > > > >> > > >>>> >> >> with the new request. This will involve adding some > > > > fields > > > > >> to > > > > >> > > >>>> >> >> TopicDescription.java. And you need to specify the > > > > changes > > > > >> to > > > > >> > > the > > > > >> > > >>>> >> >> kafka-topics.sh command line tool. Otherwise we > > cannot > > > > use > > > > >> the > > > > >> > > >>>> tool to > > > > >> > > >>>> >> see > > > > >> > > >>>> >> >> the new information. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> The new requests, DescribeTopicRequest and > > > > >> > > >>>> GetReplicaLogInfoRequest, > > > > >> > > >>>> >> need > > > > >> > > >>>> >> >> to have limits placed on them so that their size > > can't > > > be > > > > >> > > >>>> infinite. We > > > > >> > > >>>> >> >> don't want to propagate the current problems of > > > > >> > MetadataRequest, > > > > >> > > >>>> where > > > > >> > > >>>> >> >> clients can request massive responses that can mess > > up > > > > the > > > > >> JVM > > > > >> > > when > > > > >> > > >>>> >> handled. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Adding limits is simple for > GetReplicaLogInfoRequest > > -- > > > > we > > > > >> can > > > > >> > > >>>> just say > > > > >> > > >>>> >> >> that only 2000 partitions at a time can be > requested. > > > For > > > > >> > > >>>> >> >> DescribeTopicRequest we can probably just limit to > 20 > > > > topics > > > > >> > or > > > > >> > > >>>> >> something > > > > >> > > >>>> >> >> like that, to avoid the complexity of doing > > pagination > > > in > > > > >> this > > > > >> > > KIP. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > 2. > > > > >> > > >>>> >> >> > I can let the broker load the ELR info so that > > > they > > > > can > > > > >> > > serve > > > > >> > > >>>> the > > > > >> > > >>>> >> >> > DescribeTopicRequest as well. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> Yes, it's fine to add to MetadataCache. In fact, > > you'll > > > > be > > > > >> > > loading > > > > >> > > >>>> it > > > > >> > > >>>> >> >> anyway once it's added to PartitionImage. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > 3. > > > > >> > > >>>> >> >> > Yeah, it does not make sense to have the topic > > id > > > if > > > > >> > > >>>> >> >> > DescribeTopicRequest is only used by the admin > > > > client. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> OK. That makes things simpler. We can always > create a > > > new > > > > >> API > > > > >> > > later > > > > >> > > >>>> >> >> (hopefully not in this KIP!) to query by topic ID. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > Metrics > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > As for overall cluster health metrics, I think > > > > >> under-min-ISR > > > > >> > > is > > > > >> > > >>>> still > > > > >> > > >>>> >> a > > > > >> > > >>>> >> >> > useful one. ELR is more like a safety belt. When > > the > > > > ELR > > > > >> is > > > > >> > > >>>> used, the > > > > >> > > >>>> >> >> > cluster availability has already been impacted. > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > Maybe we can have a metric to count the > partitions > > > that > > > > >> > > sum(ISR, > > > > >> > > >>>> ELR) > > > > >> > > >>>> >> < > > > > >> > > >>>> >> >> min > > > > >> > > >>>> >> >> > ISR. What do you think? > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> How about: > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> A. a metric for the totoal number of under-min-isr > > > > >> > partitions? > > > > >> > > We > > > > >> > > >>>> don't > > > > >> > > >>>> >> >> have that in Apache Kafka at the moment. > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> B. a metric for the number of unclean leader > > elections > > > we > > > > >> did > > > > >> > > (for > > > > >> > > >>>> >> >> simplicity, it can reset to 0 on controller > restart: > > we > > > > >> expect > > > > >> > > >>>> people to > > > > >> > > >>>> >> >> monitor the change over time anyway) > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> best, > > > > >> > > >>>> >> >> Colin > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > Yeah, for the ongoing unclean recoveries, the > > > > controller > > > > >> can > > > > >> > > >>>> keep an > > > > >> > > >>>> >> >> > accurate count through failover because partition > > > > >> > registration > > > > >> > > >>>> can > > > > >> > > >>>> >> >> indicate > > > > >> > > >>>> >> >> > whether a recovery is needed. However, for the > > > happened > > > > >> > ones, > > > > >> > > >>>> unless > > > > >> > > >>>> >> we > > > > >> > > >>>> >> >> > want to persist the number somewhere, we can only > > > > figure > > > > >> it > > > > >> > > out > > > > >> > > >>>> from > > > > >> > > >>>> >> the > > > > >> > > >>>> >> >> > log. > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> > On Tue, Sep 12, 2023 at 3:16 PM Colin McCabe < > > > > >> > > cmcc...@apache.org > > > > >> > > >>>> > > > > > >> > > >>>> >> wrote: > > > > >> > > >>>> >> >> > > > > > >> > > >>>> >> >> >> Also, we should have metrics that show what is > > going > > > > on > > > > >> > with > > > > >> > > >>>> regard > > > > >> > > >>>> >> to > > > > >> > > >>>> >> >> the > > > > >> > > >>>> >> >> >> eligible replica set. I'm not sure exactly what > to > > > > >> suggest, > > > > >> > > but > > > > >> > > >>>> >> >> something > > > > >> > > >>>> >> >> >> that could identify when things are going wrong > in > > > the > > > > >> > > clsuter. > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> >> For example, maybe a metric for partitions > > > containing > > > > >> > > replicas > > > > >> > > >>>> that > > > > >> > > >>>> >> are > > > > >> > > >>>> >> >> >> ineligible to be leader? That would show a spike > > > when > > > > a > > > > >> > > broker > > > > >> > > >>>> had an > > > > >> > > >>>> >> >> >> unclean restart. > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> >> Ideally, we'd also have a metric that indicates > > when > > > > an > > > > >> > > unclear > > > > >> > > >>>> >> leader > > > > >> > > >>>> >> >> >> election or a recovery happened. It's a bit > tricky > > > > >> because > > > > >> > > the > > > > >> > > >>>> simple > > > > >> > > >>>> >> >> >> thing, of tracking it per controller, may be a > bit > > > > >> > confusing > > > > >> > > >>>> during > > > > >> > > >>>> >> >> >> failovers. > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> >> best, > > > > >> > > >>>> >> >> >> Colin > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> >> On Tue, Sep 12, 2023, at 14:25, Colin McCabe > > wrote: > > > > >> > > >>>> >> >> >> > Hi Calvin, > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > Thanks for the KIP. I think this is a great > > > > >> improvement. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> >> Additional High Watermark advance requirement > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > Typo: change "advance" to "advancement" > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> >> A bit recap of some key concepts. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > Typo: change "bit" to "quick" > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> >> Ack=1/all produce request. It defines when > the > > > > Kafka > > > > >> > > server > > > > >> > > >>>> should > > > > >> > > >>>> >> >> >> respond to the produce request > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > I think this section would be clearer if we > > talked > > > > >> about > > > > >> > > the > > > > >> > > >>>> new > > > > >> > > >>>> >> high > > > > >> > > >>>> >> >> >> > watermark advancement requirement first, and > > THEN > > > > >> talked > > > > >> > > >>>> about its > > > > >> > > >>>> >> >> >> > impact on acks=0, acks=1, and acks=all. > > > > acks=all > > > > >> is > > > > >> > of > > > > >> > > >>>> course > > > > >> > > >>>> >> the > > > > >> > > >>>> >> >> >> > main case we care about here, so it would be > > good > > > to > > > > >> lead > > > > >> > > with > > > > >> > > >>>> >> that, > > > > >> > > >>>> >> >> >> > rather than delving into the technicalities of > > > > acks=0/1 > > > > >> > > first. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> >> Unclean recovery > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > So, here you are introducing a new > > configuration, > > > > >> > > >>>> >> >> >> > unclean.recovery.strategy. The difficult thing > > > here > > > > is > > > > >> > that > > > > >> > > >>>> there > > > > >> > > >>>> >> is a > > > > >> > > >>>> >> >> >> > lot of overlap with > > > unclean.leader.election.enable. > > > > So > > > > >> we > > > > >> > > >>>> have 3 > > > > >> > > >>>> >> >> >> > different settings for > > unclean.recovery.strategy, > > > > plus > > > > >> 2 > > > > >> > > >>>> different > > > > >> > > >>>> >> >> >> > settings for unclean.leader.election.enable, > > > giving > > > > a > > > > >> > cross > > > > >> > > >>>> >> product of > > > > >> > > >>>> >> >> >> > 6 different options. The following "unclean > > > recovery > > > > >> > > manager" > > > > >> > > >>>> >> section > > > > >> > > >>>> >> >> >> > only applies to one fo those 6 different > > > > possibilities > > > > >> (I > > > > >> > > >>>> think?) > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > I simply don't think we need so many different > > > > election > > > > >> > > types. > > > > >> > > >>>> >> Really > > > > >> > > >>>> >> >> >> > the use-cases we need are people who want NO > > > unclean > > > > >> > > >>>> elections, > > > > >> > > >>>> >> people > > > > >> > > >>>> >> >> >> > who want "the reasonable thing" and people who > > > want > > > > >> > > >>>> avaialbility at > > > > >> > > >>>> >> >> all > > > > >> > > >>>> >> >> >> > costs. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > Overall, I feel like the first half of the KIP > > is > > > > about > > > > >> > the > > > > >> > > >>>> ELR, > > > > >> > > >>>> >> and > > > > >> > > >>>> >> >> >> > the second half is about reworking unclean > > leader > > > > >> > > election. It > > > > >> > > >>>> >> might > > > > >> > > >>>> >> >> be > > > > >> > > >>>> >> >> >> > better to move that second half to a separate > > KIP > > > so > > > > >> that > > > > >> > > we > > > > >> > > >>>> can > > > > >> > > >>>> >> >> figure > > > > >> > > >>>> >> >> >> > it out fully. It should be fine to punt this > > until > > > > >> later > > > > >> > > and > > > > >> > > >>>> just > > > > >> > > >>>> >> have > > > > >> > > >>>> >> >> >> > the current behavior on empty ELR be waiting > for > > > the > > > > >> last > > > > >> > > >>>> known > > > > >> > > >>>> >> leader > > > > >> > > >>>> >> >> >> > to return. After all, that's what we do today. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> >> DescribeTopicRequest > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > Is the intention for AdminClient to use this > RPC > > > for > > > > >> > > >>>> >> >> >> > Admin#describeTopics ? If so, we need to > > describe > > > > all > > > > >> of > > > > >> > > the > > > > >> > > >>>> >> changes > > > > >> > > >>>> >> >> to > > > > >> > > >>>> >> >> >> > the admin client API, as well as changes to > > > > >> command-line > > > > >> > > >>>> tools like > > > > >> > > >>>> >> >> >> > kafka-topics.sh (if there are any). For > example, > > > you > > > > >> will > > > > >> > > >>>> probably > > > > >> > > >>>> >> >> need > > > > >> > > >>>> >> >> >> > changes to TopicDescription.java. You will > also > > > > need to > > > > >> > > >>>> provide > > > > >> > > >>>> >> all of > > > > >> > > >>>> >> >> >> > the things that admin client needs -- for > > example, > > > > >> > > >>>> >> >> >> > TopicAuthorizedOperations. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > I also don't think the controller should serve > > > this > > > > >> > > request. > > > > >> > > >>>> We > > > > >> > > >>>> >> want > > > > >> > > >>>> >> >> to > > > > >> > > >>>> >> >> >> > minimize load on the controller. Just like > with > > > the > > > > >> other > > > > >> > > >>>> metadata > > > > >> > > >>>> >> >> >> > requests like MetadataRequest, this should be > > > > served by > > > > >> > > >>>> brokers. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > It's a bit confusing why both topic ID and > topic > > > > name > > > > >> are > > > > >> > > >>>> provided > > > > >> > > >>>> >> to > > > > >> > > >>>> >> >> >> > this API. Is the intention that callers should > > set > > > > one > > > > >> > but > > > > >> > > >>>> not the > > > > >> > > >>>> >> >> >> > other? Or both? This needs to be clarified. > > Also, > > > > if we > > > > >> > do > > > > >> > > >>>> want to > > > > >> > > >>>> >> >> >> > support lookups by UUID, that is another thing > > > that > > > > >> needs > > > > >> > > to > > > > >> > > >>>> be > > > > >> > > >>>> >> added > > > > >> > > >>>> >> >> >> > to adminclient. > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > In general, I feel like this should also > > probably > > > be > > > > >> its > > > > >> > > own > > > > >> > > >>>> KIP > > > > >> > > >>>> >> since > > > > >> > > >>>> >> >> >> > it's fairly complex > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > best, > > > > >> > > >>>> >> >> >> > Colin > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > > > > > >> > > >>>> >> >> >> > On Thu, Aug 10, 2023, at 15:46, Calvin Liu > > wrote: > > > > >> > > >>>> >> >> >> >> Hi everyone, > > > > >> > > >>>> >> >> >> >> I'd like to discuss a series of enhancement > to > > > the > > > > >> > > >>>> replication > > > > >> > > >>>> >> >> protocol. > > > > >> > > >>>> >> >> >> >> > > > > >> > > >>>> >> >> >> >> A partition replica can experience local data > > > loss > > > > in > > > > >> > > unclean > > > > >> > > >>>> >> >> shutdown > > > > >> > > >>>> >> >> >> >> scenarios where unflushed data in the OS page > > > > cache is > > > > >> > > lost > > > > >> > > >>>> - such > > > > >> > > >>>> >> >> as an > > > > >> > > >>>> >> >> >> >> availability zone power outage or a server > > error. > > > > The > > > > >> > > Kafka > > > > >> > > >>>> >> >> replication > > > > >> > > >>>> >> >> >> >> protocol is designed to handle these > situations > > > by > > > > >> > > removing > > > > >> > > >>>> such > > > > >> > > >>>> >> >> >> replicas > > > > >> > > >>>> >> >> >> >> from the ISR and only re-adding them once > they > > > have > > > > >> > caught > > > > >> > > >>>> up and > > > > >> > > >>>> >> >> >> therefore > > > > >> > > >>>> >> >> >> >> recovered any lost data. This prevents > replicas > > > > that > > > > >> > lost > > > > >> > > an > > > > >> > > >>>> >> >> arbitrary > > > > >> > > >>>> >> >> >> log > > > > >> > > >>>> >> >> >> >> suffix, which included committed data, from > > being > > > > >> > elected > > > > >> > > >>>> leader. > > > > >> > > >>>> >> >> >> >> However, there is a "last replica standing" > > state > > > > >> which > > > > >> > > when > > > > >> > > >>>> >> combined > > > > >> > > >>>> >> >> >> with > > > > >> > > >>>> >> >> >> >> a data loss unclean shutdown event can turn a > > > local > > > > >> data > > > > >> > > loss > > > > >> > > >>>> >> >> scenario > > > > >> > > >>>> >> >> >> into > > > > >> > > >>>> >> >> >> >> a global data loss scenario, i.e., committed > > data > > > > can > > > > >> be > > > > >> > > >>>> removed > > > > >> > > >>>> >> from > > > > >> > > >>>> >> >> >> all > > > > >> > > >>>> >> >> >> >> replicas. When the last replica in the ISR > > > > experiences > > > > >> > an > > > > >> > > >>>> unclean > > > > >> > > >>>> >> >> >> shutdown > > > > >> > > >>>> >> >> >> >> and loses committed data, it will be > reelected > > > > leader > > > > >> > > after > > > > >> > > >>>> >> starting > > > > >> > > >>>> >> >> up > > > > >> > > >>>> >> >> >> >> again, causing rejoining followers to > truncate > > > > their > > > > >> > logs > > > > >> > > and > > > > >> > > >>>> >> thereby > > > > >> > > >>>> >> >> >> >> removing the last copies of the committed > > records > > > > >> which > > > > >> > > the > > > > >> > > >>>> leader > > > > >> > > >>>> >> >> lost > > > > >> > > >>>> >> >> >> >> initially. > > > > >> > > >>>> >> >> >> >> > > > > >> > > >>>> >> >> >> >> The new KIP will maximize the protection and > > > > provides > > > > >> > > >>>> MinISR-1 > > > > >> > > >>>> >> >> >> tolerance to > > > > >> > > >>>> >> >> >> >> data loss unclean shutdown events. > > > > >> > > >>>> >> >> >> >> > > > > >> > > >>>> >> >> >> >> > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> > > > > >> > > >>>> > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas > > > > >> > > >>>> >> >> >> > > > > >> > > >>>> >> >> > > > > >> > > >>>> >> > > > > >> > > >>>> > > > > >> > > >>> > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > >