Thanks for raising this here, Calvin. Since this is the first "streaming
results" type API in KafkaAdminClient (as far as I know), we're treading
new ground here.

As you mentioned, we can either accept a consumer or return some iterable
result. Returning a java.util.Stream is also an option, and a bit more
modern/convenient than java.util.Iterator. Personally, I like the consumer
approach, but I'm interested in hearing other's opinions.

This actually brings up another question: Do we think it's safe to assume
that one topic's description can fit into memory? The RPC supports paging
across partitions within a single topic, so maybe the admin API should as
well?

-David

On Fri, Feb 23, 2024 at 12:22 PM Calvin Liu <ca...@confluent.io> wrote:

> Hey,
> As we agreed to implement the pagination for the new API
> DescribeTopicPartitions, the client side must also add a proper interface
> to handle the pagination.
> The current KafkaAdminClient.describeTopics returns
> the DescribeTopicsResult which is the future for querying all the topics.
> It is awkward to fit the pagination into it because
>
>    1. Each future corresponds to a topic. We also want to have the
>    pagination on huge topics for their partitions.
>    2. To avoid OOM, we should only fetch the new topics when we need them
>    and release the used topics. Especially the main use case of looping the
>    topic list is when the client prints all the topics.
>
> So, to better serve the pagination, @David Arthur
> <david.art...@confluent.io> suggested to add a new interface in the Admin
> client between the following 2.
>
> describeTopics(TopicCollection topics, DescribeTopicsOptions options, 
> Consumer<TopicDescription>);
>
> Iterator<TopicDescription> describeTopics(TopicCollection topics, 
> DescribeTopicsOptions options);
>
> David and I would prefer the first Consumer version which works better as a 
> stream purposes.
>
>
> On Wed, Oct 11, 2023 at 4:28 PM Calvin Liu <ca...@confluent.io> wrote:
>
>> Hi David,
>> Thanks for the comment.
>> Yes, we can separate the ELR enablement from the metadata version. It is
>> also helpful to avoid blocking the following MV releases if the user is not
>> ready for ELR.
>> One thing to correct is that, the Unclean recovery is controlled
>> by unclean.recovery.manager.enabled, a separate config
>> from unclean.recovery.strategy. It determines whether unclean recovery will
>> be used in an unclean leader election.
>> Thanks
>>
>> On Wed, Oct 11, 2023 at 4:11 PM David Arthur <mum...@gmail.com> wrote:
>>
>>> One thing we should consider is a static config to totally enable/disable
>>> the ELR feature. If I understand the KIP correctly, we can effectively
>>> disable the unclean recovery by setting the recovery strategy config to
>>> "none".
>>>
>>> This would make development and rollout of this feature a bit smoother.
>>> Consider the case that we find bugs in ELR after a cluster has updated to
>>> its MetadataVersion. It's simpler to disable the feature through config
>>> rather than going through a MetadataVersion downgrade (once that's
>>> supported).
>>>
>>> Does that make sense?
>>>
>>> -David
>>>
>>> On Wed, Oct 11, 2023 at 1:40 PM Calvin Liu <ca...@confluent.io.invalid>
>>> wrote:
>>>
>>> > Hi Jun
>>> > -Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
>>> > -No new value is added. The LeaderRecoveryState will still be set to 1
>>> if
>>> > we have an unclean leader election. The unclean leader election
>>> includes
>>> > the old random way and the unclean recovery. During the unclean
>>> recovery,
>>> > the LeaderRecoveryState will not change until the controller decides to
>>> > update the records with the new leader.
>>> > Thanks
>>> >
>>> > On Wed, Oct 11, 2023 at 9:02 AM Jun Rao <j...@confluent.io.invalid>
>>> wrote:
>>> >
>>> > > Hi, Calvin,
>>> > >
>>> > > Another thing. Currently, when there is an unclean leader election,
>>> we
>>> > set
>>> > > the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord
>>> to
>>> > 1.
>>> > > With the KIP, will there be new values for LeaderRecoveryState? If
>>> not,
>>> > > when will LeaderRecoveryState be set to 1?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > > On Tue, Oct 10, 2023 at 4:24 PM Jun Rao <j...@confluent.io> wrote:
>>> > >
>>> > > > Hi, Calvin,
>>> > > >
>>> > > > One more comment.
>>> > > >
>>> > > > "The first partition to fetch details for. -1 means to fetch all
>>> > > > partitions." It seems that FirstPartitionId of 0 naturally means
>>> > fetching
>>> > > > all partitions?
>>> > > >
>>> > > > Thanks,
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > > On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu
>>> <ca...@confluent.io.invalid
>>> > >
>>> > > > wrote:
>>> > > >
>>> > > >> Hi Jun,
>>> > > >> Yeah, with the current Metadata request handling, we only return
>>> > errors
>>> > > on
>>> > > >> the Topic level, like topic not found. It seems that querying a
>>> > specific
>>> > > >> partition is not a valid use case. Will update.
>>> > > >> Thanks
>>> > > >>
>>> > > >> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao <j...@confluent.io.invalid
>>> >
>>> > > >> wrote:
>>> > > >>
>>> > > >> > Hi, Calvin,
>>> > > >> >
>>> > > >> > 60.  If the range query has errors for some of the partitions,
>>> do we
>>> > > >> expect
>>> > > >> > different responses when querying particular partitions?
>>> > > >> >
>>> > > >> > Thanks,
>>> > > >> >
>>> > > >> > Jun
>>> > > >> >
>>> > > >> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu
>>> > > <ca...@confluent.io.invalid
>>> > > >> >
>>> > > >> > wrote:
>>> > > >> >
>>> > > >> > > Hi Jun
>>> > > >> > > 60. Yes, it is a good question. I was thinking the API could
>>> be
>>> > > >> flexible
>>> > > >> > to
>>> > > >> > > query the particular partitions if the range query has errors
>>> for
>>> > > >> some of
>>> > > >> > > the partitions. Not sure whether it is a valid assumption,
>>> what do
>>> > > you
>>> > > >> > > think?
>>> > > >> > >
>>> > > >> > > 61. Good point, I will update them to partition level with the
>>> > same
>>> > > >> > limit.
>>> > > >> > >
>>> > > >> > > 62. Sure, will do.
>>> > > >> > >
>>> > > >> > > Thanks
>>> > > >> > >
>>> > > >> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao
>>> <j...@confluent.io.invalid
>>> > >
>>> > > >> > wrote:
>>> > > >> > >
>>> > > >> > > > Hi, Calvin,
>>> > > >> > > >
>>> > > >> > > > A few more minor comments on your latest update.
>>> > > >> > > >
>>> > > >> > > > 60. DescribeTopicRequest: When will the Partitions field be
>>> > used?
>>> > > It
>>> > > >> > > seems
>>> > > >> > > > that the FirstPartitionId field is enough for AdminClient
>>> usage.
>>> > > >> > > >
>>> > > >> > > > 61. Could we make the limit for DescribeTopicRequest,
>>> > > >> > > ElectLeadersRequest,
>>> > > >> > > > GetReplicaLogInfo consistent? Currently,
>>> ElectLeadersRequest's
>>> > > >> limit is
>>> > > >> > > at
>>> > > >> > > > topic level and GetReplicaLogInfo has a different partition
>>> > level
>>> > > >> limit
>>> > > >> > > > from DescribeTopicRequest.
>>> > > >> > > >
>>> > > >> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same
>>> > level
>>> > > >> as
>>> > > >> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP,
>>> it
>>> > > looks
>>> > > >> > like
>>> > > >> > > > it's at the same level as
>>> ElectLeadersRequest.TopicPartitions.
>>> > > >> > > >
>>> > > >> > > > Thanks,
>>> > > >> > > >
>>> > > >> > > > Jun
>>> > > >> > > >
>>> > > >> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
>>> > > >> <ca...@confluent.io.invalid>
>>> > > >> > > > wrote:
>>> > > >> > > >
>>> > > >> > > > > Hi David,
>>> > > >> > > > > Thanks for the comments.
>>> > > >> > > > > ----
>>> > > >> > > > > I thought that a new snapshot with the downgraded MV is
>>> > created
>>> > > in
>>> > > >> > this
>>> > > >> > > > > case. Isn’t it the case?
>>> > > >> > > > > Yes, you are right, a metadata delta will be generated
>>> after
>>> > the
>>> > > >> MV
>>> > > >> > > > > downgrade. Then the user can start the software downgrade.
>>> > > >> > > > > -----
>>> > > >> > > > > Could you also elaborate a bit more on the reasoning
>>> behind
>>> > > adding
>>> > > >> > the
>>> > > >> > > > > limits to the admin RPCs? This is a new pattern in Kafka
>>> so it
>>> > > >> would
>>> > > >> > be
>>> > > >> > > > > good to clear on the motivation.
>>> > > >> > > > > Thanks to Colin for bringing it up. The current
>>> > MetadataRequest
>>> > > >> does
>>> > > >> > > not
>>> > > >> > > > > have a limit on the number of topics to query in a single
>>> > > request.
>>> > > >> > > > Massive
>>> > > >> > > > > requests can mess up the JVM. We want to have some sort of
>>> > > >> throttle
>>> > > >> > on
>>> > > >> > > > the
>>> > > >> > > > > new APIs.
>>> > > >> > > > > -----
>>> > > >> > > > > Could you also explain how the client is supposed to
>>> handle
>>> > the
>>> > > >> > > > > topics/partitions above the limit? I suppose that it will
>>> have
>>> > > to
>>> > > >> > retry
>>> > > >> > > > > those, correct?
>>> > > >> > > > > Corrent. For the official admin clients, it will split the
>>> > large
>>> > > >> > > request
>>> > > >> > > > > into proper pieces and query one after another.
>>> > > >> > > > > -----
>>> > > >> > > > > My understanding is that the topics/partitions above the
>>> limit
>>> > > >> will
>>> > > >> > be
>>> > > >> > > > > failed with an invalid exception error. I wonder if this
>>> > choice
>>> > > is
>>> > > >> > > > > judicious because the invalide request exception is
>>> usually
>>> > > >> fatal. It
>>> > > >> > > may
>>> > > >> > > > > be better to use an new and explicit error for this case.
>>> > > >> > > > >
>>> > > >> > > > > Thanks for bringing this up. How about
>>> > "REQUEST_LIMIT_REACHED"?
>>> > > >> > > > > --------
>>> > > >> > > > > It seems that we still need to specify the changes to the
>>> > admin
>>> > > >> api
>>> > > >> > to
>>> > > >> > > > > accommodate the new or updated apis. Do you plan to add
>>> them?
>>> > > >> > > > > Try to cover the following
>>> > > >> > > > > 1. The admin client will use the new DescribeTopicRequest
>>> to
>>> > > query
>>> > > >> > the
>>> > > >> > > > > topics
>>> > > >> > > > > 2. Mention the API limit and the new retriable error.
>>> > > >> > > > > 3. Output changes for the admin client when describing a
>>> topic
>>> > > >> (new
>>> > > >> > > > fields
>>> > > >> > > > > of ELR...)
>>> > > >> > > > > 4. Changes to data structures like TopicPartitionInfo to
>>> > include
>>> > > >> the
>>> > > >> > > ELR.
>>> > > >> > > > > Anything else I missed?
>>> > > >> > > > >
>>> > > >> > > > > Thanks!
>>> > > >> > > > >
>>> > > >> > > > >
>>> > > >> > > > >
>>> > > >> > > > >
>>> > > >> > > > >
>>> > > >> > > > > On Wed, Oct 4, 2023 at 12:27 PM David Jacot <
>>> > > >> david.ja...@gmail.com>
>>> > > >> > > > wrote:
>>> > > >> > > > >
>>> > > >> > > > > > Hi Calvin,
>>> > > >> > > > > >
>>> > > >> > > > > > I thought that a new snapshot with the downgraded MV is
>>> > > created
>>> > > >> in
>>> > > >> > > this
>>> > > >> > > > > > case. Isn’t it the case?
>>> > > >> > > > > >
>>> > > >> > > > > > Could you also elaborate a bit more on the reasoning
>>> behind
>>> > > >> adding
>>> > > >> > > the
>>> > > >> > > > > > limits to the admin RPCs? This is a new pattern in
>>> Kafka so
>>> > it
>>> > > >> > would
>>> > > >> > > be
>>> > > >> > > > > > good to clear on the motivation.
>>> > > >> > > > > >
>>> > > >> > > > > > Could you also explain how the client is supposed to
>>> handle
>>> > > the
>>> > > >> > > > > > topics/partitions above the limit? I suppose that it
>>> will
>>> > have
>>> > > >> to
>>> > > >> > > retry
>>> > > >> > > > > > those, correct?
>>> > > >> > > > > >
>>> > > >> > > > > > My understanding is that the topics/partitions above the
>>> > limit
>>> > > >> will
>>> > > >> > > be
>>> > > >> > > > > > failed with an invalid exception error. I wonder if this
>>> > > choice
>>> > > >> is
>>> > > >> > > > > > judicious because the invalide request exception is
>>> usually
>>> > > >> fatal.
>>> > > >> > It
>>> > > >> > > > may
>>> > > >> > > > > > be better to use an new and explicit error for this
>>> case.
>>> > > >> > > > > >
>>> > > >> > > > > > It seems that we still need to specify the changes to
>>> the
>>> > > admin
>>> > > >> api
>>> > > >> > > to
>>> > > >> > > > > > accommodate the new or updated apis. Do you plan to add
>>> > them?
>>> > > >> > > > > >
>>> > > >> > > > > > Best,
>>> > > >> > > > > > David
>>> > > >> > > > > >
>>> > > >> > > > > > Le mer. 4 oct. 2023 à 20:39, Calvin Liu
>>> > > >> <ca...@confluent.io.invalid
>>> > > >> > >
>>> > > >> > > a
>>> > > >> > > > > > écrit :
>>> > > >> > > > > >
>>> > > >> > > > > > > Hi Jun,
>>> > > >> > > > > > > After the MV downgrade, the controller will write in
>>> the
>>> > old
>>> > > >> > > version
>>> > > >> > > > of
>>> > > >> > > > > > the
>>> > > >> > > > > > > PartitionRecord/PartitionChangeRecord. If I understand
>>> > > >> correctly,
>>> > > >> > > it
>>> > > >> > > > is
>>> > > >> > > > > > > possible to downgrade the software version if the
>>> > controller
>>> > > >> only
>>> > > >> > > has
>>> > > >> > > > > to
>>> > > >> > > > > > > handle old version records.
>>> > > >> > > > > > > However, the controller will not automatically
>>> rewrite the
>>> > > >> > > > > > PartitionRecord
>>> > > >> > > > > > > with the old version unless there is a partition
>>> update.
>>> > > Then,
>>> > > >> > the
>>> > > >> > > > user
>>> > > >> > > > > > may
>>> > > >> > > > > > > have to wait an unknown amount of time before the
>>> software
>>> > > >> > > downgrades
>>> > > >> > > > > > > unless they do a roll to force update every
>>> partition. If
>>> > it
>>> > > >> > makes
>>> > > >> > > > > > sense, I
>>> > > >> > > > > > > can mention these steps to do a software downgrade.
>>> > > >> > > > > > > Thanks
>>> > > >> > > > > > >
>>> > > >> > > > > > > On Wed, Oct 4, 2023 at 11:20 AM Jun Rao
>>> > > >> <j...@confluent.io.invalid
>>> > > >> > >
>>> > > >> > > > > > wrote:
>>> > > >> > > > > > >
>>> > > >> > > > > > > > Hi, Calvin and Justine,
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > Historically, when we change the record format in
>>> the
>>> > log,
>>> > > >> we
>>> > > >> > > don't
>>> > > >> > > > > > > support
>>> > > >> > > > > > > > software version downgrading.
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > For the record format change in the metadata log,
>>> have
>>> > we
>>> > > >> > thought
>>> > > >> > > > > about
>>> > > >> > > > > > > > forcing the write of the latest metadata records
>>> with
>>> > the
>>> > > >> old
>>> > > >> > > > version
>>> > > >> > > > > > > > during MV downgrading? This will in theory allow
>>> the old
>>> > > >> > version
>>> > > >> > > of
>>> > > >> > > > > the
>>> > > >> > > > > > > > software to obtain the latest metadata.
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > Thanks,
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > Jun
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > On Wed, Oct 4, 2023 at 9:53 AM Justine Olshan
>>> > > >> > > > > > > <jols...@confluent.io.invalid
>>> > > >> > > > > > > > >
>>> > > >> > > > > > > > wrote:
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > > Sorry -- not MV but software version.
>>> > > >> > > > > > > > >
>>> > > >> > > > > > > > > On Wed, Oct 4, 2023 at 9:51 AM Justine Olshan <
>>> > > >> > > > > jols...@confluent.io>
>>> > > >> > > > > > > > > wrote:
>>> > > >> > > > > > > > >
>>> > > >> > > > > > > > > > Catching up with this discussion.
>>> > > >> > > > > > > > > >
>>> > > >> > > > > > > > > > I was just curious -- have we had other
>>> instances
>>> > > where
>>> > > >> > > > > downgrading
>>> > > >> > > > > > > MV
>>> > > >> > > > > > > > is
>>> > > >> > > > > > > > > > not supported? I think Kafka typically tries to
>>> > > support
>>> > > >> > > > > downgrades,
>>> > > >> > > > > > > > and I
>>> > > >> > > > > > > > > > couldn't think of other examples.
>>> > > >> > > > > > > > > >
>>> > > >> > > > > > > > > > Thanks,
>>> > > >> > > > > > > > > > Justine
>>> > > >> > > > > > > > > >
>>> > > >> > > > > > > > > > On Wed, Oct 4, 2023 at 9:40 AM Calvin Liu
>>> > > >> > > > > > <ca...@confluent.io.invalid
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > > > wrote:
>>> > > >> > > > > > > > > >
>>> > > >> > > > > > > > > >> Hi Jun,
>>> > > >> > > > > > > > > >> 54. Marked the software downgrading is not
>>> > supported.
>>> > > >> As
>>> > > >> > the
>>> > > >> > > > old
>>> > > >> > > > > > > > > >> controller
>>> > > >> > > > > > > > > >> will not understand the new PartitionRecord and
>>> > > >> > > > > > > PartitionChangeRecord.
>>> > > >> > > > > > > > > >> Thanks!
>>> > > >> > > > > > > > > >>
>>> > > >> > > > > > > > > >> On Wed, Oct 4, 2023 at 9:12 AM Jun Rao
>>> > > >> > > > <j...@confluent.io.invalid
>>> > > >> > > > > >
>>> > > >> > > > > > > > > wrote:
>>> > > >> > > > > > > > > >>
>>> > > >> > > > > > > > > >> > Hi, Calvin,
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > Thanks for the reply. Just one more comment.
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > 54. It seems that downgrading MV is
>>> supported. Is
>>> > > >> > > > downgrading
>>> > > >> > > > > > the
>>> > > >> > > > > > > > > >> software
>>> > > >> > > > > > > > > >> > version supported? It would be useful to
>>> document
>>> > > >> that.
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > Thanks,
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > Jun
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > On Tue, Oct 3, 2023 at 4:55 PM Artem Livshits
>>> > > >> > > > > > > > > >> > <alivsh...@confluent.io.invalid> wrote:
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > > Hi Colin,
>>> > > >> > > > > > > > > >> > >
>>> > > >> > > > > > > > > >> > > I think in your example
>>> "do_unclean_recovery"
>>> > > would
>>> > > >> > need
>>> > > >> > > > to
>>> > > >> > > > > do
>>> > > >> > > > > > > > > >> different
>>> > > >> > > > > > > > > >> > > things depending on the strategy.
>>> > > >> > > > > > > > > >> > >
>>> > > >> > > > > > > > > >> > > do_unclean_recovery() {
>>> > > >> > > > > > > > > >> > >    if (unclean.recovery.manager.enabled) {
>>> > > >> > > > > > > > > >> > >     if (strategy == Aggressive)
>>> > > >> > > > > > > > > >> > >       use
>>> > > >> > UncleanRecoveryManager(waitLastKnownERL=false)
>>> > > >> > > > //
>>> > > >> > > > > > > just
>>> > > >> > > > > > > > > >> inspect
>>> > > >> > > > > > > > > >> > > logs from whoever is available
>>> > > >> > > > > > > > > >> > >     else
>>> > > >> > > > > > > > > >> > >       use
>>> > > >> > UncleanRecoveryManager(waitLastKnownERL=true)
>>> > > >> > > > //
>>> > > >> > > > > > > must
>>> > > >> > > > > > > > > wait
>>> > > >> > > > > > > > > >> > for
>>> > > >> > > > > > > > > >> > > at least last known ELR
>>> > > >> > > > > > > > > >> > >   } else {
>>> > > >> > > > > > > > > >> > >     if (strategy == Aggressive)
>>> > > >> > > > > > > > > >> > >       choose the last known leader if that
>>> is
>>> > > >> > available,
>>> > > >> > > > or
>>> > > >> > > > > a
>>> > > >> > > > > > > > random
>>> > > >> > > > > > > > > >> > leader
>>> > > >> > > > > > > > > >> > > if not)
>>> > > >> > > > > > > > > >> > >     else
>>> > > >> > > > > > > > > >> > >       wait for last known leader to get
>>> back
>>> > > >> > > > > > > > > >> > >   }
>>> > > >> > > > > > > > > >> > > }
>>> > > >> > > > > > > > > >> > >
>>> > > >> > > > > > > > > >> > > The idea is that the Aggressive strategy
>>> would
>>> > > >> kick in
>>> > > >> > > as
>>> > > >> > > > > soon
>>> > > >> > > > > > > as
>>> > > >> > > > > > > > we
>>> > > >> > > > > > > > > >> lost
>>> > > >> > > > > > > > > >> > > the leader and would pick a leader from
>>> whoever
>>> > > is
>>> > > >> > > > > available;
>>> > > >> > > > > > > but
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > Balanced will only kick in when ELR is
>>> empty
>>> > and
>>> > > >> will
>>> > > >> > > wait
>>> > > >> > > > > for
>>> > > >> > > > > > > the
>>> > > >> > > > > > > > > >> > brokers
>>> > > >> > > > > > > > > >> > > that likely have most data to be available.
>>> > > >> > > > > > > > > >> > >
>>> > > >> > > > > > > > > >> > > On Tue, Oct 3, 2023 at 3:04 PM Colin
>>> McCabe <
>>> > > >> > > > > > cmcc...@apache.org
>>> > > >> > > > > > > >
>>> > > >> > > > > > > > > >> wrote:
>>> > > >> > > > > > > > > >> > >
>>> > > >> > > > > > > > > >> > > > On Tue, Oct 3, 2023, at 10:49, Jun Rao
>>> wrote:
>>> > > >> > > > > > > > > >> > > > > Hi, Calvin,
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > Thanks for the update KIP. A few more
>>> > > comments.
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 41. Why would a user choose the option
>>> to
>>> > > >> select a
>>> > > >> > > > > random
>>> > > >> > > > > > > > > replica
>>> > > >> > > > > > > > > >> as
>>> > > >> > > > > > > > > >> > > the
>>> > > >> > > > > > > > > >> > > > > leader instead of using
>>> > > >> > > > > > unclean.recovery.strateg=Aggressive?
>>> > > >> > > > > > > > It
>>> > > >> > > > > > > > > >> seems
>>> > > >> > > > > > > > > >> > > > that
>>> > > >> > > > > > > > > >> > > > > the latter is strictly better? If
>>> that's
>>> > not
>>> > > >> the
>>> > > >> > > case,
>>> > > >> > > > > > could
>>> > > >> > > > > > > > we
>>> > > >> > > > > > > > > >> fold
>>> > > >> > > > > > > > > >> > > this
>>> > > >> > > > > > > > > >> > > > > option under unclean.recovery.strategy
>>> > > instead
>>> > > >> of
>>> > > >> > > > > > > introducing
>>> > > >> > > > > > > > a
>>> > > >> > > > > > > > > >> > > separate
>>> > > >> > > > > > > > > >> > > > > config?
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > Hi Jun,
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > I thought the flow of control was:
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > If there is no leader for the partition {
>>> > > >> > > > > > > > > >> > > >   If (there are unfenced ELR members) {
>>> > > >> > > > > > > > > >> > > >     choose_an_unfenced_ELR_member
>>> > > >> > > > > > > > > >> > > >   } else if (there are fenced ELR
>>> members AND
>>> > > >> > > > > > > > > strategy=Aggressive) {
>>> > > >> > > > > > > > > >> > > >     do_unclean_recovery
>>> > > >> > > > > > > > > >> > > >   } else if (there are no ELR members AND
>>> > > >> strategy
>>> > > >> > !=
>>> > > >> > > > > None)
>>> > > >> > > > > > {
>>> > > >> > > > > > > > > >> > > >     do_unclean_recovery
>>> > > >> > > > > > > > > >> > > >   } else {
>>> > > >> > > > > > > > > >> > > >     do nothing about the missing leader
>>> > > >> > > > > > > > > >> > > >   }
>>> > > >> > > > > > > > > >> > > > }
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > do_unclean_recovery() {
>>> > > >> > > > > > > > > >> > > >    if (unclean.recovery.manager.enabled)
>>> {
>>> > > >> > > > > > > > > >> > > >     use UncleanRecoveryManager
>>> > > >> > > > > > > > > >> > > >   } else {
>>> > > >> > > > > > > > > >> > > >     choose the last known leader if that
>>> is
>>> > > >> > available,
>>> > > >> > > > or
>>> > > >> > > > > a
>>> > > >> > > > > > > > random
>>> > > >> > > > > > > > > >> > leader
>>> > > >> > > > > > > > > >> > > > if not)
>>> > > >> > > > > > > > > >> > > >   }
>>> > > >> > > > > > > > > >> > > > }
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > However, I think this could be clarified,
>>> > > >> especially
>>> > > >> > > the
>>> > > >> > > > > > > > behavior
>>> > > >> > > > > > > > > >> when
>>> > > >> > > > > > > > > >> > > > unclean.recovery.manager.enabled=false.
>>> > > >> Inuitively
>>> > > >> > the
>>> > > >> > > > > goal
>>> > > >> > > > > > > for
>>> > > >> > > > > > > > > >> > > > unclean.recovery.manager.enabled=false
>>> is to
>>> > be
>>> > > >> "the
>>> > > >> > > > same
>>> > > >> > > > > as
>>> > > >> > > > > > > > now,
>>> > > >> > > > > > > > > >> > mostly"
>>> > > >> > > > > > > > > >> > > > but it's very underspecified in the KIP,
>>> I
>>> > > agree.
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 50. ElectLeadersRequest: "If more than
>>> 20
>>> > > >> topics
>>> > > >> > are
>>> > > >> > > > > > > included,
>>> > > >> > > > > > > > > >> only
>>> > > >> > > > > > > > > >> > the
>>> > > >> > > > > > > > > >> > > > > first 20 will be served. Others will be
>>> > > >> returned
>>> > > >> > > with
>>> > > >> > > > > > > > > >> > DesiredLeaders."
>>> > > >> > > > > > > > > >> > > > Hmm,
>>> > > >> > > > > > > > > >> > > > > not sure that I understand this.
>>> > > >> > > ElectLeadersResponse
>>> > > >> > > > > > > doesn't
>>> > > >> > > > > > > > > >> have a
>>> > > >> > > > > > > > > >> > > > > DesiredLeaders field.
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 51. GetReplicaLogInfo: "If more than
>>> 2000
>>> > > >> > partitions
>>> > > >> > > > are
>>> > > >> > > > > > > > > included,
>>> > > >> > > > > > > > > >> > only
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > > first 2000 will be served" Do we
>>> return an
>>> > > >> error
>>> > > >> > for
>>> > > >> > > > the
>>> > > >> > > > > > > > > remaining
>>> > > >> > > > > > > > > >> > > > > partitions? Actually, should we
>>> include an
>>> > > >> > errorCode
>>> > > >> > > > > field
>>> > > >> > > > > > > at
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > partition
>>> > > >> > > > > > > > > >> > > > > level in GetReplicaLogInfoResponse to
>>> cover
>>> > > >> > > > non-existing
>>> > > >> > > > > > > > > >> partitions
>>> > > >> > > > > > > > > >> > and
>>> > > >> > > > > > > > > >> > > > no
>>> > > >> > > > > > > > > >> > > > > authorization, etc?
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 52. The entry should matches => The
>>> entry
>>> > > >> should
>>> > > >> > > match
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 53. ElectLeadersRequest.DesiredLeaders:
>>> > > Should
>>> > > >> it
>>> > > >> > be
>>> > > >> > > > > > > nullable
>>> > > >> > > > > > > > > >> since a
>>> > > >> > > > > > > > > >> > > > user
>>> > > >> > > > > > > > > >> > > > > may not specify DesiredLeaders?
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 54. Downgrade: Is that indeed
>>> possible? I
>>> > > >> thought
>>> > > >> > > > > earlier
>>> > > >> > > > > > > you
>>> > > >> > > > > > > > > said
>>> > > >> > > > > > > > > >> > that
>>> > > >> > > > > > > > > >> > > > > once the new version of the records
>>> are in
>>> > > the
>>> > > >> > > > metadata
>>> > > >> > > > > > log,
>>> > > >> > > > > > > > one
>>> > > >> > > > > > > > > >> > can't
>>> > > >> > > > > > > > > >> > > > > downgrade since the old broker doesn't
>>> know
>>> > > >> how to
>>> > > >> > > > parse
>>> > > >> > > > > > the
>>> > > >> > > > > > > > new
>>> > > >> > > > > > > > > >> > > version
>>> > > >> > > > > > > > > >> > > > of
>>> > > >> > > > > > > > > >> > > > > the metadata records?
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > MetadataVersion downgrade is currently
>>> broken
>>> > > >> but we
>>> > > >> > > > have
>>> > > >> > > > > > > fixing
>>> > > >> > > > > > > > > it
>>> > > >> > > > > > > > > >> on
>>> > > >> > > > > > > > > >> > > our
>>> > > >> > > > > > > > > >> > > > plate for Kafka 3.7.
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > The way downgrade works is that "new
>>> > features"
>>> > > >> are
>>> > > >> > > > > dropped,
>>> > > >> > > > > > > > > leaving
>>> > > >> > > > > > > > > >> > only
>>> > > >> > > > > > > > > >> > > > the old ones.
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > > 55. CleanShutdownFile: Should we add a
>>> > > version
>>> > > >> > field
>>> > > >> > > > for
>>> > > >> > > > > > > > future
>>> > > >> > > > > > > > > >> > > > extension?
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > 56. Config changes are public facing.
>>> Could
>>> > > we
>>> > > >> > have
>>> > > >> > > a
>>> > > >> > > > > > > separate
>>> > > >> > > > > > > > > >> > section
>>> > > >> > > > > > > > > >> > > to
>>> > > >> > > > > > > > > >> > > > > document all the config changes?
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > +1. A separate section for this would be
>>> > good.
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > best,
>>> > > >> > > > > > > > > >> > > > Colin
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > Thanks,
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > Jun
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > > On Mon, Sep 25, 2023 at 4:29 PM Calvin
>>> Liu
>>> > > >> > > > > > > > > >> > <ca...@confluent.io.invalid
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > > > > > > > > >> > > > > wrote:
>>> > > >> > > > > > > > > >> > > > >
>>> > > >> > > > > > > > > >> > > > >> Hi Jun
>>> > > >> > > > > > > > > >> > > > >> Thanks for the comments.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 40. If we change to None, it is not
>>> > > guaranteed
>>> > > >> > for
>>> > > >> > > no
>>> > > >> > > > > > data
>>> > > >> > > > > > > > > loss.
>>> > > >> > > > > > > > > >> For
>>> > > >> > > > > > > > > >> > > > users
>>> > > >> > > > > > > > > >> > > > >> who are not able to validate the data
>>> with
>>> > > >> > external
>>> > > >> > > > > > > > resources,
>>> > > >> > > > > > > > > >> > manual
>>> > > >> > > > > > > > > >> > > > >> intervention does not give a better
>>> result
>>> > > >> but a
>>> > > >> > > loss
>>> > > >> > > > > of
>>> > > >> > > > > > > > > >> > availability.
>>> > > >> > > > > > > > > >> > > > So
>>> > > >> > > > > > > > > >> > > > >> practically speaking, the Balance mode
>>> > would
>>> > > >> be a
>>> > > >> > > > > better
>>> > > >> > > > > > > > > default
>>> > > >> > > > > > > > > >> > > value.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 41. No, it represents how we want to
>>> do
>>> > the
>>> > > >> > unclean
>>> > > >> > > > > > leader
>>> > > >> > > > > > > > > >> election.
>>> > > >> > > > > > > > > >> > > If
>>> > > >> > > > > > > > > >> > > > it
>>> > > >> > > > > > > > > >> > > > >> is false, the unclean leader election
>>> will
>>> > > be
>>> > > >> the
>>> > > >> > > old
>>> > > >> > > > > > > random
>>> > > >> > > > > > > > > way.
>>> > > >> > > > > > > > > >> > > > >> Otherwise, the unclean recovery will
>>> be
>>> > > used.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 42. Good catch. Updated.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 43. Only the first 20 topics will be
>>> > served.
>>> > > >> > Others
>>> > > >> > > > > will
>>> > > >> > > > > > be
>>> > > >> > > > > > > > > >> returned
>>> > > >> > > > > > > > > >> > > > with
>>> > > >> > > > > > > > > >> > > > >> InvalidRequestError
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 44. The order matters. The desired
>>> leader
>>> > > >> entries
>>> > > >> > > > match
>>> > > >> > > > > > > with
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > topic
>>> > > >> > > > > > > > > >> > > > >> partition list by the index.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 45. Thanks! Updated.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 46. Good advice! Updated.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 47.1, updated the comment. Basically
>>> it
>>> > will
>>> > > >> > elect
>>> > > >> > > > the
>>> > > >> > > > > > > > replica
>>> > > >> > > > > > > > > in
>>> > > >> > > > > > > > > >> > the
>>> > > >> > > > > > > > > >> > > > >> desiredLeader field to be the leader
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 47.2 We can let the admin client do
>>> the
>>> > > >> > conversion.
>>> > > >> > > > > Using
>>> > > >> > > > > > > the
>>> > > >> > > > > > > > > >> > > > desiredLeader
>>> > > >> > > > > > > > > >> > > > >> field in the json format seems easier
>>> for
>>> > > >> users.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 48. Once the MV version is
>>> downgraded, all
>>> > > the
>>> > > >> > ELR
>>> > > >> > > > > > related
>>> > > >> > > > > > > > > fields
>>> > > >> > > > > > > > > >> > will
>>> > > >> > > > > > > > > >> > > > be
>>> > > >> > > > > > > > > >> > > > >> removed on the next partition change.
>>> The
>>> > > >> > > controller
>>> > > >> > > > > will
>>> > > >> > > > > > > > also
>>> > > >> > > > > > > > > >> > ignore
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> ELR fields. Updated the KIP.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> 49. Yes, it would be
>>> deprecated/removed.
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> On Mon, Sep 25, 2023 at 3:49 PM Jun
>>> Rao
>>> > > >> > > > > > > > > <j...@confluent.io.invalid
>>> > > >> > > > > > > > > >> >
>>> > > >> > > > > > > > > >> > > > wrote:
>>> > > >> > > > > > > > > >> > > > >>
>>> > > >> > > > > > > > > >> > > > >> > Hi, Calvin,
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > Thanks for the updated KIP. Made
>>> another
>>> > > >> pass.
>>> > > >> > A
>>> > > >> > > > few
>>> > > >> > > > > > more
>>> > > >> > > > > > > > > >> comments
>>> > > >> > > > > > > > > >> > > > below.
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 40.
>>> unclean.leader.election.enable.false
>>> > > ->
>>> > > >> > > > > > > > > >> > > > >> > unclean.recovery.strategy.Balanced:
>>> The
>>> > > >> > Balanced
>>> > > >> > > > mode
>>> > > >> > > > > > > could
>>> > > >> > > > > > > > > >> still
>>> > > >> > > > > > > > > >> > > > lead to
>>> > > >> > > > > > > > > >> > > > >> > data loss. So, I am wondering if
>>> > > >> > > > > > > > > >> > > unclean.leader.election.enable.false
>>> > > >> > > > > > > > > >> > > > >> > should map to None?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 41.
>>> unclean.recovery.manager.enabled: I
>>> > am
>>> > > >> not
>>> > > >> > > sure
>>> > > >> > > > > why
>>> > > >> > > > > > > we
>>> > > >> > > > > > > > > >> > introduce
>>> > > >> > > > > > > > > >> > > > this
>>> > > >> > > > > > > > > >> > > > >> > additional config. Is it the same as
>>> > > >> > > > > > > > > >> > unclean.recovery.strategy=None?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 42.
>>> > > >> > > > DescribeTopicResponse.TopicAuthorizedOperations:
>>> > > >> > > > > > > Should
>>> > > >> > > > > > > > > >> this
>>> > > >> > > > > > > > > >> > be
>>> > > >> > > > > > > > > >> > > at
>>> > > >> > > > > > > > > >> > > > >> the
>>> > > >> > > > > > > > > >> > > > >> > topic level?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 43. "Limit: 20 topics max per
>>> request":
>>> > > >> Could
>>> > > >> > we
>>> > > >> > > > > > describe
>>> > > >> > > > > > > > > what
>>> > > >> > > > > > > > > >> > > > happens if
>>> > > >> > > > > > > > > >> > > > >> > the request includes more than 20
>>> > topics?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 44.
>>> ElectLeadersRequest.DesiredLeaders:
>>> > > >> Could
>>> > > >> > we
>>> > > >> > > > > > describe
>>> > > >> > > > > > > > > >> whether
>>> > > >> > > > > > > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > ordering matters?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 45.
>>> GetReplicaLogInfo.TopicPartitions:
>>> > > >> "about":
>>> > > >> > > > "The
>>> > > >> > > > > > > topic
>>> > > >> > > > > > > > > >> > > partitions
>>> > > >> > > > > > > > > >> > > > to
>>> > > >> > > > > > > > > >> > > > >> > elect leaders.": The description in
>>> > > "about"
>>> > > >> is
>>> > > >> > > > > > incorrect.
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 46. GetReplicaLogInfoResponse:
>>> Should we
>>> > > >> nest
>>> > > >> > > > > > partitions
>>> > > >> > > > > > > > > under
>>> > > >> > > > > > > > > >> > > > topicId to
>>> > > >> > > > > > > > > >> > > > >> > be consistent with other types of
>>> > > responses?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 47. kafka-leader-election.sh:
>>> > > >> > > > > > > > > >> > > > >> > 47.1 Could we explain DESIGNATION?
>>> > > >> > > > > > > > > >> > > > >> > 47.2 desiredLeader: Should it be a
>>> list
>>> > to
>>> > > >> > match
>>> > > >> > > > the
>>> > > >> > > > > > > field
>>> > > >> > > > > > > > in
>>> > > >> > > > > > > > > >> > > > >> > ElectLeadersRequest?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 48. We could add a section on
>>> downgrade?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > 49. LastKnownLeader: This seems only
>>> > > needed
>>> > > >> in
>>> > > >> > > the
>>> > > >> > > > > > first
>>> > > >> > > > > > > > > phase
>>> > > >> > > > > > > > > >> of
>>> > > >> > > > > > > > > >> > > > >> > delivering ELR. Will it be removed
>>> when
>>> > > the
>>> > > >> > > > complete
>>> > > >> > > > > > KIP
>>> > > >> > > > > > > is
>>> > > >> > > > > > > > > >> > > delivered?
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > Thanks,
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > Jun
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > On Tue, Sep 19, 2023 at 1:30 PM
>>> Colin
>>> > > >> McCabe <
>>> > > >> > > > > > > > > >> cmcc...@apache.org>
>>> > > >> > > > > > > > > >> > > > wrote:
>>> > > >> > > > > > > > > >> > > > >> >
>>> > > >> > > > > > > > > >> > > > >> > > Hi Calvin,
>>> > > >> > > > > > > > > >> > > > >> > >
>>> > > >> > > > > > > > > >> > > > >> > > Thanks for the explanations. I
>>> like
>>> > the
>>> > > >> idea
>>> > > >> > of
>>> > > >> > > > > using
>>> > > >> > > > > > > > none,
>>> > > >> > > > > > > > > >> > > > balanced,
>>> > > >> > > > > > > > > >> > > > >> > > aggressive. We also had an offline
>>> > > >> discussion
>>> > > >> > > > about
>>> > > >> > > > > > why
>>> > > >> > > > > > > > it
>>> > > >> > > > > > > > > is
>>> > > >> > > > > > > > > >> > good
>>> > > >> > > > > > > > > >> > > > to
>>> > > >> > > > > > > > > >> > > > >> > use a
>>> > > >> > > > > > > > > >> > > > >> > > new config key (basically, so
>>> that we
>>> > > can
>>> > > >> > > > deprecate
>>> > > >> > > > > > the
>>> > > >> > > > > > > > old
>>> > > >> > > > > > > > > >> one
>>> > > >> > > > > > > > > >> > > > which
>>> > > >> > > > > > > > > >> > > > >> had
>>> > > >> > > > > > > > > >> > > > >> > > only false/true values in 4.0)
>>> With
>>> > > these
>>> > > >> > > > changes,
>>> > > >> > > > > I
>>> > > >> > > > > > am
>>> > > >> > > > > > > > +1.
>>> > > >> > > > > > > > > >> > > > >> > >
>>> > > >> > > > > > > > > >> > > > >> > > best,
>>> > > >> > > > > > > > > >> > > > >> > > Colin
>>> > > >> > > > > > > > > >> > > > >> > >
>>> > > >> > > > > > > > > >> > > > >> > > On Mon, Sep 18, 2023, at 15:54,
>>> Calvin
>>> > > Liu
>>> > > >> > > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > > Hi Colin,
>>> > > >> > > > > > > > > >> > > > >> > > > Also, can we deprecate
>>> > > >> > > > > > unclean.leader.election.enable
>>> > > >> > > > > > > > in
>>> > > >> > > > > > > > > >> 4.0?
>>> > > >> > > > > > > > > >> > > > Before
>>> > > >> > > > > > > > > >> > > > >> > > that,
>>> > > >> > > > > > > > > >> > > > >> > > > we can have both the config
>>> > > >> > > > > > unclean.recovery.strategy
>>> > > >> > > > > > > > and
>>> > > >> > > > > > > > > >> > > > >> > > > unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > > > and using the
>>> > unclean.recovery.Enabled
>>> > > >> to
>>> > > >> > > > > determine
>>> > > >> > > > > > > > which
>>> > > >> > > > > > > > > >> > config
>>> > > >> > > > > > > > > >> > > > to
>>> > > >> > > > > > > > > >> > > > >> use
>>> > > >> > > > > > > > > >> > > > >> > > > during the unclean leader
>>> election.
>>> > > >> > > > > > > > > >> > > > >> > > >
>>> > > >> > > > > > > > > >> > > > >> > > > On Mon, Sep 18, 2023 at 3:51 PM
>>> > Calvin
>>> > > >> Liu
>>> > > >> > <
>>> > > >> > > > > > > > > >> > ca...@confluent.io>
>>> > > >> > > > > > > > > >> > > > >> wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >
>>> > > >> > > > > > > > > >> > > > >> > > >> Hi Colin,
>>> > > >> > > > > > > > > >> > > > >> > > >> For the
>>> unclean.recovery.strategy
>>> > > >> config
>>> > > >> > > name,
>>> > > >> > > > > how
>>> > > >> > > > > > > > about
>>> > > >> > > > > > > > > >> we
>>> > > >> > > > > > > > > >> > use
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > > >> following
>>> > > >> > > > > > > > > >> > > > >> > > >> None. It basically means no
>>> unclean
>>> > > >> > recovery
>>> > > >> > > > > will
>>> > > >> > > > > > be
>>> > > >> > > > > > > > > >> > performed.
>>> > > >> > > > > > > > > >> > > > >> > > >> Aggressive. It means
>>> availability
>>> > > goes
>>> > > >> > > first.
>>> > > >> > > > > > > Whenever
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > partition
>>> > > >> > > > > > > > > >> > > > >> > > can't
>>> > > >> > > > > > > > > >> > > > >> > > >> elect a durable replica, the
>>> > > controller
>>> > > >> > will
>>> > > >> > > > try
>>> > > >> > > > > > the
>>> > > >> > > > > > > > > >> unclean
>>> > > >> > > > > > > > > >> > > > >> recovery.
>>> > > >> > > > > > > > > >> > > > >> > > >> Balanced. It is the balance
>>> point
>>> > of
>>> > > >> the
>>> > > >> > > > > > > availability
>>> > > >> > > > > > > > > >> > > > >> > first(Aggressive)
>>> > > >> > > > > > > > > >> > > > >> > > >> and least availability(None).
>>> The
>>> > > >> > controller
>>> > > >> > > > > > > performs
>>> > > >> > > > > > > > > >> unclean
>>> > > >> > > > > > > > > >> > > > >> recovery
>>> > > >> > > > > > > > > >> > > > >> > > when
>>> > > >> > > > > > > > > >> > > > >> > > >> both ISR and ELR are empty.
>>> > > >> > > > > > > > > >> > > > >> > > >>
>>> > > >> > > > > > > > > >> > > > >> > > >>
>>> > > >> > > > > > > > > >> > > > >> > > >> On Fri, Sep 15, 2023 at
>>> 11:42 AM
>>> > > Calvin
>>> > > >> > Liu
>>> > > >> > > <
>>> > > >> > > > > > > > > >> > > ca...@confluent.io>
>>> > > >> > > > > > > > > >> > > > >> > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>
>>> > > >> > > > > > > > > >> > > > >> > > >>> Hi Colin,
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> > So, the proposal is that if
>>> > > someone
>>> > > >> > sets
>>> > > >> > > > > > > > > >> > > > >> > > "unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > > >>> = true"...
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> The idea is to use one of the
>>> > > >> > > > > > > > > >> unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > and
>>> > > >> > > > > > > > > >> > > > >> > > >>> unclean.recovery.strategy
>>> based on
>>> > > the
>>> > > >> > > > > > > > > >> > > > unclean.recovery.Enabled. A
>>> > > >> > > > > > > > > >> > > > >> > > possible
>>> > > >> > > > > > > > > >> > > > >> > > >>> version can be
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> If unclean.recovery.Enabled:
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> {
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> Check
>>> unclean.recovery.strategy.
>>> > If
>>> > > >> set,
>>> > > >> > > use
>>> > > >> > > > > it.
>>> > > >> > > > > > > > > >> Otherwise,
>>> > > >> > > > > > > > > >> > > > check
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> unclean.leader.election.enable and
>>> > > >> > > translate
>>> > > >> > > > it
>>> > > >> > > > > > to
>>> > > >> > > > > > > > > >> > > > >> > > >>> unclean.recovery.strategy.
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> } else {
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> Use
>>> unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> }
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> —--------
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> >The configuration key should
>>> be
>>> > > >> > > > > > > > > >> > > > >> "unclean.recovery.manager.enabled",
>>> > > >> > > > > > > > > >> > > > >> > > >>> right?
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> I think we have two ways of
>>> > > choosing a
>>> > > >> > > leader
>>> > > >> > > > > > > > > uncleanly,
>>> > > >> > > > > > > > > >> > > unclean
>>> > > >> > > > > > > > > >> > > > >> > leader
>>> > > >> > > > > > > > > >> > > > >> > > >>> election and unclean
>>> recovery(log
>>> > > >> > > inspection)
>>> > > >> > > > > and
>>> > > >> > > > > > > we
>>> > > >> > > > > > > > > try
>>> > > >> > > > > > > > > >> to
>>> > > >> > > > > > > > > >> > > > switch
>>> > > >> > > > > > > > > >> > > > >> > > between
>>> > > >> > > > > > > > > >> > > > >> > > >>> them.
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> Do you mean we want to
>>> develop two
>>> > > >> ways
>>> > > >> > of
>>> > > >> > > > > > > performing
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > unclean
>>> > > >> > > > > > > > > >> > > > >> > > >>> recovery and one of them is
>>> using
>>> > > >> > “unclean
>>> > > >> > > > > > recovery
>>> > > >> > > > > > > > > >> > manager”?
>>> > > >> > > > > > > > > >> > > I
>>> > > >> > > > > > > > > >> > > > >> guess
>>> > > >> > > > > > > > > >> > > > >> > > we
>>> > > >> > > > > > > > > >> > > > >> > > >>> haven’t discussed the second
>>> way.
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> —-------
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> >How do these 4 levels of
>>> > overrides
>>> > > >> > > interact
>>> > > >> > > > > with
>>> > > >> > > > > > > > your
>>> > > >> > > > > > > > > >> new
>>> > > >> > > > > > > > > >> > > > >> > > >>> configurations?
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> I do notice in the Kraft
>>> > controller
>>> > > >> code,
>>> > > >> > > the
>>> > > >> > > > > > > method
>>> > > >> > > > > > > > to
>>> > > >> > > > > > > > > >> > check
>>> > > >> > > > > > > > > >> > > > >> whether
>>> > > >> > > > > > > > > >> > > > >> > > >>> perform unclean leader
>>> election is
>>> > > >> hard
>>> > > >> > > coded
>>> > > >> > > > > to
>>> > > >> > > > > > > > false
>>> > > >> > > > > > > > > >> since
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > 2021(uncleanLeaderElectionEnabledForTopic).
>>> > > >> > > > > Isn’t
>>> > > >> > > > > > > it
>>> > > >> > > > > > > > a
>>> > > >> > > > > > > > > >> good
>>> > > >> > > > > > > > > >> > > > chance
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > >>> completely deprecate the
>>> > > >> > > > > > > > > unclean.leader.election.enable?
>>> > > >> > > > > > > > > >> We
>>> > > >> > > > > > > > > >> > > > don’t
>>> > > >> > > > > > > > > >> > > > >> > even
>>> > > >> > > > > > > > > >> > > > >> > > have
>>> > > >> > > > > > > > > >> > > > >> > > >>> to worry about the config
>>> > > conversion.
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> On the other hand, whatever
>>> the
>>> > > >> override
>>> > > >> > > is,
>>> > > >> > > > as
>>> > > >> > > > > > > long
>>> > > >> > > > > > > > as
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > >> > controller
>>> > > >> > > > > > > > > >> > > > >> > > >>> can have the final effective
>>> > > >> > > > > > > > > >> unclean.leader.election.enable,
>>> > > >> > > > > > > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > topic
>>> > > >> > > > > > > > > >> > > > >> > > >>> level config
>>> > > >> unclean.recovery.strategy,
>>> > > >> > the
>>> > > >> > > > > > cluster
>>> > > >> > > > > > > > > level
>>> > > >> > > > > > > > > >> > > config
>>> > > >> > > > > > > > > >> > > > >> > > >>> unclean.recovery.Enabled, the
>>> > > >> controller
>>> > > >> > > can
>>> > > >> > > > > > > > calculate
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > correct
>>> > > >> > > > > > > > > >> > > > >> > > methods
>>> > > >> > > > > > > > > >> > > > >> > > >>> to use right?
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>> On Fri, Sep 15, 2023 at
>>> 10:02 AM
>>> > > Colin
>>> > > >> > > > McCabe <
>>> > > >> > > > > > > > > >> > > > cmcc...@apache.org>
>>> > > >> > > > > > > > > >> > > > >> > > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> On Thu, Sep 14, 2023, at
>>> 22:23,
>>> > > >> Calvin
>>> > > >> > Liu
>>> > > >> > > > > > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > Hi Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > 1. I think using the new
>>> config
>>> > > >> name
>>> > > >> > is
>>> > > >> > > > more
>>> > > >> > > > > > > > clear.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >        a. The unclean
>>> leader
>>> > > >> election
>>> > > >> > is
>>> > > >> > > > > > > actually
>>> > > >> > > > > > > > > >> removed
>>> > > >> > > > > > > > > >> > > if
>>> > > >> > > > > > > > > >> > > > >> > unclean
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > recovery is in use.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >        b. Using multiple
>>> values
>>> > > in
>>> > > >> > > > > > > > > >> > > > >> unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > confusing and it will be
>>> more
>>> > > >> > confusing
>>> > > >> > > > > after
>>> > > >> > > > > > > > people
>>> > > >> > > > > > > > > >> > forget
>>> > > >> > > > > > > > > >> > > > >> about
>>> > > >> > > > > > > > > >> > > > >> > > this
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > discussion.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> Hi Calvin,
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> So, the proposal is that if
>>> > someone
>>> > > >> sets
>>> > > >> > > > > > > > > >> > > > >> > > "unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > > >>>> = true" but then sets one of
>>> your
>>> > > new
>>> > > >> > > > > > > > configurations,
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > value of
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> unclean.leader.election.enable is
>>> > > >> > ignored?
>>> > > >> > > > > That
>>> > > >> > > > > > > > seems
>>> > > >> > > > > > > > > >> less
>>> > > >> > > > > > > > > >> > > > clear
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > me, not
>>> > > >> > > > > > > > > >> > > > >> > > >>>> more. Just in general, having
>>> > > >> multiple
>>> > > >> > > > > > > configuration
>>> > > >> > > > > > > > > >> keys
>>> > > >> > > > > > > > > >> > to
>>> > > >> > > > > > > > > >> > > > >> control
>>> > > >> > > > > > > > > >> > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> same thing confuses users.
>>> > > Basically,
>>> > > >> > they
>>> > > >> > > > are
>>> > > >> > > > > > > > sitting
>>> > > >> > > > > > > > > >> at a
>>> > > >> > > > > > > > > >> > > > giant
>>> > > >> > > > > > > > > >> > > > >> > > control
>>> > > >> > > > > > > > > >> > > > >> > > >>>> panel, and some of the
>>> levers do
>>> > > >> > nothing.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > 2. Sorry I forgot to
>>> mention in
>>> > > the
>>> > > >> > > > response
>>> > > >> > > > > > > that
>>> > > >> > > > > > > > I
>>> > > >> > > > > > > > > >> did
>>> > > >> > > > > > > > > >> > add
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > unclean.recovery.Enabled
>>> flag.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> The configuration key should
>>> be
>>> > > >> > > > > > > > > >> > > > >> "unclean.recovery.manager.enabled",
>>> > > >> > > > > > > > > >> > > > >> > > >>>> right? Becuase we can do
>>> "unclean
>>> > > >> > > recovery"
>>> > > >> > > > > > > without
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > manager.
>>> > > >> > > > > > > > > >> > > > >> > > Disabling
>>> > > >> > > > > > > > > >> > > > >> > > >>>> the manager just means we
>>> use a
>>> > > >> > different
>>> > > >> > > > > > > mechanism
>>> > > >> > > > > > > > > for
>>> > > >> > > > > > > > > >> > > > recovery.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >        c. Maybe I
>>> > underestimated
>>> > > >> the
>>> > > >> > > > > challenge
>>> > > >> > > > > > > of
>>> > > >> > > > > > > > > >> > replacing
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> config. Any
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > implementation problems
>>> ahead?
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> There are four levels of
>>> > overrides
>>> > > >> for
>>> > > >> > > > > > > > > >> > > > >> > unclean.leader.election.enable.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> 1. static configuration for
>>> node.
>>> > > >> > > > > > > > > >> > > > >> > > >>>>     This goes in the
>>> > configuration
>>> > > >> file,
>>> > > >> > > > > > typically
>>> > > >> > > > > > > > > named
>>> > > >> > > > > > > > > >> > > > >> > > >>>> server.properties
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> 2. dynamic configuration for
>>> node
>>> > > >> > default
>>> > > >> > > > > > > > > >> > > > >> > > >>>>   ConfigResource(type=BROKER,
>>> > > >> name="")
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> 3. dynamic configuration for
>>> node
>>> > > >> > > > > > > > > >> > > > >> > > >>>>   ConfigResource(type=BROKER,
>>> > > >> > > > name=<controller
>>> > > >> > > > > > > id>)
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> 4. dynamic configuration for
>>> > topic
>>> > > >> > > > > > > > > >> > > > >> > > >>>>   ConfigResource(type=TOPIC,
>>> > > >> > > > > name=<topic-name>)
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> How do these 4 levels of
>>> > overrides
>>> > > >> > > interact
>>> > > >> > > > > with
>>> > > >> > > > > > > > your
>>> > > >> > > > > > > > > >> new
>>> > > >> > > > > > > > > >> > > > >> > > >>>> configurations? If the new
>>> > > >> > configurations
>>> > > >> > > > > > dominate
>>> > > >> > > > > > > > > over
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > old
>>> > > >> > > > > > > > > >> > > > >> > ones,
>>> > > >> > > > > > > > > >> > > > >> > > it
>>> > > >> > > > > > > > > >> > > > >> > > >>>> seems like this will get a
>>> lot
>>> > more
>>> > > >> > > > confusing
>>> > > >> > > > > to
>>> > > >> > > > > > > > > >> implement
>>> > > >> > > > > > > > > >> > > (and
>>> > > >> > > > > > > > > >> > > > >> also
>>> > > >> > > > > > > > > >> > > > >> > > to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> use.)
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> Again, I'd recommend just
>>> adding
>>> > > some
>>> > > >> > new
>>> > > >> > > > > values
>>> > > >> > > > > > > to
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> unclean.leader.election.enable.
>>> > > It's
>>> > > >> > > simple
>>> > > >> > > > > and
>>> > > >> > > > > > > will
>>> > > >> > > > > > > > > >> > prevent
>>> > > >> > > > > > > > > >> > > > user
>>> > > >> > > > > > > > > >> > > > >> > > confusion
>>> > > >> > > > > > > > > >> > > > >> > > >>>> (as well as developer
>>> confusion.)
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> best,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > 3. About the admin client,
>>> I
>>> > > >> > mentioned 3
>>> > > >> > > > > > changes
>>> > > >> > > > > > > > in
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > client.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> Anything
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > else I missed in the KIP?
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >       a. The client will
>>> switch
>>> > > to
>>> > > >> > using
>>> > > >> > > > the
>>> > > >> > > > > > new
>>> > > >> > > > > > > > RPC
>>> > > >> > > > > > > > > >> > > instead
>>> > > >> > > > > > > > > >> > > > of
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > MetadataRequest for the
>>> topics.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >       b. The
>>> TopicPartitionInfo
>>> > > >> used
>>> > > >> > in
>>> > > >> > > > > > > > > >> TopicDescription
>>> > > >> > > > > > > > > >> > > > needs
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > add
>>> > > >> > > > > > > > > >> > > > >> > > >>>> new
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > fields related to the ELR.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >       c. The outputs will
>>> add
>>> > the
>>> > > >> ELR
>>> > > >> > > > > related
>>> > > >> > > > > > > > > fields.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> > On Thu, Sep 14, 2023 at
>>> 9:19 PM
>>> > > >> Colin
>>> > > >> > > > > McCabe <
>>> > > >> > > > > > > > > >> > > > >> cmcc...@apache.org>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> Hi Calvin,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> Thanks for the changes.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> 1. Earlier I commented
>>> that
>>> > > >> creating
>>> > > >> > > > > > > > > >> > > > >> "unclean.recovery.strategy "
>>> > > >> > > > > > > > > >> > > > >> > > is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> not
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> necessary, and we can just
>>> > reuse
>>> > > >> the
>>> > > >> > > > > existing
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > "unclean.leader.election.enable"
>>> > > >> > > > > > configuration
>>> > > >> > > > > > > > key.
>>> > > >> > > > > > > > > >> > Let's
>>> > > >> > > > > > > > > >> > > > >> discuss
>>> > > >> > > > > > > > > >> > > > >> > > >>>> that.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> 2.I also don't understand
>>> why
>>> > > you
>>> > > >> > > didn't
>>> > > >> > > > > add
>>> > > >> > > > > > a
>>> > > >> > > > > > > > > >> > > > configuration to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> enable or
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> disable the Unclean
>>> Recovery
>>> > > >> Manager.
>>> > > >> > > > This
>>> > > >> > > > > > > seems
>>> > > >> > > > > > > > > >> like a
>>> > > >> > > > > > > > > >> > > very
>>> > > >> > > > > > > > > >> > > > >> > simple
>>> > > >> > > > > > > > > >> > > > >> > > >>>> way to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> handle the staging issue
>>> which
>>> > > we
>>> > > >> > > > > discussed.
>>> > > >> > > > > > > The
>>> > > >> > > > > > > > > URM
>>> > > >> > > > > > > > > >> can
>>> > > >> > > > > > > > > >> > > > just
>>> > > >> > > > > > > > > >> > > > >> be
>>> > > >> > > > > > > > > >> > > > >> > > >>>> turned off
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> until it is production
>>> ready.
>>> > > >> Let's
>>> > > >> > > > discuss
>>> > > >> > > > > > > this.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> 3. You still need to
>>> describe
>>> > > the
>>> > > >> > > changes
>>> > > >> > > > > to
>>> > > >> > > > > > > > > >> AdminClient
>>> > > >> > > > > > > > > >> > > > that
>>> > > >> > > > > > > > > >> > > > >> are
>>> > > >> > > > > > > > > >> > > > >> > > >>>> needed
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> to use
>>> DescribeTopicRequest.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> Keep at it. It's looking
>>> > better.
>>> > > >> :)
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> best,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> On Thu, Sep 14, 2023, at
>>> > 11:03,
>>> > > >> > Calvin
>>> > > >> > > > Liu
>>> > > >> > > > > > > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> > Hi Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> > Thanks for the comments!
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> > I did the following
>>> changes
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    1.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    Simplified the API
>>> spec
>>> > > >> section
>>> > > >> > to
>>> > > >> > > > > only
>>> > > >> > > > > > > > > include
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > diff.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    2.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    Reordered the HWM
>>> > > requirement
>>> > > >> > > > section.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    3.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    Removed the URM
>>> > > >> implementation
>>> > > >> > > > details
>>> > > >> > > > > > to
>>> > > >> > > > > > > > keep
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > >> necessary
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    characteristics to
>>> > perform
>>> > > >> the
>>> > > >> > > > unclean
>>> > > >> > > > > > > > > recovery.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    1.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       When to perform
>>> the
>>> > > >> unclean
>>> > > >> > > > > recovery
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       2.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       Under different
>>> > config,
>>> > > >> how
>>> > > >> > the
>>> > > >> > > > > > unclean
>>> > > >> > > > > > > > > >> recovery
>>> > > >> > > > > > > > > >> > > > finds
>>> > > >> > > > > > > > > >> > > > >> > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> leader.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       3.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       How the config
>>> > > >> > > > > > > > > unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > and
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > >  unclean.recovery.strategy
>>> > > >> are
>>> > > >> > > > > > converted
>>> > > >> > > > > > > > > when
>>> > > >> > > > > > > > > >> > users
>>> > > >> > > > > > > > > >> > > > >> > > >>>> enable/disable
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       unclean recovery.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       4.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    More details about
>>> how we
>>> > > >> change
>>> > > >> > > > admin
>>> > > >> > > > > > > > client.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    5.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    API limits on the
>>> > > >> > > > > > GetReplicaLogInfoRequest
>>> > > >> > > > > > > > and
>>> > > >> > > > > > > > > >> > > > >> > > >>>> DescribeTopicRequest.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    6.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    Two metrics added
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >    1.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > >
>>> > > >> > Kafka.controller.global_under_min_isr_partition_count
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >       2.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >>
>>> kafka.controller.unclean_recovery_finished_count
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> > On Wed, Sep 13, 2023 at
>>> > > 10:46 AM
>>> > > >> > > Colin
>>> > > >> > > > > > > McCabe <
>>> > > >> > > > > > > > > >> > > > >> > > cmcc...@apache.org>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> On Tue, Sep 12, 2023,
>>> at
>>> > > 17:21,
>>> > > >> > > Calvin
>>> > > >> > > > > Liu
>>> > > >> > > > > > > > > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Hi Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Thanks for the
>>> comments!
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Hi Calvin,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Thanks again for the
>>> KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> One meta-comment: it's
>>> > > usually
>>> > > >> > > better
>>> > > >> > > > to
>>> > > >> > > > > > > just
>>> > > >> > > > > > > > > do a
>>> > > >> > > > > > > > > >> > diff
>>> > > >> > > > > > > > > >> > > > on a
>>> > > >> > > > > > > > > >> > > > >> > > >>>> message
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> spec
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> file or java file if
>>> you're
>>> > > >> > > including
>>> > > >> > > > > > > changes
>>> > > >> > > > > > > > to
>>> > > >> > > > > > > > > >> it
>>> > > >> > > > > > > > > >> > in
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> This is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> easier to read than
>>> looking
>>> > > for
>>> > > >> > "new
>>> > > >> > > > > > fields
>>> > > >> > > > > > > > > begin"
>>> > > >> > > > > > > > > >> > etc.
>>> > > >> > > > > > > > > >> > > > in
>>> > > >> > > > > > > > > >> > > > >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> text, and
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> gracefully handles the
>>> case
>>> > > >> where
>>> > > >> > > > > existing
>>> > > >> > > > > > > > > fields
>>> > > >> > > > > > > > > >> > were
>>> > > >> > > > > > > > > >> > > > >> > changed.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Rewrite the
>>> Additional
>>> > High
>>> > > >> > > > Watermark
>>> > > >> > > > > > > > > >> advancement
>>> > > >> > > > > > > > > >> > > > >> > requirement
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > There was feedback on
>>> > this
>>> > > >> > section
>>> > > >> > > > > that
>>> > > >> > > > > > > some
>>> > > >> > > > > > > > > >> > readers
>>> > > >> > > > > > > > > >> > > > may
>>> > > >> > > > > > > > > >> > > > >> not
>>> > > >> > > > > > > > > >> > > > >> > > be
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> familiar
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > with HWM and
>>> Ack=0,1,all
>>> > > >> > requests.
>>> > > >> > > > > This
>>> > > >> > > > > > > can
>>> > > >> > > > > > > > > help
>>> > > >> > > > > > > > > >> > them
>>> > > >> > > > > > > > > >> > > > >> > > understand
>>> > > >> > > > > > > > > >> > > > >> > > >>>> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > proposal. I will
>>> rewrite
>>> > > this
>>> > > >> > part
>>> > > >> > > > for
>>> > > >> > > > > > > more
>>> > > >> > > > > > > > > >> > > > readability.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> To be clear, I wasn't
>>> > > >> suggesting
>>> > > >> > > > > dropping
>>> > > >> > > > > > > > either
>>> > > >> > > > > > > > > >> > > > section. I
>>> > > >> > > > > > > > > >> > > > >> > > agree
>>> > > >> > > > > > > > > >> > > > >> > > >>>> that
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> they add useful
>>> > background. I
>>> > > >> was
>>> > > >> > > just
>>> > > >> > > > > > > > > suggesting
>>> > > >> > > > > > > > > >> > that
>>> > > >> > > > > > > > > >> > > we
>>> > > >> > > > > > > > > >> > > > >> > should
>>> > > >> > > > > > > > > >> > > > >> > > >>>> discuss
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> the "acks" setting
>>> AFTER
>>> > > >> > discussing
>>> > > >> > > > the
>>> > > >> > > > > > new
>>> > > >> > > > > > > > high
>>> > > >> > > > > > > > > >> > > > watermark
>>> > > >> > > > > > > > > >> > > > >> > > >>>> advancement
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> conditions. We also
>>> should
>>> > > >> discuss
>>> > > >> > > > > acks=0.
>>> > > >> > > > > > > > While
>>> > > >> > > > > > > > > >> it
>>> > > >> > > > > > > > > >> > > isn't
>>> > > >> > > > > > > > > >> > > > >> > > >>>> conceptually
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> much
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> different than acks=1
>>> here,
>>> > > its
>>> > > >> > > > omission
>>> > > >> > > > > > > from
>>> > > >> > > > > > > > > this
>>> > > >> > > > > > > > > >> > > > section
>>> > > >> > > > > > > > > >> > > > >> is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> confusing.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Unclean recovery
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > The plan is to
>>> replace
>>> > the
>>> > > >> > > > > > > > > >> > > > unclean.leader.election.enable
>>> > > >> > > > > > > > > >> > > > >> > with
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > unclean.recovery.strategy.
>>> > > If
>>> > > >> > the
>>> > > >> > > > > > Unclean
>>> > > >> > > > > > > > > >> Recovery
>>> > > >> > > > > > > > > >> > is
>>> > > >> > > > > > > > > >> > > > >> > enabled
>>> > > >> > > > > > > > > >> > > > >> > > >>>> then it
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> deals
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > with the three
>>> options in
>>> > > the
>>> > > >> > > > > > > > > >> > > > unclean.recovery.strategy.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Let’s refine the
>>> Unclean
>>> > > >> > Recovery.
>>> > > >> > > > We
>>> > > >> > > > > > have
>>> > > >> > > > > > > > > >> already
>>> > > >> > > > > > > > > >> > > > taken a
>>> > > >> > > > > > > > > >> > > > >> > > lot of
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > suggestions and I
>>> hope to
>>> > > >> > enhance
>>> > > >> > > > the
>>> > > >> > > > > > > > > >> durability of
>>> > > >> > > > > > > > > >> > > > Kafka
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> next
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> level
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > with this KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> I am OK with doing the
>>> > > unclean
>>> > > >> > > leader
>>> > > >> > > > > > > recovery
>>> > > >> > > > > > > > > >> > > > improvements
>>> > > >> > > > > > > > > >> > > > >> in
>>> > > >> > > > > > > > > >> > > > >> > > >>>> this KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> However, I think we
>>> need to
>>> > > >> really
>>> > > >> > > > work
>>> > > >> > > > > on
>>> > > >> > > > > > > the
>>> > > >> > > > > > > > > >> > > > configuration
>>> > > >> > > > > > > > > >> > > > >> > > >>>> settings.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Configuration
>>> overrides are
>>> > > >> often
>>> > > >> > > > quite
>>> > > >> > > > > > > messy.
>>> > > >> > > > > > > > > For
>>> > > >> > > > > > > > > >> > > > example,
>>> > > >> > > > > > > > > >> > > > >> > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> cases
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> where we have
>>> > log.roll.hours
>>> > > >> and
>>> > > >> > > > > > > > > >> log.roll.segment.ms
>>> > > >> > > > > > > > > >> > ,
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > user
>>> > > >> > > > > > > > > >> > > > >> > > >>>> has to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> remember which one
>>> takes
>>> > > >> > precedence,
>>> > > >> > > > and
>>> > > >> > > > > > it
>>> > > >> > > > > > > is
>>> > > >> > > > > > > > > not
>>> > > >> > > > > > > > > >> > > > obvious.
>>> > > >> > > > > > > > > >> > > > >> > So,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> rather
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> than
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> creating a new
>>> > configuration,
>>> > > >> why
>>> > > >> > > not
>>> > > >> > > > > add
>>> > > >> > > > > > > > > >> additional
>>> > > >> > > > > > > > > >> > > > values
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> "unclean.leader.election.enable"?
>>> > > >> > I
>>> > > >> > > > > think
>>> > > >> > > > > > > this
>>> > > >> > > > > > > > > >> will
>>> > > >> > > > > > > > > >> > be
>>> > > >> > > > > > > > > >> > > > >> simpler
>>> > > >> > > > > > > > > >> > > > >> > > for
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> people
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> to understand, and
>>> simpler
>>> > in
>>> > > >> the
>>> > > >> > > code
>>> > > >> > > > > as
>>> > > >> > > > > > > > well.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> What if we continued
>>> to use
>>> > > >> > > > > > > > > >> > > > "unclean.leader.election.enable"
>>> > > >> > > > > > > > > >> > > > >> > but
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> extended
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> it so that it took a
>>> > string?
>>> > > >> Then
>>> > > >> > > the
>>> > > >> > > > > > string
>>> > > >> > > > > > > > > could
>>> > > >> > > > > > > > > >> > have
>>> > > >> > > > > > > > > >> > > > >> these
>>> > > >> > > > > > > > > >> > > > >> > > >>>> values:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> never
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>     never
>>> automatically do
>>> > an
>>> > > >> > > unclean
>>> > > >> > > > > > leader
>>> > > >> > > > > > > > > >> election
>>> > > >> > > > > > > > > >> > > > under
>>> > > >> > > > > > > > > >> > > > >> > any
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> conditions
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> false / default
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>     only do an unclean
>>> > leader
>>> > > >> > > election
>>> > > >> > > > > if
>>> > > >> > > > > > > > there
>>> > > >> > > > > > > > > >> may
>>> > > >> > > > > > > > > >> > be
>>> > > >> > > > > > > > > >> > > > >> > possible
>>> > > >> > > > > > > > > >> > > > >> > > >>>> data
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> loss
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> true / always
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>     always do an
>>> unclean
>>> > > leader
>>> > > >> > > > election
>>> > > >> > > > > > if
>>> > > >> > > > > > > we
>>> > > >> > > > > > > > > >> can't
>>> > > >> > > > > > > > > >> > > > >> > immediately
>>> > > >> > > > > > > > > >> > > > >> > > >>>> elect a
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> leader
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> It's a bit awkward that
>>> > false
>>> > > >> maps
>>> > > >> > > to
>>> > > >> > > > > > > default
>>> > > >> > > > > > > > > >> rather
>>> > > >> > > > > > > > > >> > > > than to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> never. But
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> this awkwardness
>>> exists if
>>> > we
>>> > > >> use
>>> > > >> > > two
>>> > > >> > > > > > > > different
>>> > > >> > > > > > > > > >> > > > >> configuration
>>> > > >> > > > > > > > > >> > > > >> > > keys
>>> > > >> > > > > > > > > >> > > > >> > > >>>> as
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> well.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> The reason for the
>>> > > awkwardness
>>> > > >> is
>>> > > >> > > that
>>> > > >> > > > > we
>>> > > >> > > > > > > > simply
>>> > > >> > > > > > > > > >> > don't
>>> > > >> > > > > > > > > >> > > > want
>>> > > >> > > > > > > > > >> > > > >> > most
>>> > > >> > > > > > > > > >> > > > >> > > >>>> of the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> people currently
>>> setting
>>> > > >> > > > > > > > > >> > > > >> unclean.leader.election.enable=false
>>> > > >> > > > > > > > > >> > > > >> > to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> get the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> "never" behavior. We
>>> have
>>> > to
>>> > > >> bite
>>> > > >> > > that
>>> > > >> > > > > > > bullet.
>>> > > >> > > > > > > > > >> Better
>>> > > >> > > > > > > > > >> > > to
>>> > > >> > > > > > > > > >> > > > be
>>> > > >> > > > > > > > > >> > > > >> > > clear
>>> > > >> > > > > > > > > >> > > > >> > > >>>> and
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> explicit than hide it.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Another thing that's a
>>> bit
>>> > > >> awkward
>>> > > >> > > is
>>> > > >> > > > > > having
>>> > > >> > > > > > > > two
>>> > > >> > > > > > > > > >> > > > different
>>> > > >> > > > > > > > > >> > > > >> > ways
>>> > > >> > > > > > > > > >> > > > >> > > to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> do
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> unclean leader election
>>> > > >> specified
>>> > > >> > in
>>> > > >> > > > the
>>> > > >> > > > > > > KIP.
>>> > > >> > > > > > > > > You
>>> > > >> > > > > > > > > >> > > > descirbe
>>> > > >> > > > > > > > > >> > > > >> two
>>> > > >> > > > > > > > > >> > > > >> > > >>>> methods:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> simple "choose the last
>>> > > leader"
>>> > > >> > > > method,
>>> > > >> > > > > > and
>>> > > >> > > > > > > > the
>>> > > >> > > > > > > > > >> > > "unclean
>>> > > >> > > > > > > > > >> > > > >> > > recovery
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> manager"
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> method. I understand
>>> why
>>> > you
>>> > > >> did
>>> > > >> > it
>>> > > >> > > > this
>>> > > >> > > > > > way
>>> > > >> > > > > > > > --
>>> > > >> > > > > > > > > >> > "choose
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > last
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> leader" is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> simple, and will help
>>> us
>>> > > >> deliver
>>> > > >> > an
>>> > > >> > > > > > > > > implementation
>>> > > >> > > > > > > > > >> > > > quickly,
>>> > > >> > > > > > > > > >> > > > >> > > while
>>> > > >> > > > > > > > > >> > > > >> > > >>>> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> URM
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> is preferable in the
>>> long
>>> > > >> term. My
>>> > > >> > > > > > > suggestion
>>> > > >> > > > > > > > > >> here is
>>> > > >> > > > > > > > > >> > > to
>>> > > >> > > > > > > > > >> > > > >> > > separate
>>> > > >> > > > > > > > > >> > > > >> > > >>>> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> decision of HOW to do
>>> > unclean
>>> > > >> > leader
>>> > > >> > > > > > > election
>>> > > >> > > > > > > > > from
>>> > > >> > > > > > > > > >> > the
>>> > > >> > > > > > > > > >> > > > >> > decision
>>> > > >> > > > > > > > > >> > > > >> > > of
>>> > > >> > > > > > > > > >> > > > >> > > >>>> WHEN
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> do it.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> So in other words, have
>>> > > >> > > > > > > > > >> > > "unclean.leader.election.enable"
>>> > > >> > > > > > > > > >> > > > >> > specify
>>> > > >> > > > > > > > > >> > > > >> > > >>>> when we
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> do unclean leader
>>> election,
>>> > > and
>>> > > >> > > have a
>>> > > >> > > > > new
>>> > > >> > > > > > > > > >> > > configuration
>>> > > >> > > > > > > > > >> > > > >> like
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> "unclean.recovery.manager.enable"
>>> > > >> > to
>>> > > >> > > > > > > determine
>>> > > >> > > > > > > > > if
>>> > > >> > > > > > > > > >> we
>>> > > >> > > > > > > > > >> > > use
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> > > URM.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Presumably the URM will
>>> > take
>>> > > >> some
>>> > > >> > > time
>>> > > >> > > > > to
>>> > > >> > > > > > > get
>>> > > >> > > > > > > > > >> fully
>>> > > >> > > > > > > > > >> > > > stable,
>>> > > >> > > > > > > > > >> > > > >> so
>>> > > >> > > > > > > > > >> > > > >> > > >>>> this can
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> default to false for a
>>> > while,
>>> > > >> and
>>> > > >> > we
>>> > > >> > > > can
>>> > > >> > > > > > > flip
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > default to
>>> > > >> > > > > > > > > >> > > > >> > > true
>>> > > >> > > > > > > > > >> > > > >> > > >>>> when
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> we
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> feel ready.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> The URM is somewhat
>>> > > >> > under-described
>>> > > >> > > > > here.
>>> > > >> > > > > > I
>>> > > >> > > > > > > > > think
>>> > > >> > > > > > > > > >> we
>>> > > >> > > > > > > > > >> > > > need a
>>> > > >> > > > > > > > > >> > > > >> > few
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> configurations here
>>> for it.
>>> > > For
>>> > > >> > > > example,
>>> > > >> > > > > > we
>>> > > >> > > > > > > > > need a
>>> > > >> > > > > > > > > >> > > > >> > > configuration to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> specify
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> how long it should wait
>>> > for a
>>> > > >> > broker
>>> > > >> > > > to
>>> > > >> > > > > > > > respond
>>> > > >> > > > > > > > > to
>>> > > >> > > > > > > > > >> > its
>>> > > >> > > > > > > > > >> > > > RPCs
>>> > > >> > > > > > > > > >> > > > >> > > before
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> moving
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> on. We also need to
>>> > > understand
>>> > > >> how
>>> > > >> > > the
>>> > > >> > > > > URM
>>> > > >> > > > > > > > > >> interacts
>>> > > >> > > > > > > > > >> > > with
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > unclean.leader.election.enable=always. I
>>> > > >> > > > > > > > assume
>>> > > >> > > > > > > > > >> that
>>> > > >> > > > > > > > > >> > > with
>>> > > >> > > > > > > > > >> > > > >> > > "always"
>>> > > >> > > > > > > > > >> > > > >> > > >>>> we
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> will
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> just unconditionally
>>> use
>>> > the
>>> > > >> URM
>>> > > >> > > > rather
>>> > > >> > > > > > than
>>> > > >> > > > > > > > > >> choosing
>>> > > >> > > > > > > > > >> > > > >> > randomly.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> But this
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> should be spelled out
>>> in
>>> > the
>>> > > >> KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > DescribeTopicRequest
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    1.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    Yes, the plan is
>>> to
>>> > > >> replace
>>> > > >> > the
>>> > > >> > > > > > > > > >> MetadataRequest
>>> > > >> > > > > > > > > >> > > with
>>> > > >> > > > > > > > > >> > > > >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> DescribeTopicRequest
>>> > for
>>> > > >> the
>>> > > >> > > > admin
>>> > > >> > > > > > > > clients.
>>> > > >> > > > > > > > > >> Will
>>> > > >> > > > > > > > > >> > > > check
>>> > > >> > > > > > > > > >> > > > >> > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> details.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Sounds good. But as I
>>> said,
>>> > > you
>>> > > >> > need
>>> > > >> > > > to
>>> > > >> > > > > > > > specify
>>> > > >> > > > > > > > > >> how
>>> > > >> > > > > > > > > >> > > > >> > AdminClient
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> interacts
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> with the new request.
>>> This
>>> > > will
>>> > > >> > > > involve
>>> > > >> > > > > > > adding
>>> > > >> > > > > > > > > >> some
>>> > > >> > > > > > > > > >> > > > fields
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> TopicDescription.java.
>>> And
>>> > > you
>>> > > >> > need
>>> > > >> > > to
>>> > > >> > > > > > > specify
>>> > > >> > > > > > > > > the
>>> > > >> > > > > > > > > >> > > > changes
>>> > > >> > > > > > > > > >> > > > >> to
>>> > > >> > > > > > > > > >> > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> kafka-topics.sh command
>>> > line
>>> > > >> tool.
>>> > > >> > > > > > Otherwise
>>> > > >> > > > > > > > we
>>> > > >> > > > > > > > > >> > cannot
>>> > > >> > > > > > > > > >> > > > use
>>> > > >> > > > > > > > > >> > > > >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> tool to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> see
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> the new information.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> The new requests,
>>> > > >> > > DescribeTopicRequest
>>> > > >> > > > > and
>>> > > >> > > > > > > > > >> > > > >> > > >>>> GetReplicaLogInfoRequest,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> need
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> to have limits placed
>>> on
>>> > them
>>> > > >> so
>>> > > >> > > that
>>> > > >> > > > > > their
>>> > > >> > > > > > > > size
>>> > > >> > > > > > > > > >> > can't
>>> > > >> > > > > > > > > >> > > be
>>> > > >> > > > > > > > > >> > > > >> > > >>>> infinite. We
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> don't want to
>>> propagate the
>>> > > >> > current
>>> > > >> > > > > > problems
>>> > > >> > > > > > > > of
>>> > > >> > > > > > > > > >> > > > >> > MetadataRequest,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> where
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> clients can request
>>> massive
>>> > > >> > > responses
>>> > > >> > > > > that
>>> > > >> > > > > > > can
>>> > > >> > > > > > > > > >> mess
>>> > > >> > > > > > > > > >> > up
>>> > > >> > > > > > > > > >> > > > the
>>> > > >> > > > > > > > > >> > > > >> JVM
>>> > > >> > > > > > > > > >> > > > >> > > when
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> handled.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Adding limits is
>>> simple for
>>> > > >> > > > > > > > > >> GetReplicaLogInfoRequest
>>> > > >> > > > > > > > > >> > --
>>> > > >> > > > > > > > > >> > > > we
>>> > > >> > > > > > > > > >> > > > >> can
>>> > > >> > > > > > > > > >> > > > >> > > >>>> just say
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> that only 2000
>>> partitions
>>> > at
>>> > > a
>>> > > >> > time
>>> > > >> > > > can
>>> > > >> > > > > be
>>> > > >> > > > > > > > > >> requested.
>>> > > >> > > > > > > > > >> > > For
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> DescribeTopicRequest
>>> we can
>>> > > >> > probably
>>> > > >> > > > > just
>>> > > >> > > > > > > > limit
>>> > > >> > > > > > > > > >> to 20
>>> > > >> > > > > > > > > >> > > > topics
>>> > > >> > > > > > > > > >> > > > >> > or
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> something
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> like that, to avoid the
>>> > > >> complexity
>>> > > >> > > of
>>> > > >> > > > > > doing
>>> > > >> > > > > > > > > >> > pagination
>>> > > >> > > > > > > > > >> > > in
>>> > > >> > > > > > > > > >> > > > >> this
>>> > > >> > > > > > > > > >> > > > >> > > KIP.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    2.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    I can let the
>>> broker
>>> > > load
>>> > > >> the
>>> > > >> > > ELR
>>> > > >> > > > > > info
>>> > > >> > > > > > > so
>>> > > >> > > > > > > > > >> that
>>> > > >> > > > > > > > > >> > > they
>>> > > >> > > > > > > > > >> > > > can
>>> > > >> > > > > > > > > >> > > > >> > > serve
>>> > > >> > > > > > > > > >> > > > >> > > >>>> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> DescribeTopicRequest
>>> > as
>>> > > >> well.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Yes, it's fine to add
>>> to
>>> > > >> > > > MetadataCache.
>>> > > >> > > > > In
>>> > > >> > > > > > > > fact,
>>> > > >> > > > > > > > > >> > you'll
>>> > > >> > > > > > > > > >> > > > be
>>> > > >> > > > > > > > > >> > > > >> > > loading
>>> > > >> > > > > > > > > >> > > > >> > > >>>> it
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> anyway once it's added
>>> to
>>> > > >> > > > > PartitionImage.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    3.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >    Yeah, it does not
>>> make
>>> > > >> sense
>>> > > >> > to
>>> > > >> > > > > have
>>> > > >> > > > > > > the
>>> > > >> > > > > > > > > >> topic
>>> > > >> > > > > > > > > >> > id
>>> > > >> > > > > > > > > >> > > if
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> DescribeTopicRequest
>>> > is
>>> > > >> only
>>> > > >> > > used
>>> > > >> > > > > by
>>> > > >> > > > > > > the
>>> > > >> > > > > > > > > >> admin
>>> > > >> > > > > > > > > >> > > > client.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> OK. That makes things
>>> > > simpler.
>>> > > >> We
>>> > > >> > > can
>>> > > >> > > > > > always
>>> > > >> > > > > > > > > >> create a
>>> > > >> > > > > > > > > >> > > new
>>> > > >> > > > > > > > > >> > > > >> API
>>> > > >> > > > > > > > > >> > > > >> > > later
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> (hopefully not in this
>>> > KIP!)
>>> > > to
>>> > > >> > > query
>>> > > >> > > > by
>>> > > >> > > > > > > topic
>>> > > >> > > > > > > > > ID.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Metrics
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > As for overall
>>> cluster
>>> > > health
>>> > > >> > > > > metrics, I
>>> > > >> > > > > > > > think
>>> > > >> > > > > > > > > >> > > > >> under-min-ISR
>>> > > >> > > > > > > > > >> > > > >> > > is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> still
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> a
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > useful one. ELR is
>>> more
>>> > > like
>>> > > >> a
>>> > > >> > > > safety
>>> > > >> > > > > > > belt.
>>> > > >> > > > > > > > > When
>>> > > >> > > > > > > > > >> > the
>>> > > >> > > > > > > > > >> > > > ELR
>>> > > >> > > > > > > > > >> > > > >> is
>>> > > >> > > > > > > > > >> > > > >> > > >>>> used, the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > cluster availability
>>> has
>>> > > >> already
>>> > > >> > > > been
>>> > > >> > > > > > > > > impacted.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Maybe we can have a
>>> > metric
>>> > > to
>>> > > >> > > count
>>> > > >> > > > > the
>>> > > >> > > > > > > > > >> partitions
>>> > > >> > > > > > > > > >> > > that
>>> > > >> > > > > > > > > >> > > > >> > > sum(ISR,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> ELR)
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> <
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> min
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > ISR. What do you
>>> think?
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> How about:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> A.  a metric for the
>>> totoal
>>> > > >> number
>>> > > >> > > of
>>> > > >> > > > > > > > > >> under-min-isr
>>> > > >> > > > > > > > > >> > > > >> > partitions?
>>> > > >> > > > > > > > > >> > > > >> > > We
>>> > > >> > > > > > > > > >> > > > >> > > >>>> don't
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> have that in Apache
>>> Kafka
>>> > at
>>> > > >> the
>>> > > >> > > > moment.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> B. a metric for the
>>> number
>>> > of
>>> > > >> > > unclean
>>> > > >> > > > > > leader
>>> > > >> > > > > > > > > >> > elections
>>> > > >> > > > > > > > > >> > > we
>>> > > >> > > > > > > > > >> > > > >> did
>>> > > >> > > > > > > > > >> > > > >> > > (for
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> simplicity, it can
>>> reset
>>> > to 0
>>> > > >> on
>>> > > >> > > > > > controller
>>> > > >> > > > > > > > > >> restart:
>>> > > >> > > > > > > > > >> > we
>>> > > >> > > > > > > > > >> > > > >> expect
>>> > > >> > > > > > > > > >> > > > >> > > >>>> people to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> monitor the change over
>>> > time
>>> > > >> > anyway)
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> best,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > Yeah, for the ongoing
>>> > > unclean
>>> > > >> > > > > > recoveries,
>>> > > >> > > > > > > > the
>>> > > >> > > > > > > > > >> > > > controller
>>> > > >> > > > > > > > > >> > > > >> can
>>> > > >> > > > > > > > > >> > > > >> > > >>>> keep an
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > accurate count
>>> through
>>> > > >> failover
>>> > > >> > > > > because
>>> > > >> > > > > > > > > >> partition
>>> > > >> > > > > > > > > >> > > > >> > registration
>>> > > >> > > > > > > > > >> > > > >> > > >>>> can
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> indicate
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > whether a recovery is
>>> > > needed.
>>> > > >> > > > However,
>>> > > >> > > > > > for
>>> > > >> > > > > > > > the
>>> > > >> > > > > > > > > >> > > happened
>>> > > >> > > > > > > > > >> > > > >> > ones,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> unless
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> we
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > want to persist the
>>> > number
>>> > > >> > > > somewhere,
>>> > > >> > > > > we
>>> > > >> > > > > > > can
>>> > > >> > > > > > > > > >> only
>>> > > >> > > > > > > > > >> > > > figure
>>> > > >> > > > > > > > > >> > > > >> it
>>> > > >> > > > > > > > > >> > > > >> > > out
>>> > > >> > > > > > > > > >> > > > >> > > >>>> from
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > log.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> > On Tue, Sep 12, 2023
>>> at
>>> > > >> 3:16 PM
>>> > > >> > > > Colin
>>> > > >> > > > > > > > McCabe <
>>> > > >> > > > > > > > > >> > > > >> > > cmcc...@apache.org
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> Also, we should have
>>> > > metrics
>>> > > >> > that
>>> > > >> > > > > show
>>> > > >> > > > > > > what
>>> > > >> > > > > > > > > is
>>> > > >> > > > > > > > > >> > going
>>> > > >> > > > > > > > > >> > > > on
>>> > > >> > > > > > > > > >> > > > >> > with
>>> > > >> > > > > > > > > >> > > > >> > > >>>> regard
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> to
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> eligible replica
>>> set.
>>> > I'm
>>> > > >> not
>>> > > >> > > sure
>>> > > >> > > > > > > exactly
>>> > > >> > > > > > > > > >> what to
>>> > > >> > > > > > > > > >> > > > >> suggest,
>>> > > >> > > > > > > > > >> > > > >> > > but
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> something
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> that could identify
>>> when
>>> > > >> things
>>> > > >> > > are
>>> > > >> > > > > > going
>>> > > >> > > > > > > > > >> wrong in
>>> > > >> > > > > > > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > clsuter.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> For example, maybe a
>>> > > metric
>>> > > >> for
>>> > > >> > > > > > > partitions
>>> > > >> > > > > > > > > >> > > containing
>>> > > >> > > > > > > > > >> > > > >> > > replicas
>>> > > >> > > > > > > > > >> > > > >> > > >>>> that
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> are
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> ineligible to be
>>> leader?
>>> > > >> That
>>> > > >> > > would
>>> > > >> > > > > > show
>>> > > >> > > > > > > a
>>> > > >> > > > > > > > > >> spike
>>> > > >> > > > > > > > > >> > > when
>>> > > >> > > > > > > > > >> > > > a
>>> > > >> > > > > > > > > >> > > > >> > > broker
>>> > > >> > > > > > > > > >> > > > >> > > >>>> had an
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> unclean restart.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> Ideally, we'd also
>>> have
>>> > a
>>> > > >> > metric
>>> > > >> > > > that
>>> > > >> > > > > > > > > indicates
>>> > > >> > > > > > > > > >> > when
>>> > > >> > > > > > > > > >> > > > an
>>> > > >> > > > > > > > > >> > > > >> > > unclear
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> leader
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> election or a
>>> recovery
>>> > > >> > happened.
>>> > > >> > > > > It's a
>>> > > >> > > > > > > bit
>>> > > >> > > > > > > > > >> tricky
>>> > > >> > > > > > > > > >> > > > >> because
>>> > > >> > > > > > > > > >> > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> simple
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> thing, of tracking
>>> it
>>> > per
>>> > > >> > > > controller,
>>> > > >> > > > > > may
>>> > > >> > > > > > > > be
>>> > > >> > > > > > > > > a
>>> > > >> > > > > > > > > >> bit
>>> > > >> > > > > > > > > >> > > > >> > confusing
>>> > > >> > > > > > > > > >> > > > >> > > >>>> during
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> failovers.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> best,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> Colin
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >>
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> On Tue, Sep 12,
>>> 2023, at
>>> > > >> 14:25,
>>> > > >> > > > Colin
>>> > > >> > > > > > > > McCabe
>>> > > >> > > > > > > > > >> > wrote:
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > Hi Calvin,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > Thanks for the
>>> KIP. I
>>> > > >> think
>>> > > >> > > this
>>> > > >> > > > > is a
>>> > > >> > > > > > > > great
>>> > > >> > > > > > > > > >> > > > >> improvement.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >> Additional High
>>> > > Watermark
>>> > > >> > > > advance
>>> > > >> > > > > > > > > >> requirement
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > Typo: change
>>> "advance"
>>> > > to
>>> > > >> > > > > > "advancement"
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >> A bit recap of
>>> some
>>> > key
>>> > > >> > > > concepts.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > Typo: change
>>> "bit" to
>>> > > >> "quick"
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >> Ack=1/all produce
>>> > > >> request.
>>> > > >> > It
>>> > > >> > > > > > defines
>>> > > >> > > > > > > > when
>>> > > >> > > > > > > > > >> the
>>> > > >> > > > > > > > > >> > > > Kafka
>>> > > >> > > > > > > > > >> > > > >> > > server
>>> > > >> > > > > > > > > >> > > > >> > > >>>> should
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> respond to the
>>> produce
>>> > > >> request
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > I think this
>>> section
>>> > > >> would be
>>> > > >> > > > > clearer
>>> > > >> > > > > > > if
>>> > > >> > > > > > > > we
>>> > > >> > > > > > > > > >> > talked
>>> > > >> > > > > > > > > >> > > > >> about
>>> > > >> > > > > > > > > >> > > > >> > > the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> new
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> high
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > watermark
>>> advancement
>>> > > >> > > requirement
>>> > > >> > > > > > > first,
>>> > > >> > > > > > > > > and
>>> > > >> > > > > > > > > >> > THEN
>>> > > >> > > > > > > > > >> > > > >> talked
>>> > > >> > > > > > > > > >> > > > >> > > >>>> about its
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > impact on acks=0,
>>> > > acks=1,
>>> > > >> and
>>> > > >> > > > > > > >  acks=all.
>>> > > >> > > > > > > > > >> > > > acks=all
>>> > > >> > > > > > > > > >> > > > >> is
>>> > > >> > > > > > > > > >> > > > >> > of
>>> > > >> > > > > > > > > >> > > > >> > > >>>> course
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> the
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > main case we care
>>> > about
>>> > > >> here,
>>> > > >> > > so
>>> > > >> > > > it
>>> > > >> > > > > > > would
>>> > > >> > > > > > > > > be
>>> > > >> > > > > > > > > >> > good
>>> > > >> > > > > > > > > >> > > to
>>> > > >> > > > > > > > > >> > > > >> lead
>>> > > >> > > > > > > > > >> > > > >> > > with
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> that,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > rather than
>>> delving
>>> > into
>>> > > >> the
>>> > > >> > > > > > > > technicalities
>>> > > >> > > > > > > > > >> of
>>> > > >> > > > > > > > > >> > > > acks=0/1
>>> > > >> > > > > > > > > >> > > > >> > > first.
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >> Unclean recovery
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > So, here you are
>>> > > >> introducing
>>> > > >> > a
>>> > > >> > > > new
>>> > > >> > > > > > > > > >> > configuration,
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> >
>>> > > unclean.recovery.strategy.
>>> > > >> > The
>>> > > >> > > > > > > difficult
>>> > > >> > > > > > > > > >> thing
>>> > > >> > > > > > > > > >> > > here
>>> > > >> > > > > > > > > >> > > > is
>>> > > >> > > > > > > > > >> > > > >> > that
>>> > > >> > > > > > > > > >> > > > >> > > >>>> there
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> is a
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > lot of overlap
>>> with
>>> > > >> > > > > > > > > >> > > unclean.leader.election.enable.
>>> > > >> > > > > > > > > >> > > > So
>>> > > >> > > > > > > > > >> > > > >> we
>>> > > >> > > > > > > > > >> > > > >> > > >>>> have 3
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > different
>>> settings for
>>> > > >> > > > > > > > > >> > unclean.recovery.strategy,
>>> > > >> > > > > > > > > >> > > > plus
>>> > > >> > > > > > > > > >> > > > >> 2
>>> > > >> > > > > > > > > >> > > > >> > > >>>> different
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > settings for
>>> > > >> > > > > > > > > unclean.leader.election.enable,
>>> > > >> > > > > > > > > >> > > giving
>>> > > >> > > > > > > > > >> > > > a
>>> > > >> > > > > > > > > >> > > > >> > cross
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> product of
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > 6 different
>>> options.
>>> > The
>>> > > >> > > > following
>>> > > >> > > > > > > > "unclean
>>> > > >> > > > > > > > > >> > > recovery
>>> > > >> > > > > > > > > >> > > > >> > > manager"
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> section
>>> > > >> > > > > > > > > >> > > > >> > > >>>> >> >> >> > on
>>> > > >> > > > > >
>>> > > >> > > > >
>>> > > >> > > >
>>> > > >> > >
>>> > > >> >
>>> > > >>
>>> > > >
>>> > >
>>> >
>>>
>>>
>>> --
>>> David Arthur
>>>
>>

-- 
-David

Reply via email to