Requesting permissions to contribute to Apache Kafka

2025-01-23 Thread Kevin Wu
Hello,

I am requesting permissions to contribute to Apache Kafka.
Wiki ID: kw2412
Jira ID: kevinwu2412

Best,
Kevin Wu


[DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-01-27 Thread Kevin Wu
Hey all,

I posted a KIP to monitor broker startup and controlled shutdown on the
controller-side. Here's the link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup

Best,
Kevin Wu


Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-03-15 Thread Kevin Wu
>
> That's an interesting idea. However, I think that's going to be messy and
> difficult for people to use. For example, how would you set up Grafana or
> Datadog to use this? The string could also get extremely long (imagine 1000
> brokers all in startup.)

Hmm... yeah, from what I've read so far, setting this up might be kind of
challenging. I'm not seeing that OTEL supports gauges for string values.

I'm still a little confused as to why having a per-broker metric to expose
its state is preferred, but I think this is at least part of the reason?
When drafting this KIP, I was only really considering the scenarios of the
broker's initial metadata load during startup and their controlled
shutdown, which my proposed metrics would cover. However, there are a lot
of other scenarios with fenced brokers that have already started up, and
the existing fencedBrokers metric doesn't really give enough information
about them from the controller side, since it just reports a count. For
these scenarios, I don't think my proposed startup/shutdown-focused metrics
would be very useful.

I'm on board with the proposed per-broker metric that exposes its state. I
think it would be helpful to enumerate some specific cases for the KIP,
though.
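
For concreteness, a minimal sketch of what a per-broker numeric state gauge
could look like with the Yammer metrics library used by the existing
controller metrics; the class name, registry wiring, and default state code
here are assumptions for illustration, not the KIP's final design:

    import com.yammer.metrics.core.Gauge;
    import com.yammer.metrics.core.MetricName;
    import com.yammer.metrics.core.MetricsRegistry;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class BrokerStateGauges {
        // Broker id -> numeric state code; populated by the controller's
        // registration handling in this sketch.
        private final Map<Integer, Integer> brokerStates = new ConcurrentHashMap<>();

        public void register(MetricsRegistry registry, int brokerId) {
            MetricName name = new MetricName(
                "kafka.controller", "KafkaController", "BrokerRegistrationState", null,
                "kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=" + brokerId);
            registry.newGauge(name, new Gauge<Integer>() {
                @Override
                public Integer value() {
                    // One small numeric time series per broker, instead of
                    // one string metric that grows with the cluster.
                    return brokerStates.getOrDefault(brokerId, 0);
                }
            });
        }
    }

A small numeric value per broker keeps every time series bounded, unlike a
single string-valued metric whose value grows with the number of brokers.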

On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu  wrote:

>> I guess my concern is that the time-based metrics would reset to 0 on
>> every failover (if I understand the proposed implementation correctly).
>> That seems likely to create confusion.
>
> Yeah that makes sense to me. I'm fine with moving towards the approach of
> either (since I don't think we need both):
>
>- Exposing the number of brokers in 1. startup, 2. fenced (what we
>have now), and 3. in controlled shutdown
>- Exposing a per-broker metric reflecting the state of the broker
>(more on this below).
>
>> I think it would be useful to have a state for each broker exposed as a
>> metric. I can think of a lot of scenarios where this would be useful to
>> have. I don't think we should have more than one metric per broker though,
>> if we can help it.
>
> Instead of having exactly a per-broker metric which exposes a number that
> maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values
> are comma-delimited strings of the brokers in those states?
> Something along the lines of:
>
>    - Metric: name = BrokersNotRegistered, value = "kafka-1"
>    - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
>    - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3"
>    - Metric: name = BrokersRegisteredAndUnfenced, value =
>    "kafka-4,kafka-5"
>
> I guess there will be overlap between the second and third metrics, but
> there do exist metrics that expose `Gauge<String>` values.
>
> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu  wrote:
>
>> Hey Colin,
>>
>> Thanks for the review.
>>
>> Regarding the metrics that reflect times: my initial thinking was to
>> indeed have these be "soft state", which would be reset when a controller
>> failover happens. I'm not sure if it's a big issue if these values get
>> reset though, since a controller failover means brokers in startup would
>> need to register again with the new controller anyway. Since what we're
>> trying to monitor with these metrics is the broker's startup and shutdown
>> statuses from the controller's view, my thinking was that exposing this
>> soft state would be appropriate.
>>
>> There exist metrics that expose other soft state of the controller in
>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
>> would fit with what exists there. If instead we're updating these metrics
>> based on the metadata log events for registration changes, it looks like
>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking
>> brokers in their initial startup fencing using the log events, I'm not
>> totally sure as of now how we can actually do this from only the
>> information in `BrokerRegistration`. I guess we know a broker is undergoing
>> startup when it's fenced and has an `incarnationId` the controller hasn't
>> seen before in the log?
>>
>> Regarding the per-broker metrics, what are your thoughts about the metric
>> cardinality of this? There was some discussion about having a
>> startup/shutdown time per-broker and I pushed back against it because the
>> number of metrics we expose as a result is the number of brokers in the
>> cluster. Additionally, I don't think the controller can know of a live
>> broker that has not attempted to register yet in order to make a metric
>> for it and assign it a value of 0. Is a value of 0 also used for brokers
>> that shut down? In that case, doesn't that make the metric cardinality
>> worse? I think if we decide to go that route we should only have states
>> 1, 2, and 3.

Re: [VOTE] KIP-1131: Improved controller-side monitoring of broker states

2025-03-25 Thread Kevin Wu
Hello all,

I am manually bumping this thread.
Any feedback or votes would be appreciated.

Best regards,
Kevin Wu

On Thu, Mar 13, 2025 at 1:54 PM Kevin Wu  wrote:

> Hello all,
>
> I would like to call a vote for KIP-1131: Improved controller-side
> monitoring of broker states.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Improved+controller-side+monitoring+of+broker+states
>
> Discussion thread:
> https://lists.apache.org/thread/z8cwnksl6op4jfg7j0nwsg9xxsf8mwhh
>
> Thanks for the reviews,
> Kevin Wu
>
>


Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-04-14 Thread Kevin Wu
Hey Colin,

> How about something like this?
> 10 = fenced
> 20 = controlled shutdown
> 30 = active

Yeah, that seems reasonable to me. Thanks for the suggestion.
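
A quick sketch of how those codes could map onto an enum, with assumed
names; the gaps presumably leave room for future states without
renumbering, and keep 0, which dashboards often read as missing data,
unused:

    // State codes as discussed above; the enum and constant names are
    // assumptions for illustration.
    public enum BrokerRegistrationState {
        FENCED(10),
        CONTROLLED_SHUTDOWN(20),
        ACTIVE(30);

        private final int metricValue;

        BrokerRegistrationState(int metricValue) {
            this.metricValue = metricValue;
        }

        // Value reported by the per-broker gauge.
        public int metricValue() {
            return metricValue;
        }
    }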

Kevin



On Mon, Apr 14, 2025 at 12:42 PM Kevin Wu  wrote:

> Thanks for the comments Federico.
>
> > If I understand correctly unfenced == active. In the code we always
> > use the term active, so I think it would be better to use that for the
> > state 0 description.
> I've updated the KIP description to refer to "active".
>
> > You propose creating per-broker metrics indicating their state
> > (BrokerRegistrationState.kafka-X). Can't these new metrics be used to
> > derive broker counters in whatever monitor tool you decide to use? I
> > mean, we wouldn't need to store and provide ControlledShutdownBrokerCount
> > (proposed), FencedBrokerCount (existing), ActiveBrokerCount (existing).
> Yes, we can use this new metric to derive broker counters, but it's just
> more complicated for the operator to implement. Also, I don't think it's a
> huge issue that there's a slight redundancy here, since deleting the
> existing metrics will lead to compatibility issues with current monitoring
> setups.
>
> On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu  wrote:
>
>> Thanks for the comments Jose.
>> For 1 and 2, I've changed the naming of the metrics to follow your
>> suggestion of tags/attributes. For 3, I made a note as to why we need the
> maximum. Basically, it's because the map of broker contact times that we
> use as the source for these metrics removes entries when a broker is
> fenced. Therefore, we need some default value when the entry doesn't
>> exist for the broker, but it is still registered.
>>
>> Thanks,
>> Kevin
>>
>> > Thanks for the improvement Kevin. I got a chance to look at the KIP.
>> >
>> > 1.
>> > kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
>> >
>> > Can we use tags or attributes instead of different names? For example,
>> > how about
>> > kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
>> > where X is the node id?
>> >
>> > 2.
>> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
>> >
>> > Same here, did you consider using tags or attributes for the node id?
>> >
>> > 3. For the metrics
>> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
>> > you mentioned that you will limit the value to the heartbeat timeout.
>> > Why? Wouldn't it be useful to report the entire time since the last
>> > heartbeat? That is more information instead of just reporting the
>> > value up to the heartbeat timeout.
>> >
>> > Thanks,
>> > --
>> > -José
>> >
>>
>> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu  wrote:
>>
>>>> That's an interesting idea. However, I think that's going to be messy
>>>> and difficult for people to use. For example, how would you set up Grafana
>>>> or Datadog to use this? The string could also get extremely long (imagine
>>>> 1000 brokers all in startup.)
>>>
>>> Hmm... yeah, from what I've read so far, setting this up might be kind of
>>> challenging. I'm not seeing that OTEL supports gauges for string values.
>>>
>>> I'm still a little confused as to why having a per-broker metric to
>>> expose its state is preferred, but I think this is at least part of the
>>> reason? When drafting this KIP, I was only really considering the scenarios
>>> of the broker's initial metadata load during startup and their controlled
>>> shutdown, which my proposed metrics would cover. However, there are a lot
>>> of other scenarios with fenced brokers that have already started up, and
>>> the existing fencedBrokers metric doesn't really give enough information
>>> about them from the controller side, since it just reports a count. For
>>> these scenarios, I don't think my proposed startup/shutdown-focused metrics
>>> would be very useful.
>>>
>>> I'm on board with the proposed per-broker metric that exposes its state.
>>> I think it would be helpful to enumerate some specific cases for the KIP,
>>> though.
>>>
>>> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu  wrote:
>>>
>>>>> I guess my concern is that the time-based metrics would reset to 0 on
>>>>> every failover (if I understand the proposed implementation correctly).
>>>>> That seems likely to create confusion.

Re: [VOTE] KIP-1131: Improved controller-side monitoring of broker states

2025-04-28 Thread Kevin Wu
Hey PoAn,

I might be wrong, but I thought you were a committer? If so, are you
willing to change your vote to be binding if the changes look good, so the
KIP is accepted?

Best regards,
Kevin

On Tue, Mar 25, 2025 at 5:29 PM Kevin Wu  wrote:

> Hello all,
>
> I am manually bumping this thread.
> Any feedback or votes would be appreciated.
>
> Best regards,
> Kevin Wu
>
> On Thu, Mar 13, 2025 at 1:54 PM Kevin Wu  wrote:
>
>> Hello all,
>>
>> I would like to call a vote for KIP-1131: Improved controller-side
>> monitoring of broker states.
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Improved+controller-side+monitoring+of+broker+states
>>
>> Discussion thread:
>> https://lists.apache.org/thread/z8cwnksl6op4jfg7j0nwsg9xxsf8mwhh
>>
>> Thanks for the reviews,
>> Kevin Wu
>>
>>


Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-04-14 Thread Kevin Wu
Thanks for the comments Jose.
For 1 and 2, I've changed the naming of the metrics to follow your
suggestion of tags/attributes. For 3, I made a note as to why we need the
maximum. Basically, it's because the map of broker contact times that we
use as the source for these metrics removes entries when a broker is
fenced. Therefore, we need some default value when the entry doesn't
exist for the broker, but it is still registered.
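
Concretely, the fallback could look like the following sketch, with assumed
names; when a registered broker's entry has been removed from the map, the
metric reports the capped maximum:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeartbeatAgeTracker {
        // Broker id -> last contact time; entries are removed when a broker
        // is fenced, per the discussion above.
        private final Map<Integer, Long> lastContactTimeMs = new ConcurrentHashMap<>();
        private final long heartbeatTimeoutMs;

        public HeartbeatAgeTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        public long timeSinceLastHeartbeatMs(int brokerId, long nowMs) {
            Long last = lastContactTimeMs.get(brokerId);
            if (last == null) {
                // Fenced but still registered: no entry exists, so report
                // the default maximum.
                return heartbeatTimeoutMs;
            }
            return Math.min(nowMs - last, heartbeatTimeoutMs);
        }
    }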

Thanks,
Kevin

> Thanks for the improvement Kevin. I got a chance to look at the KIP.
>
> 1.
> kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
>
> Can we use tags or attributes instead of different names? For example,
> how about
> kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
> where X is the node id?
>
> 2.
> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
>
> Same here, did you consider using tags or attributes for the node id?
>
> 3. For the metrics
> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
> you mentioned that you will limit the value to the heartbeat timeout.
> Why? Wouldn't it be useful to report the entire time since the last
> heartbeat? That is more information instead of just reporting the
> value up to the heartbeat timeout.
>
> Thanks,
> --
> -José
>

On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu  wrote:

>> That's an interesting idea. However, I think that's going to be messy and
>> difficult for people to use. For example, how would you set up Grafana or
>> Datadog to use this? The string could also get extremely long (imagine 1000
>> brokers all in startup.)
>
> Hmm... yeah, from what I've read so far, setting this up might be kind of
> challenging. I'm not seeing that OTEL supports gauges for string values.
>
> I'm still a little confused as to why having a per-broker metric to expose
> its state is preferred, but I think this is at least part of the reason?
> When drafting this KIP, I was only really considering the scenarios of the
> broker's initial metadata load during startup and their controlled
> shutdown, which my proposed metrics would cover. However, there are a lot
> of other scenarios with fenced brokers that have already started up, and
> the existing fencedBrokers metric doesn't really give enough information
> about them from the controller side, since it just reports a count. For
> these scenarios, I don't think my proposed startup/shutdown-focused metrics
> would be very useful.
>
> I'm on board with the proposed per-broker metric that exposes its state. I
> think it would be helpful to enumerate some specific cases for the KIP,
> though.
>
> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu  wrote:
>
>>> I guess my concern is that the time-based metrics would reset to 0 on
>>> every failover (if I understand the proposed implementation correctly).
>>> That seems likely to create confusion.
>>
>> Yeah that makes sense to me. I'm fine with moving towards the approach of
>> either (since I don't think we need both):
>>
>>- Exposing the number of brokers in 1. startup, 2. fenced (what we
>>have now), and 3. in controlled shutdown
>>- Exposing a per-broker metric reflecting the state of the broker
>>(more on this below).
>>
>>> I think it would be useful to have a state for each broker exposed as a
>>> metric. I can think of a lot of scenarios where this would be useful to
>>> have. I don't think we should have more than one metric per broker though,
>>> if we can help it.
>>
>> Instead of having exactly a per-broker metric which exposes a number that
>> maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values
>> are comma-delimited strings of the brokers in those states?
>> Something along the lines of:
>>
>>    - Metric: name = BrokersNotRegistered, value = "kafka-1"
>>    - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
>>    - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3"
>>    - Metric: name = BrokersRegisteredAndUnfenced, value =
>>    "kafka-4,kafka-5"
>>
>> I guess there will be overlap between the second and third metrics, but
>> there do exist metrics that expose `Gauge<String>` values.
>>
>> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu  wrote:
>>
>>> Hey Colin,
>>>
>>> Thanks for the review.
>>>
>>> Regarding the metrics that reflect times: my initial thinking was to
>>> indeed have these be "soft state", which would be reset when a controller
>>> failover happens.

Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-04-14 Thread Kevin Wu
Thanks for the comments Federico.

> If I understand correctly unfenced == active. In the code we always
> use the term active, so I think it would be better to use that for the
> state 0 description.
I've updated the KIP description to refer to "active".

> You propose creating per-broker metrics indicating their state
> (BrokerRegistrationState.kafka-X). Can't these new metrics be used to
> derive broker counters in whatever monitor tool you decide to use? I
> mean, we wouldn't need to store and provide ControlledShutdownBrokerCount
> (proposed), FencedBrokerCount (existing), ActiveBrokerCount (existing).
Yes, we can use this new metric to derive broker counters, but it's just
more complicated for the operator to implement. Also, I don't think it's a
huge issue that there's a slight redundancy here, since deleting the
existing metrics will lead to compatibility issues with current monitoring
setups.
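
For comparison, the aggregation each operator would otherwise have to
implement looks roughly like this sketch, assuming a per-broker state map
and the numeric state codes discussed elsewhere in the thread:

    import java.util.Map;

    public final class BrokerStateCounts {
        // Count brokers currently reporting a given state code, e.g. 20 for
        // controlled shutdown in the scheme discussed later in this thread.
        public static long countInState(Map<Integer, Integer> brokerStates, int stateCode) {
            return brokerStates.values().stream()
                .filter(state -> state == stateCode)
                .count();
        }
    }

Every monitoring stack would need to reproduce this aggregation, which is
the extra operator work referred to above; keeping the existing count
metrics avoids breaking current dashboards.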

On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu  wrote:

> Thanks for the comments Jose.
> For 1 and 2, I've changed the naming of the metrics to follow your
> suggestion of tags/attributes. For 3, I made a note as to why we need the
> maximum. Basically, it's because the map of broker contact times that we
> use as the source for these metrics removes entries when a broker is
> fenced. Therefore, we need some default value when the entry doesn't
> exist for the broker, but it is still registered.
>
> Thanks,
> Kevin
>
> > Thanks for the improvement Kevin. I got a chance to look at the KIP.
> >
> > 1.
> > kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
> >
> > Can we use tags or attributes instead of different names? For example,
> > how about
> > kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
> > where X is the node id?
> >
> > 2.
> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
> >
> > Same here, did you consider using tags or attributes for the node id?
> >
> > 3. For the metrics
> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
> > you mentioned that you will limit the value to the heartbeat timeout.
> > Why? Wouldn't it be useful to report the entire time since the last
> > heartbeat? That is more information instead of just reporting the
> > value up to the heartbeat timeout.
> >
> > Thanks,
> > --
> > -José
> >
>
> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu  wrote:
>
>>> That's an interesting idea. However, I think that's going to be messy and
>>> difficult for people to use. For example, how would you set up Grafana or
>>> Datadog to use this? The string could also get extremely long (imagine 1000
>>> brokers all in startup.)
>>
>> Hmm... yeah, from what I've read so far, setting this up might be kind of
>> challenging. I'm not seeing that OTEL supports gauges for string values.
>>
>> I'm still a little confused as to why having a per-broker metric to
>> expose its state is preferred, but I think this is at least part of the
>> reason? When drafting this KIP, I was only really considering the scenarios
>> of the broker's initial metadata load during startup and their controlled
>> shutdown, which my proposed metrics would cover. However, there are a lot
>> of other scenarios with fenced brokers that have already started up, and
>> the existing fencedBrokers metric doesn't really give enough information
>> about them from the controller side, since it just reports a count. For
>> these scenarios, I don't think my proposed startup/shutdown-focused metrics
>> would be very useful.
>>
>> I'm on board with the proposed per-broker metric that exposes its state.
>> I think it would be helpful to enumerate some specific cases for the KIP,
>> though.
>>
>> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu  wrote:
>>
>>>> I guess my concern is that the time-based metrics would reset to 0 on
>>>> every failover (if I understand the proposed implementation correctly).
>>>> That seems likely to create confusion.
>>>
>>> Yeah that makes sense to me. I'm fine with moving towards the approach
>>> of either (since I don't think we need both):
>>>
>>>- Exposing the number of brokers in 1. startup, 2. fenced (what we
>>>have now), and 3. in controlled shutdown
>>>- Exposing a per-broker metric reflecting the state of the broker
>>>(more on this below).
>>>
>>>> I think it would be useful to have a state for each broker exposed as a
>>>> metric.

Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-04-24 Thread Kevin Wu
Hey Jose,

Yeah, that was an initial discussion point that isn't going to be
implemented. I'll move it to "rejected alternatives" and remove the
"proposed changes" section. Thanks for the feedback.

Best,
Kevin

On Mon, Apr 14, 2025 at 4:31 PM Kevin Wu  wrote:

> Hey Colin,
>
> > How about something like this?
> > 10 = fenced
> > 20 = controlled shutdown
> > 30 = active
>
> Yeah, that seems reasonable to me. Thanks for the suggestion.
>
> Kevin
>
>
>
> On Mon, Apr 14, 2025 at 12:42 PM Kevin Wu  wrote:
>
>> Thanks for the comments Federico.
>>
>> > If I understand correctly unfenced == active. In the code we always
>> > use the term active, so I think it would be better to use that for the
>> > state 0 description.
>> I've updated the KIP description to refer to "active".
>>
>> > You propose creating per-broker metrics indicating their state
>> > (BrokerRegistrationState.kafka-X). Can't these new metrics be used to
>> > derive broker counters in whatever monitor tool you decide to use? I
>> > mean, we wouldn't need to store and provide ControlledShutdownBrokerCount
>> > (proposed), FencedBrokerCount (existing), ActiveBrokerCount (existing).
>> Yes, we can use this new metric to derive broker counters, but it's just
>> more complicated for the operator to implement. Also, I don't think it's a
>> huge issue that there's a slight redundancy here, since deleting the
>> existing metrics will lead to compatibility issues with current monitoring
>> setups.
>>
>> On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu  wrote:
>>
>>> Thanks for the comments Jose.
>>> For 1 and 2, I've changed the naming of the metrics to follow your
>>> suggestion of tags/attributes. For 3, I made a note as to why we need the
>>> maximum. Basically, it's because the map of broker contact times that we
>>> use as the source for these metrics removes entries when a broker is
>>> fenced. Therefore, we need some default value when the entry doesn't
>>> exist for the broker, but it is still registered.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> > Thanks for the improvement Kevin. I got a chance to look at the KIP.
>>> >
>>> > 1.
>>> > kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
>>> >
>>> > Can we use tags or attributes instead of different names? For example,
>>> > how about
>>> > kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
>>> > where X is the node id?
>>> >
>>> > 2.
>>> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
>>> >
>>> > Same here, did you consider using tags or attributes for the node id?
>>> >
>>> > 3. For the metrics
>>> > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
>>> > you mentioned that you will limit the value to the heartbeat timeout.
>>> > Why? Wouldn't it be useful to report the entire time since the last
>>> > heartbeat? That is more information instead of just reporting the
>>> > value up to the heartbeat timeout.
>>> >
>>> > Thanks,
>>> > --
>>> > -José
>>> >
>>>
>>> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu  wrote:
>>>
>>>>> That's an interesting idea. However, I think that's going to be messy
>>>>> and difficult for people to use. For example, how would you set up Grafana
>>>>> or Datadog to use this? The string could also get extremely long (imagine
>>>>> 1000 brokers all in startup.)
>>>>
>>>> Hmm... yeah, from what I've read so far, setting this up might be kind of
>>>> challenging. I'm not seeing that OTEL supports gauges for string values.
>>>>
>>>> I'm still a little confused as to why having a per-broker metric to
>>>> expose its state is preferred, but I think this is at least part of the
>>>> reason? When drafting this KIP, I was only really considering the scenarios
>>>> of the broker's initial metadata load during startup and their controlled
>>>> shutdown, which my proposed metrics would cover. However, there are a lot
>>>> of other scenarios with fenced brokers that have already started up, and
>>>> the existing fencedBrokers metric doesn't really give enough information
>>>> about them from the controller side, since it just reports a count.

Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-02-27 Thread Kevin Wu
>
> I guess my concern is that the time-based metrics would reset to 0 on
> every failover (if I understand the proposed implementation correctly).
> That seems likely to create confusion.

Yeah that makes sense to me. I'm fine with moving towards the approach of
either (since I don't think we need both):

   - Exposing the number of brokers in 1. startup, 2. fenced (what we have
   now), and 3. in controlled shutdown
   - Exposing a per-broker metric reflecting the state of the broker (more
   on this below).

> I think it would be useful to have a state for each broker exposed as a
> metric. I can think of a lot of scenarios where this would be useful to
> have. I don't think we should have more than one metric per broker though,
> if we can help it.

Instead of having exactly a per-broker metric which exposes a number that
maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values
are comma-delimited strings of the brokers in those states?
Something along the lines of:

   - Metric: name = BrokersNotRegistered, value = "kafka-1"
   - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
   - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3"
   - Metric: name = BrokersRegisteredAndUnfenced, value =
   "kafka-4,kafka-5"

I guess there will be overlap between the second and third metrics, but
there do exist metrics that expose `Gauge<String>` values.
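
As a sketch of what such a string-valued gauge would look like (the wiring
here is assumed):

    import com.yammer.metrics.core.Gauge;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.stream.Collectors;

    public class BrokersInStateGauge extends Gauge<String> {
        // Broker name -> state code; TreeMap keeps the reported list stable.
        private final Map<String, Integer> brokerStates = new TreeMap<>();
        private final int stateCode;

        public BrokersInStateGauge(int stateCode) {
            this.stateCode = stateCode;
        }

        public void setState(String broker, int state) {
            brokerStates.put(broker, state);
        }

        @Override
        public String value() {
            // e.g. "kafka-2,kafka-3"; grows without bound as more brokers
            // enter this state.
            return brokerStates.entrySet().stream()
                .filter(e -> e.getValue() == stateCode)
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(","));
        }
    }

With 1000 brokers in one state, that single value becomes an extremely long
string, which is the usability problem raised against this idea.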

On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu  wrote:

> Hey Colin,
>
> Thanks for the review.
>
> Regarding the metrics that reflect times: my initial thinking was to
> indeed have these be "soft state", which would be reset when a controller
> failover happens. I'm not sure if it's a big issue if these values get
> reset though, since a controller failover means brokers in startup would
> need to register again with the new controller anyway. Since what we're
> trying to monitor with these metrics is the broker's startup and shutdown
> statuses from the controller's view, my thinking was that exposing this
> soft state would be appropriate.
>
> There exist metrics that expose other soft state of the controller in
> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
> would fit with what exists there. If instead we're updating these metrics
> based on the metadata log events for registration changes, it looks like
> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
> we could add a `ControlledShutdownBrokerCount`. For specifically tracking
> brokers in their initial startup fencing using the log events, I'm not
> totally sure as of now how we can actually do this from only the
> information in `BrokerRegistration`. I guess we know a broker is undergoing
> startup when it's fenced and has an `incarnationId` the controller hasn't
> seen before in the log?
>
> Regarding the per-broker metrics, what are your thoughts about the metric
> cardinality of this? There was some discussion about having a
> startup/shutdown time per-broker and I pushed back against it because the
> number of metrics we expose as a result is the number of brokers in the
> cluster. Additionally, I don't think the controller can know of a live
> broker that has not attempted to register yet in order to make a metric for
> it and assign it a value of 0. Is a value of 0 also used for brokers that
> shut down? In that case, doesn't that make the metric cardinality worse? I
> think if we decide to go that route we should only have states 1, 2, and 3.
>
> Best,
> Kevin Wu
>
> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu  wrote:
>
>> Hey all,
>>
>> I posted a KIP to monitor broker startup and controlled shutdown on the
>> controller-side. Here's the link:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>>
>> Best,
>> Kevin Wu
>>
>


[VOTE] KIP-1131: Improved controller-side monitoring of broker states

2025-03-13 Thread Kevin Wu
Hello all,

I would like to call a vote for KIP-1131: Improved controller-side
monitoring of broker states.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Improved+controller-side+monitoring+of+broker+states

Discussion thread:
https://lists.apache.org/thread/z8cwnksl6op4jfg7j0nwsg9xxsf8mwhh

Thanks for the reviews,
Kevin Wu


RE: Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-02-25 Thread Kevin Wu
Hey Colin,

Thanks for the review.

Regarding the metrics that reflect times: my initial thinking was to indeed
have these be "soft state", which would be reset when a controller failover
happens. I'm not sure if it's a big issue if these values get reset
though, since a controller failover means brokers in startup would need to
register again with the new controller anyway. Since what we're trying to
monitor with these metrics is the broker's startup and shutdown statuses
from the controller's view, my thinking was that exposing this soft state
would be appropriate.

There exist metrics that expose other soft state of the controller in
`QuorumControllerMetrics.java`, and I thought the proposed metrics here
would fit with what exists there. If instead we're updating these metrics
based on the metadata log events for registration changes, it looks like
`ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
we could add a `ControlledShutdownBrokerCount`. For specifically tracking
brokers in their initial startup fencing using the log events, I'm not
totally sure as of now how we can actually do this from only the
information in `BrokerRegistration`. I guess we know a broker is undergoing
startup when it's fenced and has an `incarnationId` the controller hasn't
seen before in the log?
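
In rough code, that heuristic would look something like this sketch, with
assumed types and names:

    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class StartupDetector {
        // Incarnation ids already replayed from the metadata log.
        private final Set<UUID> seenIncarnations = new HashSet<>();

        // Called as registration records are replayed; returns true when a
        // fenced registration carries a never-before-seen incarnation id,
        // i.e. a broker in its initial startup fencing.
        public boolean looksLikeInitialStartup(boolean fenced, UUID incarnationId) {
            boolean firstTimeSeen = seenIncarnations.add(incarnationId);
            return fenced && firstTimeSeen;
        }
    }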

Regarding the per-broker metrics, what are your thoughts about the metric
cardinality of this? There was some discussion about having a
startup/shutdown time per-broker and I pushed back against it because the
number of metrics we expose as a result is the number of brokers in the
cluster. Additionally, I don't think the controller can know of a live
broker that has not attempted to register yet in order to make a metric for
it and assign it a value of 0. Is a value of 0 also used for brokers that
shut down? In that case, doesn't that make the metric cardinality worse? I
think if we decide to go that route we should only have states 1, 2, and 3.

Best,
Kevin Wu


On 2025/02/20 00:22:00 Colin McCabe wrote:
> Hi Kevin,
>
> Thanks for the KIP.
>
> I notice that you have some metrics that reflect times here, such as
> LongestPendingStartupTimeMs, LongestPendingControlledShutdownTimeMs, etc. I
> think this may be difficult to do with complete accuracy because we don't
> include times in the metadata log events for registration changes. If we
> just do the obvious thing and make the times "soft state" then these times
> will be reset when there is a controller failover.
>
> Perhaps it would be simpler to cut out the metrics that include a time
> and just have NumberOfBrokersInStartup and
> NumberOfBrokersInControlledShutdown? Then people could set up an alert on
> these metrics. For example, set up an alert that fires if
> NumberOfBrokersInStartup is non-zero for more than 5 minutes.
>
> I wonder if it would be a good idea to have a per-broker metric on the
> controller that showed the state of each broker. Like 0 = not registered, 1
> = registered and never unfenced, 2 = registered and fenced, 3 = registered
> and unfenced. It obviously would add some more metrics for us to track, but
> I think it would be more useful than a bunch of special-purpose metrics...
>
> best,
> Colin
>
>
> On Mon, Jan 27, 2025, at 10:56, Kevin Wu wrote:
> > Hey all,
> >
> > I posted a KIP to monitor broker startup and controlled shutdown on the
> > controller-side. Here's the link:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
> >
> > Best,
> > Kevin Wu
>


Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

2025-02-25 Thread Kevin Wu
Hey Colin,

Thanks for the review.

Regarding the metrics that reflect times: my initial thinking was to indeed
have these be "soft state", which would be reset when a controller failover
happens. I'm not sure if it's a big issue if these values get reset
though, since a controller failover means brokers in startup would need to
register again with the new controller anyway. Since what we're trying to
monitor with these metrics is the broker's startup and shutdown statuses
from the controller's view, my thinking was that exposing this soft state
would be appropriate.

There exist metrics that expose other soft state of the controller in
`QuorumControllerMetrics.java`, and I thought the proposed metrics here
would fit with what exists there. If instead we're updating these metrics
based on the metadata log events for registration changes, it looks like
`ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
we could add a `ControlledShutdownBrokerCount`. For specifically tracking
brokers in their initial startup fencing using the log events, I'm not
totally sure as of now how we can actually do this from only the
information in `BrokerRegistration`. I guess we know a broker is undergoing
startup when it's fenced and has an `incarnationId` the controller hasn't
seen before in the log?

Regarding the per-broker metrics, what are your thoughts about the metric
cardinality of this? There was some discussion about having a
startup/shutdown time per-broker and I pushed back against it because the
number of metrics we expose as a result is the number of brokers in the
cluster. Additionally, I don't think the controller can know of a live
broker that has not attempted to register yet in order to make a metric for
it and assign it a value of 0. Is a value of 0 also used for brokers that
shut down? In that case, doesn't that make the metric cardinality worse? I
think if we decide to go that route we should only have states 1, 2, and 3.

Best,
Kevin Wu

On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu  wrote:

> Hey all,
>
> I posted a KIP to monitor broker startup and controlled shutdown on the
> controller-side. Here's the link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>
> Best,
> Kevin Wu
>


[DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-07 Thread Kevin Wu
Hello all,

I wrote a KIP to add a generic feature level metric.
Here's the link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric

Thanks,
Kevin Wu


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-07 Thread Kevin Wu
Hey Jose,

Thanks for the response. Yeah, the new metric should expose
metadata.version as well. Let me update the KIP to reflect that.

Thanks,
Kevin Wu

On Wed, May 7, 2025 at 2:54 PM Kevin Wu  wrote:

> Hello all,
>
> I wrote a KIP to add a generic feature level metric.
> Here's the link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>
> Thanks,
> Kevin Wu
>
>
>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-12 Thread Kevin Wu
Hey Chia-Ping and Justine,

Okay, that makes sense about the minimum version changing at some point.
I'll add these metrics to this KIP. Thanks for the insightful discussion.

Best,
Kevin Wu

On Fri, May 9, 2025 at 4:54 PM Kevin Wu  wrote:

> Hey Chia-Ping and Justine,
>
> Thanks for the explanation. I see where y'all are coming from, but I want
> to make sure I understand how the value of this metric would change.
>
> It seems to me that the supported feature range is determined by the
> software version, so this metric's value should only change when a software
> upgrade/downgrade occurs. Otherwise, the range should not change. Is that
> correct?
>
> Also, if we want to add this metric, we would just have one additional
> metric per feature right, which would be the maximum feature level
> supported, since the minimum is always 0?
>
> Thanks,
> Kevin
>
> On Thu, May 8, 2025 at 6:06 PM Kevin Wu  wrote:
>
>> Hey Chia-Ping,
>>
>> I hadn't considered adding the supported versions for each feature as a
>> metric, but I'm not sure if it's helpful for monitoring the progress of an
>> upgrade/downgrade of a feature. For example, if a node doesn't support a
>> particular feature level we're upgrading to, we shouldn't even be allowed
>> to run the upgrade right? I think that's the case for kraft.version (which
>> might be a special case), but I'm not sure about the other features. The
>> use case for exposing the finalized feature level is that monitoring it
>> across all nodes tells the operator that an upgrade/downgrade of the
>> feature was completed on every node.
>>
>> Best,
>> Kevin Wu
>>
>> On Thu, May 8, 2025 at 9:04 AM Kevin Wu  wrote:
>>
>>> Hey Jun,
>>>
>>> Thanks for the comments.
>>> 1. I'll update the KIP. My trunk is a bit stale.
>>> 2. Yeah, the metric should report the finalized feature level for the
>>> feature. And if it is not set, the metric will report 0.
>>> 3. I'll update the KIP with a timeline.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Wed, May 7, 2025 at 3:10 PM Kevin Wu  wrote:
>>>
>>>> Hey Jose,
>>>>
>>>> Thanks for the response. Yeah, the new metric should expose
>>>> metadata.version as well. Let me update the KIP to reflect that.
>>>>
>>>> Thanks,
>>>> Kevin Wu
>>>>
>>>> On Wed, May 7, 2025 at 2:54 PM Kevin Wu  wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I wrote a KIP to add a generic feature level metric.
>>>>> Here's the link:
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>>>>
>>>>> Thanks,
>>>>> Kevin Wu
>>>>>
>>>>>
>>>>>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-08 Thread Kevin Wu
Hey Jun,

Thanks for the comments.
1. I'll update the KIP. My trunk is a bit stale.
2. Yeah, the metric should report the finalized feature level for the
feature. And if it is not set, the metric will report 0.
3. I'll update the KIP with a timeline.
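
On point 2, a minimal sketch of that behavior, with the gauge wiring
assumed: report the finalized level, or 0 when nothing has been finalized
yet:

    import com.yammer.metrics.core.Gauge;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class FinalizedLevelGauge extends Gauge<Short> {
        // Feature name -> finalized level, as read from the metadata log.
        private final Map<String, Short> finalizedLevels = new ConcurrentHashMap<>();
        private final String featureName;

        public FinalizedLevelGauge(String featureName) {
            this.featureName = featureName;
        }

        public void update(short level) {
            finalizedLevels.put(featureName, level);
        }

        @Override
        public Short value() {
            // 0 means "not finalized yet".
            return finalizedLevels.getOrDefault(featureName, (short) 0);
        }
    }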

Thanks,
Kevin

On Wed, May 7, 2025 at 3:10 PM Kevin Wu  wrote:

> Hey Jose,
>
> Thanks for the response. Yeah, the new metric should expose
> metadata.version as well. Let me update the KIP to reflect that.
>
> Thanks,
> Kevin Wu
>
> On Wed, May 7, 2025 at 2:54 PM Kevin Wu  wrote:
>
>> Hello all,
>>
>> I wrote a KIP to add a generic feature level metric.
>> Here's the link:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>
>> Thanks,
>> Kevin Wu
>>
>>
>>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-09 Thread Kevin Wu
Hey Chia-Ping and Justine,

Thanks for the explanation. I see where y'all are coming from, but I want
to make sure I understand how the value of this metric would change.

It seems to me that the supported feature range is determined by the
software version, so this metric's value should only change when a software
upgrade/downgrade occurs. Otherwise, the range should not change. Is that
correct?

Also, if we want to add this metric, we would just have one additional
metric per feature right, which would be the maximum feature level
supported, since the minimum is always 0?

Thanks,
Kevin

On Thu, May 8, 2025 at 6:06 PM Kevin Wu  wrote:

> Hey Chia-Ping,
>
> I hadn't considered adding the supported versions for each feature as a
> metric, but I'm not sure if it's helpful for monitoring the progress of an
> upgrade/downgrade of a feature. For example, if a node doesn't support a
> particular feature level we're upgrading to, we shouldn't even be allowed
> to run the upgrade right? I think that's the case for kraft.version (which
> might be a special case), but I'm not sure about the other features. The
> use case for exposing the finalized feature level is that monitoring it
> across all nodes tells the operator that an upgrade/downgrade of the
> feature was completed on every node.
>
> Best,
> Kevin Wu
>
> On Thu, May 8, 2025 at 9:04 AM Kevin Wu  wrote:
>
>> Hey Jun,
>>
>> Thanks for the comments.
>> 1. I'll update the KIP. My trunk is a bit stale.
>> 2. Yeah, the metric should report the finalized feature level for the
>> feature. And if it is not set, the metric will report 0.
>> 3. I'll update the KIP with a timeline.
>>
>> Thanks,
>> Kevin
>>
>> On Wed, May 7, 2025 at 3:10 PM Kevin Wu  wrote:
>>
>>> Hey Jose,
>>>
>>> Thanks for the response. Yeah, the new metric should expose
>>> metadata.version as well. Let me update the KIP to reflect that.
>>>
>>> Thanks,
>>> Kevin Wu
>>>
>>> On Wed, May 7, 2025 at 2:54 PM Kevin Wu  wrote:
>>>
>>>> Hello all,
>>>>
>>>> I wrote a KIP to add a generic feature level metric.
>>>> Here's the link:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>>>
>>>> Thanks,
>>>> Kevin Wu
>>>>
>>>>
>>>>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-13 Thread Kevin Wu
Hey Jun,

Thanks for the comments:
4. Maybe I'm missing something, but I think the MetadataLoader is used by
both the broker and controller, so having the one metric works for both
node types. The CurrentMetadataVersion metric is currently reported on both
the broker and controller.
5. What is the best naming practice for additional metrics being added to
existing metrics groups? I'm following the naming convention that is
already in place for these existing metrics objects (MetadataLoaderMetrics
and BrokerServerMetrics), where the former is camel case and the latter is
kebab case.

Best,
Kevin Wu

On Mon, May 12, 2025 at 9:05 AM Kevin Wu  wrote:

> Hey Chia-Ping and Justine,
>
> Okay, that makes sense about the minimum version changing at some point.
> I'll add these metrics to this KIP. Thanks for the insightful discussion.
>
> Best,
> Kevin Wu
>
> On Fri, May 9, 2025 at 4:54 PM Kevin Wu  wrote:
>
>> Hey Chia-Ping and Justine,
>>
>> Thanks for the explanation. I see where y'all are coming from, but I want
>> to make sure I understand how the value of this metric would change.
>>
>> It seems to me that the supported feature range is determined by the
>> software version, so this metric's value should only change when a software
>> upgrade/downgrade occurs. Otherwise, the range should not change. Is that
>> correct?
>>
>> Also, if we want to add this metric, we would just have one additional
>> metric per feature right, which would be the maximum feature level
>> supported, since the minimum is always 0?
>>
>> Thanks,
>> Kevin
>>
>> On Thu, May 8, 2025 at 6:06 PM Kevin Wu  wrote:
>>
>>> Hey Chia-Ping,
>>>
>>> I hadn't considered adding the supported versions for each feature as a
>>> metric, but I'm not sure if it's helpful for monitoring the progress of an
>>> upgrade/downgrade of a feature. For example, if a node doesn't support a
>>> particular feature level we're upgrading to, we shouldn't even be allowed
>>> to run the upgrade right? I think that's the case for kraft.version (which
>>> might be a special case), but I'm not sure about the other features. The
>>> use case for exposing the finalized feature level is that monitoring it
>>> across all nodes tells the operator that an upgrade/downgrade of the
>>> feature was completed on every node.
>>>
>>> Best,
>>> Kevin Wu
>>>
>>> On Thu, May 8, 2025 at 9:04 AM Kevin Wu  wrote:
>>>
>>>> Hey Jun,
>>>>
>>>> Thanks for the comments.
>>>> 1. I'll update the KIP. My trunk is a bit stale.
>>>> 2. Yeah, the metric should report the finalized feature level for the
>>>> feature. And if it is not set, the metric will report 0.
>>>> 3. I'll update the KIP with a timeline.
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Wed, May 7, 2025 at 3:10 PM Kevin Wu  wrote:
>>>>
>>>>> Hey Jose,
>>>>>
>>>>> Thanks for the response. Yeah, the new metric should expose
>>>>> metadata.version as well. Let me update the KIP to reflect that.
>>>>>
>>>>> Thanks,
>>>>> Kevin Wu
>>>>>
>>>>> On Wed, May 7, 2025 at 2:54 PM Kevin Wu 
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I wrote a KIP to add a generic feature level metric.
>>>>>> Here's the link:
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin Wu
>>>>>>
>>>>>>
>>>>>>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-08 Thread Kevin Wu
Hey Chia-Ping,

I hadn't considered adding the supported versions for each feature as a
metric, but I'm not sure if it's helpful for monitoring the progress of an
upgrade/downgrade of a feature. For example, if a node doesn't support a
particular feature level we're upgrading to, we shouldn't even be allowed
to run the upgrade right? I think that's the case for kraft.version (which
might be a special case), but I'm not sure about the other features. The
use case for exposing the finalized feature level is that monitoring it
across all nodes tells the operator that an upgrade/downgrade of the
feature was completed on every node.

Best,
Kevin Wu

On Thu, May 8, 2025 at 9:04 AM Kevin Wu  wrote:

> Hey Jun,
>
> Thanks for the comments.
> 1. I'll update the KIP. My trunk is a bit stale.
> 2. Yeah, the metric should report the finalized feature level for the
> feature. And if it is not set, the metric will report 0.
> 3. I'll update the KIP with a timeline.
>
> Thanks,
> Kevin
>
> On Wed, May 7, 2025 at 3:10 PM Kevin Wu  wrote:
>
>> Hey Jose,
>>
>> Thanks for the response. Yeah, the new metric should expose
>> metadata.version as well. Let me update the KIP to reflect that.
>>
>> Thanks,
>> Kevin Wu
>>
>> On Wed, May 7, 2025 at 2:54 PM Kevin Wu  wrote:
>>
>>> Hello all,
>>>
>>> I wrote a KIP to add a generic feature level metric.
>>> Here's the link:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>>
>>> Thanks,
>>> Kevin Wu
>>>
>>>
>>>


Re: [DISCUSS] KIP-1180: Add a generic feature level metric

2025-05-15 Thread Kevin Wu
Hi all,

Thanks for all the comments about the metric type field for the minimum and
maximum supported feature levels. I agree they are software version
specific. Also, since they are shared across the broker and controller like
the FinalizedLevel metric, having two separate metrics is redundant and
confusing, so I think we should add another metric group to cover metrics
that are "software version-specific" like José and PoAn mentioned. I think
"NodeMetrics" is a good name for the new metric group.

I still think the FinalizedLevel metric should reside in the
MetadataLoader metrics, since we derive its value from feature records in
the metadata log.

PY_0: Okay, I will update the section with a link to KIP-1160.

Thanks,
Kevin



On Tue, May 13, 2025 at 8:56 AM Kevin Wu  wrote:

> Hey Jun,
>
> Thanks for the comments:
> 4. Maybe I'm missing something, but I think the MetadataLoader is used by
> both the broker and controller, so having the one metric works for both
> node types. The CurrentMetadataVersion metric is currently reported on both
> the broker and controller.
> 5. What is the best naming practice for additional metrics being added to
> existing metrics groups? I'm following the naming convention that is
> already in place for these existing metrics objects (MetadataLoaderMetrics
> and BrokerServerMetrics), where the former is camel case and the latter is
> kebab case.
>
> Best,
> Kevin Wu
>
> On Mon, May 12, 2025 at 9:05 AM Kevin Wu  wrote:
>
>> Hey Chia-Ping and Justine,
>>
>> Okay, that makes sense about the minimum version changing at some point.
>> I'll add these metrics to this KIP. Thanks for the insightful discussion.
>>
>> Best,
>> Kevin Wu
>>
>> On Fri, May 9, 2025 at 4:54 PM Kevin Wu  wrote:
>>
>>> Hey Chia-Ping and Justine,
>>>
>>> Thanks for the explanation. I see where y'all are coming from, but I
>>> want to make sure I understand how the value of this metric would change.
>>>
>>> It seems to me that the supported feature range is determined by the
>>> software version, so this metric's value should only change when a software
>>> upgrade/downgrade occurs. Otherwise, the range should not change. Is that
>>> correct?
>>>
>>> Also, if we want to add this metric, we would just have one additional
>>> metric per feature right, which would be the maximum feature level
>>> supported, since the minimum is always 0?
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Thu, May 8, 2025 at 6:06 PM Kevin Wu  wrote:
>>>
>>>> Hey Chia-Ping,
>>>>
>>>> I hadn't considered adding the supported versions for each feature as a
>>>> metric, but I'm not sure if it's helpful for monitoring the progress of an
>>>> upgrade/downgrade of a feature. For example, if a node doesn't support a
>>>> particular feature level we're upgrading to, we shouldn't even be allowed
>>>> to run the upgrade right? I think that's the case for kraft.version (which
>>>> might be a special case), but I'm not sure about the other features. The
>>>> use case for exposing the finalized feature level is that monitoring it
>>>> across all nodes tells the operator that an upgrade/downgrade of the
>>>> feature was completed on every node.
>>>>
>>>> Best,
>>>> Kevin Wu
>>>>
>>>> On Thu, May 8, 2025 at 9:04 AM Kevin Wu  wrote:
>>>>
>>>>> Hey Jun,
>>>>>
>>>>> Thanks for the comments.
>>>>> 1. I'll update the KIP. My trunk is a bit stale.
>>>>> 2. Yeah, the metric should report the finalized feature level for the
>>>>> feature. And if it is not set, the metric will report 0.
>>>>> 3. I'll update the KIP with a timeline.
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>> On Wed, May 7, 2025 at 3:10 PM Kevin Wu 
>>>>> wrote:
>>>>>
>>>>>> Hey Jose,
>>>>>>
>>>>>> Thanks for the response. Yeah, the new metric should expose
>>>>>> metadata.version as well. Let me update the KIP to reflect that.
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin Wu
>>>>>>
>>>>>> On Wed, May 7, 2025 at 2:54 PM Kevin Wu 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I wrote a KIP to add a generic feature level metric.
>>>>>>> Here's the link:
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin Wu
>>>>>>>
>>>>>>>
>>>>>>>


[VOTE] KIP-1180: Add generic feature level metrics

2025-05-22 Thread Kevin Wu
Hello all,

I would like to call a vote for KIP-1180: Add generic feature level metrics.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+generic+feature+level+metrics

Discussion thread:
https://lists.apache.org/thread/w4ml9ffkj1j31j8kjpbywq9jsw5ck5sr

Thanks to all for the insightful discussion,
Kevin Wu


[VOTE] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-06-17 Thread Kevin Wu
Hello all,

I would like to call a vote for KIP-1186: Update AddRaftVoterRequest RPC to
support auto-join.

KIP link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join

Discussion thread link:
https://lists.apache.org/thread/ko478l71jf9hqhhg328tcdr46nj2wcz9

Thanks,
Kevin Wu


Re: [DISCUSS] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-06-17 Thread Kevin Wu
Hi Alyssa,

Thanks for the feedback.
1. Yeah, I guess I do not state explicitly why this issue does not impact
controllers that are manually added via the AdminClient. I'll add a section
to clarify the difference in the situations.
2. I touched on this a bit in the Proposed Changes section, but I agree
with José on the documentation of this field. How the field is set is up to
the implementation, which can change over time (what if the AdminClient
changes in the future to return a response before commitment?), so
documenting how we're using it now is not as accurate as documenting what
changes in the KRaft protocol.

Best,
Kevin Wu

On Thu, Jun 12, 2025 at 11:49 AM Kevin Wu  wrote:

> Hi Jose,
>
> Thanks for the feedback. I agree with the solution of not ignoring the new
> field, and I see how the current documentation is not descriptive in terms
> of what the flag is actually doing within the protocol. I will update the
> KIP to change these things.
>
> Best,
> Kevin
>
> On Wed, Jun 11, 2025 at 2:55 PM Kevin Wu  wrote:
>
>> Hello all,
>>
>> I wrote a KIP to add a new boolean field to the AddRaftVoterRequest RPC.
>> Here is the link:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join
>>
>>
>> Thanks,
>> Kevin Wu
>>
>


[DISCUSS] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-06-11 Thread Kevin Wu
Hello all,

I wrote a KIP to add a new boolean field to the AddRaftVoterRequest RPC.
Here is the link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join


Thanks,
Kevin Wu


Re: [DISCUSS] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-06-12 Thread Kevin Wu
Hi Jose,

Thanks for the feedback. I agree with the solution of not ignoring the new
field, and I see how the current documentation is not descriptive in terms
of what the flag is actually doing within the protocol. I will update the
KIP to change these things.

Best,
Kevin

On Wed, Jun 11, 2025 at 2:55 PM Kevin Wu  wrote:

> Hello all,
>
> I wrote a KIP to add a new boolean field to the AddRaftVoterRequest RPC.
> Here is the link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join
>
>
> Thanks,
> Kevin Wu
>


RE: [DISCUSS] KIP-1190: Add a metric for controller thread idleness

2025-07-07 Thread Kevin Wu
Hi Mahsa,

Thanks for the KIP.

In the Motivation section, can we state why the current metrics involving
the controller's event queue thread -- time spent in the queue and process
time -- are not sufficient?
Can we also match the naming style of those other event queue metrics for
consistency (i.e. the type should be ControllerEventManager)?
I think it would also be helpful to explain how this proposed metric will
be monitored by the operator.

Best,
Kevin Wu

On 2025/07/03 20:40:19 Mahsa Seifikar wrote:
> Hello all,
>
> I wrote a short KIP to add a new metric for controller thread idleness.
>
> Here is the link:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1190%3A+Add+a+metric+for+controller+thread+idleness
>
> Thanks,
> Mahsa Seifikar
>


Re: [DISCUSS] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-07-08 Thread Kevin Wu
Hi Jun,

> So, the new controller
> should be able to send a version of the AddRaftVoter request that the
> leader supports, right?

The new controller can send a supported version for the RPC, but we do not
want that to happen. This is because a controller sending AddRaftVoter with
version 0 can cause the unavailability scenario described in the Motivation
section. By making the field not ignorable, the local NetworkClient will
return an unsupported version response without actually sending anything
over the wire.
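
The mechanism relied on here is roughly the pattern below. This is a
simplified sketch rather than the generated code, and the autoJoin field
name is an assumption for illustration:

    import org.apache.kafka.common.errors.UnsupportedVersionException;

    public class AddRaftVoterRequestDataSketch {
        private boolean autoJoin = false; // hypothetical new v1 field

        // Simplified serialization check for a non-ignorable field: rather
        // than silently dropping a non-default value at an old version,
        // fail locally before anything goes over the wire.
        public void write(short version) {
            if (version < 1 && autoJoin) {
                throw new UnsupportedVersionException(
                    "Cannot write non-default autoJoin at version " + version);
            }
            // ... serialize the remaining fields ...
        }
    }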

Best,
Kevin Wu

On Tue, Jun 17, 2025 at 9:40 AM Kevin Wu  wrote:

> Hi Alyssa,
>
> Thanks for the feedback.
> 1. Yeah, I guess I do not state explicitly why this issue does not impact
> controllers that are manually added via the AdminClient. I'll add a section
> to clarify the difference in the situations.
> 2. I touched on this a bit in the Proposed Changes section, but I agree
> with José on the documentation of this field. How the field is set is up to
> the implementation, which can change over time (what if the AdminClient
> changes in the future to return a response before commitment?), so
> documenting how we're using it now is not as accurate as documenting what
> changes in the KRaft protocol.
>
> Best,
> Kevin Wu
>
> On Thu, Jun 12, 2025 at 11:49 AM Kevin Wu  wrote:
>
>> Hi Jose,
>>
>> Thanks for the feedback. I agree with the solution of not ignoring the
>> new field, and I see how the current documentation is not descriptive in
>> terms of what the flag is actually doing within the protocol. I will update
>> the KIP to change these things.
>>
>> Best,
>> Kevin
>>
>> On Wed, Jun 11, 2025 at 2:55 PM Kevin Wu  wrote:
>>
>>> Hello all,
>>>
>>> I wrote a KIP to add a new boolean field to the AddRaftVoterRequest RPC.
>>> Here is the link:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join
>>>
>>>
>>> Thanks,
>>> Kevin Wu
>>>
>>


RE: Re: [DISCUSS] KIP-1190: Add a metric for controller thread idleness

2025-07-10 Thread Kevin Wu
Hi Mahsa,

Thanks for the KIP. I think there was an issue with my original reply since
it is not showing up on the thread. Trying again.

In the Motivation section, can we state why the current metrics involving
the controller's event queue thread -- time spent in the queue and process
time -- are not sufficient?
Can we also match the naming style of those other event queue metrics for
consistency (i.e. the type should be ControllerEventManager)?
I think it would also be helpful to explain how this proposed metric will
be monitored by the operator.

Best,
Kevin Wu

On 2025/07/07 20:05:22 Jonah Hooper wrote:
> Thanks for the KIP, Mahsa.
>
> Have one initial question:
>
> > The ratio of time the controller thread is idle relative to the total time
> > (idle+active).
>
>
> How is the active and idle time calculated? Is it in total over the time
> period in which the controller is active? Or is there a specific window
> period?
>
> Best,
> Jonah Hooper
>
>
> On Thu, Jul 3, 2025 at 4:41 PM Mahsa Seifikar
>  wrote:
>
> > Hello all,
> >
> > I wrote a short KIP to add a new metric for controller thread idleness.
> >
> > Here is the link:
> >
> >
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1190%3A+Add+a+metric+for+controller+thread+idleness
> >
> > Thanks,
> > Mahsa Seifikar
> >
>


Re: [DISCUSS] KIP-1190: Add a metric for controller thread idleness

2025-07-11 Thread Kevin Wu
Hi Mahsa and Jonah,

Since we're adding this new metric to a metrics group that is still using
Yammer, ideally I think we want to use RatioGauge to give us the sampling
functionality we need. It's possible that we can get similar functionality
from Histogram, which I know other Yammer metrics in Kafka use. We are
still able to get gauge metrics from the histogram, as they are the most
straightforward for the operator to monitor (e.g. if the metric value > X,
alert). For example, metrics that are histograms, like EventQueueTimeMs,
are often monitored via their p99 or p999 value.
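
A minimal sketch of the RatioGauge approach (assuming the com.yammer.metrics
2.x API that Kafka's server-side metrics groups still use; the class and
metric names here are illustrative, not the KIP's):

import com.yammer.metrics.Metrics;
import com.yammer.metrics.util.RatioGauge;
import java.util.concurrent.atomic.AtomicLong;

public class ControllerIdleRatioSketch {
    private final AtomicLong idleNanos = new AtomicLong();
    private final AtomicLong activeNanos = new AtomicLong();

    public ControllerIdleRatioSketch() {
        // The gauge recomputes the ratio every time it is polled.
        Metrics.newGauge(ControllerIdleRatioSketch.class, "ThreadIdleRatio",
            new RatioGauge() {
                @Override
                protected double getNumerator() {
                    return idleNanos.get();
                }

                @Override
                protected double getDenominator() {
                    // idle / (idle + active), so the value is bounded by [0, 1].
                    return idleNanos.get() + activeNanos.get();
                }
            });
    }

    public void recordIdle(long nanos) { idleNanos.addAndGet(nanos); }
    public void recordActive(long nanos) { activeNanos.addAndGet(nanos); }
}

Note that this sketch reports a since-startup ratio; the windowed behavior
Jonah asked about would require periodically sampling and resetting the
counters.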

There are several other "thread-idle-ratio" metrics in Kafka, but those are
all using the newer, internal KafkaMetrics library's sensors.

Best,
Kevin Wu

On Thu, Jul 10, 2025 at 2:09 PM Mahsa Seifikar
 wrote:

> Hi Jonah and Kevin,
>
> Thanks for your comments. I have now updated the KIP to address your
> feedback.
>
> Please let me know if you have any further questions.
>
> Best,
> Mahsa Seifikar
>
> On Thu, Jul 3, 2025 at 4:40 PM Mahsa Seifikar 
> wrote:
>
> > Hello all,
> >
> > I wrote a short KIP to add a new metric for controller thread idleness.
> >
> > Here is the link:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1190%3A+Add+a+metric+for+controller+thread+idleness
> >
> > Thanks,
> > Mahsa Seifikar
> >
> >
> >
>


Re: [VOTE] KIP-1186: Update AddRaftVoterRequest RPC to support auto-join

2025-06-25 Thread Kevin Wu
Hello all,

I am manually bumping this thread.
Any feedback or votes would be appreciated.

Best regards,
Kevin Wu

On Tue, Jun 17, 2025 at 9:55 AM Kevin Wu  wrote:

> Hello all,
>
> I would like to call a vote for KIP-1186: Update AddRaftVoterRequest RPC
> to support auto-join.
>
> KIP link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1186%3A+Update+AddRaftVoterRequest+RPC+to+support+auto-join
>
> Discussion thread link:
> https://lists.apache.org/thread/ko478l71jf9hqhhg328tcdr46nj2wcz9
>
> Thanks,
> Kevin Wu
>


Re: [DISCUSS] KIP-1190: Add a metric for controller thread idleness

2025-07-22 Thread Kevin Wu
Hi Mahsa,

I see you have the definition of the metric value as:
controller idle ratio = idle_time/active_time

Shouldn't the value for a ratio be:
controller idle ratio = idle_time/total_time
where total_time = idle_time + active_time?
This lines up with the definition you outlined earlier in the metric value
description.
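
To make the difference concrete with illustrative numbers: if the thread
spent 750 ms idle and 250 ms active over a window, then
idle_time/total_time = 750/1000 = 0.75, which is bounded by [0, 1], whereas
idle_time/active_time = 750/250 = 3.0, which is not.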

Best,
Kevin Wu

On Tue, Jul 22, 2025 at 11:54 AM Mahsa Seifikar
 wrote:

> Thanks Jonah and Kevin for the feedback.
>
> I have updated the KIP accordingly. We ideally want to use something like
> the TimeRatio type for this metric, similar to how "poll-idle-ratio" is
> measured in KafkaRaftMetrics.
>
> Please let me know if you have any further feedback.
>
> Best,
> Mahsa Seifikar
>
> On Fri, Jul 11, 2025 at 4:08 PM Kevin Wu  wrote:
>
> > Hi Mahsa and Jonah,
> >
> > Since we're adding this new metric to a metrics group that is still using
> > Yammer, ideally I think we want to use RatioGauge to give us the sampling
> > functionality we need. It's possible that we can get similar
> functionality
> > from Histogram, which I know other Yammer metrics in Kafka use. We are
> > still able to get gauge metrics from the histogram, as they are the most
> > straightforward for the operator to monitor (e.g. if the metric value >
> X,
> > alert). For example, metrics that are histograms, like EventQueueTimeMs,
> > are often monitored via their p99 or p999 value.
> >
> > There are several other "thread-idle-ratio" metrics in Kafka, but those
> are
> > all using the newer, internal KafkaMetrics library's sensors.
> >
> > Best,
> > Kevin Wu
> >
> > On Thu, Jul 10, 2025 at 2:09 PM Mahsa Seifikar
> >  wrote:
> >
> > > Hi Jonah and Kevin,
> > >
> > > Thanks for your comments. I have now updated the KIP to address your
> > > feedback.
> > >
> > > Please let me know if you have any further questions.
> > >
> > > Best,
> > > Mahsa Seifikar
> > >
> > > On Thu, Jul 3, 2025 at 4:40 PM Mahsa Seifikar 
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I wrote a short KIP to add a new metric for controller thread
> idleness.
> > > >
> > > > Here is the link:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1190%3A+Add+a+metric+for+controller+thread+idleness
> > > >
> > > > Thanks,
> > > > Mahsa Seifikar
> > > >
> > > >
> > > >
> > >
> >
>


Re: [VOTE] KIP-1190: Add a metric for controller thread idleness

2025-07-31 Thread Kevin Wu
Hi Mahsa,

Thanks for the KIP.
+1 (non-binding)

Best,
Kevin Wu

On Thu, Jul 31, 2025 at 3:17 PM Mahsa Seifikar
 wrote:

> Hello all,
>
> I would like to start a vote for KIP-1190: Add a metric for controller
> thread idleness.
>
> KIP link:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1190%3A+Add+a+metric+for+controller+thread+idleness
>
> Discussion thread link:
> https://lists.apache.org/thread/8ky7t8xybgy2omkqld1fbtk16op9p5qo
>
> Thanks,
> Mahsa Seifikar
>


[jira] [Created] (KAFKA-17713) Ensure snapshots are aligned to batch boundaries

2024-10-07 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-17713:


 Summary: Ensure snapshots are aligned to batch boundaries
 Key: KAFKA-17713
 URL: https://issues.apache.org/jira/browse/KAFKA-17713
 Project: Kafka
  Issue Type: Bug
Reporter: Kevin Wu


In the case of a metadata transaction that is started in the middle of a 
record batch, the records preceding the BeginTransactionRecord in the batch are 
flushed. This means the SnapshotGenerator can emit a snapshot whose fetch 
offset (lastContainedOffset + 1) "X" is not aligned to a batch boundary. 
Followers/observers that fetch and apply such a snapshot become unable to fetch 
successfully thereafter: the leader serves the record batch containing offset X 
starting at its base offset X - M, and the follower rejects it because 
X - M < X.
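
To make this concrete with illustrative numbers: suppose a batch spans
offsets 100-109 and a snapshot is generated with lastContainedOffset = 104,
so X = 105. A follower that applies the snapshot starts fetching at 105, but
the leader serves the whole batch beginning at its base offset 100, and the
follower rejects a batch whose base offset (100) is below its log start
offset (105), so replication stalls.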



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17713) Ensure snapshots are aligned to batch boundaries

2024-10-10 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-17713.
--
Resolution: Done

[https://github.com/apache/kafka/pull/17398] merged.

> Ensure snapshots are aligned to batch boundaries
> 
>
> Key: KAFKA-17713
> URL: https://issues.apache.org/jira/browse/KAFKA-17713
> Project: Kafka
>  Issue Type: Bug
>        Reporter: Kevin Wu
>    Assignee: Kevin Wu
>Priority: Major
>
> In the case of a metadata transaction that is started in the middle of a 
> record batch, the records preceding the BeginTransactionRecord in the batch 
> are flushed. This means the SnapshotGenerator can emit a snapshot whose fetch 
> offset (lastContainedOffset + 1) "X" is not aligned to a batch boundary. 
> Followers/observers that fetch and apply such a snapshot become unable to 
> fetch successfully thereafter: the leader serves the record batch containing 
> offset X starting at its base offset X - M, and the follower rejects it 
> because X - M < X.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17916) Convert Kafka Connect system tests to use KRaft

2024-11-05 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-17916.
--
Resolution: Fixed

> Convert Kafka Connect system tests to use KRaft
> ---
>
> Key: KAFKA-17916
> URL: https://issues.apache.org/jira/browse/KAFKA-17916
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect, system tests
>Affects Versions: 4.0.0
>Reporter: Kevin Wu
>Assignee: Kevin Wu
>Priority: Blocker
>
> The dynamic logging and broker compatibility tests in 
> connect_distributed_test.py and some of the file source and sink tests in 
> connect_test.py are still using ZK since they do not specify a 
> metadata_quorum.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17916) Convert Kafka Connect system tests to use KRaft

2024-10-31 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-17916:


 Summary: Convert Kafka Connect system tests to use KRaft
 Key: KAFKA-17916
 URL: https://issues.apache.org/jira/browse/KAFKA-17916
 Project: Kafka
  Issue Type: Improvement
  Components: connect, system tests
Affects Versions: 4.0.0
Reporter: Kevin Wu


The dynamic logging and broker compatibility tests in 
connect_distributed_test.py and some of the file source and sink tests in 
connect_test.py are still using ZK since they do not specify a metadata_quorum.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17915) Convert Kafka Client tests to use KRaft

2024-10-31 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-17915:


 Summary: Convert Kafka Client tests to use KRaft
 Key: KAFKA-17915
 URL: https://issues.apache.org/jira/browse/KAFKA-17915
 Project: Kafka
  Issue Type: Improvement
  Components: clients, system tests
Affects Versions: 4.0.0
Reporter: Kevin Wu


Need to update the quota, truncation, and client compatibility tests to use 
KRaft. Tests that do not inject a metadata_quorum argument default to using 
ZooKeeper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17917) Convert Kafka core system tests to use KRaft

2024-10-31 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-17917:


 Summary: Convert Kafka core system tests to use KRaft
 Key: KAFKA-17917
 URL: https://issues.apache.org/jira/browse/KAFKA-17917
 Project: Kafka
  Issue Type: Improvement
  Components: core, system tests
Affects Versions: 4.0.0
Reporter: Kevin Wu


The downgrade, group mode transactions, security rolling upgrade, and 
throttling tests should be migrated to KRaft.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-17803) Reconcile Differences in MockLog and KafkaMetadataLog `read` Implementation

2024-10-15 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-17803:


 Summary: Reconcile Differences in MockLog and KafkaMetadataLog 
`read` Implementation
 Key: KAFKA-17803
 URL: https://issues.apache.org/jira/browse/KAFKA-17803
 Project: Kafka
  Issue Type: Improvement
Reporter: Kevin Wu


Calling MockLog or KafkaMetadataLog's read method for a given startOffset 
returns a LogOffsetMetadata object that contains an offset field. In the case 
of MockLog, this offset field is the base offset of the record batch which 
contains startOffset.

However, in KafkaMetadataLog, this offset field is set to the given 
startOffset. If the given startOffset is in the middle of a batch, the returned 
LogOffsetMetadata will have an offset that does not match the file position of 
the returned batch. This makes the javadoc for LogSegment#read inaccurate in 
this case since startOffset is not a lower bound (the base offset of the batch 
containing startOffset is the lower bound). 

The discussed approach was to change MockLog to behave the same way as 
KafkaMetadataLog, since this would be safer than changing the semantics of the 
read call in UnifiedLog.
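
An illustrative sketch of the two behaviors (simplified types, not the actual
MockLog/KafkaMetadataLog code), for a read at startOffset = 104 when the
containing batch has base offset 100:

public class ReadSemanticsSketch {
    record LogOffsetMetadata(long offset) {}

    // Old MockLog behavior: report the base offset of the containing batch.
    static LogOffsetMetadata mockLogRead(long startOffset, long batchBaseOffset) {
        return new LogOffsetMetadata(batchBaseOffset);
    }

    // KafkaMetadataLog behavior: report the requested startOffset, even when
    // it falls in the middle of a batch, so the offset no longer matches the
    // file position of the returned batch.
    static LogOffsetMetadata kafkaMetadataLogRead(long startOffset, long batchBaseOffset) {
        return new LogOffsetMetadata(startOffset);
    }

    public static void main(String[] args) {
        System.out.println(mockLogRead(104, 100).offset());          // 100
        System.out.println(kafkaMetadataLogRead(104, 100).offset()); // 104
    }
}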

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14619) KRaft validate snapshot ids are at batch boundaries

2024-12-09 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-14619.
--
Resolution: Fixed

> KRaft validate snapshot ids are at batch boundaries
> -
>
> Key: KAFKA-14619
> URL: https://issues.apache.org/jira/browse/KAFKA-14619
> Project: Kafka
>  Issue Type: Improvement
>  Components: kraft
>Reporter: José Armando García Sancio
>Assignee: Kevin Wu
>Priority: Major
> Fix For: 4.0.0
>
>
> When the state machine creates a snapshot, KRaft should validate that the 
> provided offset lands at a record batch boundary. This is required because 
> the current log layer and replication protocol do not handle the case where 
> the snapshot id points to the middle of a record batch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17625) Remove ZK from ducktape in 4.0

2025-01-30 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-17625.
--
Resolution: Fixed

This work was completed in the following PRs:
 * [https://github.com/apache/kafka/pull/17669]
 * [https://github.com/apache/kafka/pull/17638]
 * [https://github.com/apache/kafka/pull/17689]
 * [https://github.com/apache/kafka/pull/17847]
 * [https://github.com/apache/kafka/pull/18367]
 * The PRs listed in this issue: 
https://issues.apache.org/jira/browse/KAFKA-17609

> Remove ZK from ducktape in 4.0
> --
>
> Key: KAFKA-17625
> URL: https://issues.apache.org/jira/browse/KAFKA-17625
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Colin McCabe
>    Assignee: Kevin Wu
>Priority: Major
>
> This change will be done in a couple PRs: 
>  * The first to remove existing ZK test parameterizations from ducktape (Done)
>  * Need to migrate existing tests that still use ZK through a default 
> parametrization to KRaft.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18679) KafkaRaftMetrics metrics are exposing doubles instead of integers

2025-01-30 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18679:


 Summary: KafkaRaftMetrics metrics are exposing doubles instead of 
integers
 Key: KAFKA-18679
 URL: https://issues.apache.org/jira/browse/KAFKA-18679
 Project: Kafka
  Issue Type: Bug
Reporter: Kevin Wu


The following metrics are being exposed as floating point doubles instead of 
ints/longs:
 * log-end-offset
 * log-end-epoch
 * number-unknown-voter-connections
 * current-leader
 * current-vote
 * current-epoch
 * high-watermark

This issue extends to a lot of other metrics, which may be intended to report 
only integer/long values but are instead reporting doubles.

 

Link to GH discussion detailing issue further: 
https://github.com/apache/kafka/pull/18304#discussion_r1934364595
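
A sketch of the likely mechanics behind the issue, assuming the
org.apache.kafka.common.metrics interfaces (this is illustrative, not the
KafkaRaftMetrics code): a Measurable can only yield a double, while a
Gauge<Long> preserves the integral type.

import org.apache.kafka.common.metrics.Gauge;
import org.apache.kafka.common.metrics.Measurable;

public class MetricTypeSketch {
    long logEndOffset = 42L;

    // A Measurable always reports a double, so the offset shows up as 42.0.
    Measurable asDouble = (config, now) -> logEndOffset;

    // A Gauge<Long> reports 42 as an integral value.
    Gauge<Long> asLong = (config, now) -> logEndOffset;
}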



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18305) validate controller.listener.names is not in inter.broker.listener.name for kcontrollers

2024-12-18 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18305:


 Summary: validate controller.listener.names is not in 
inter.broker.listener.name for kcontrollers
 Key: KAFKA-18305
 URL: https://issues.apache.org/jira/browse/KAFKA-18305
 Project: Kafka
  Issue Type: Task
Reporter: Kevin Wu


Make an exception for when `inter.broker.listener.name` is not set and is 
instead inferred from the security protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-18305) validate controller.listener.names is not in inter.broker.listener.name for kcontrollers

2024-12-20 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-18305.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> validate controller.listener.names is not in inter.broker.listener.name for 
> kcontrollers
> 
>
> Key: KAFKA-18305
> URL: https://issues.apache.org/jira/browse/KAFKA-18305
> Project: Kafka
>  Issue Type: Task
>    Reporter: Kevin Wu
>Assignee: Kevin Wu
>Priority: Major
> Fix For: 4.0.0
>
>
> Make an exception for when `inter.broker.listener.name` is not set and is 
> instead inferred from the security protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17917) Convert Kafka core system tests to use KRaft

2024-11-21 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-17917.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> Convert Kafka core system tests to use KRaft
> 
>
> Key: KAFKA-17917
> URL: https://issues.apache.org/jira/browse/KAFKA-17917
> Project: Kafka
>  Issue Type: Improvement
>  Components: core, system tests
>Affects Versions: 4.0.0
>Reporter: Kevin Wu
>Assignee: Kevin Wu
>Priority: Blocker
> Fix For: 4.0.0
>
>
> The downgrade, group mode transactions, security rolling upgrade, and 
> throttling tests should be migrated to KRaft. The network degrade test 
> should be refactored to use KafkaService rather than ZookeeperService.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18667) Add ducktape tests for simultaneous broker + controller failure

2025-01-29 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18667:


 Summary: Add ducktape tests for simultaneous broker + controller 
failure
 Key: KAFKA-18667
 URL: https://issues.apache.org/jira/browse/KAFKA-18667
 Project: Kafka
  Issue Type: Task
Reporter: Kevin Wu
Assignee: Kevin Wu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18666) Controller-side monitoring for broker shutdown and startup

2025-01-29 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18666:


 Summary: Controller-side monitoring for broker shutdown and startup
 Key: KAFKA-18666
 URL: https://issues.apache.org/jira/browse/KAFKA-18666
 Project: Kafka
  Issue Type: New Feature
Reporter: Kevin Wu
Assignee: Kevin Wu


KIP link: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18395) Initialize KafkaRaftMetrics without QuorumState to prevent circularity

2025-01-02 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18395:


 Summary: Initialize KafkaRaftMetrics without QuorumState to 
prevent circularity
 Key: KAFKA-18395
 URL: https://issues.apache.org/jira/browse/KAFKA-18395
 Project: Kafka
  Issue Type: Improvement
Reporter: Kevin Wu
Assignee: José Armando García Sancio


To implement https://issues.apache.org/jira/browse/KAFKA-16524, `QuorumState` 
needs to be removed from the `KafkaRaftMetrics` constructor to avoid a 
circularity. That PR's approach is to move `QuorumState` into a 
`KafkaRaftMetrics#initialize` method for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-17431) KRaft servers require valid static socketserver configuration to start

2025-03-19 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-17431.
--
Fix Version/s: 4.1.0
   Resolution: Fixed

> KRaft servers require valid static socketserver configuration to start
> --
>
> Key: KAFKA-17431
> URL: https://issues.apache.org/jira/browse/KAFKA-17431
> Project: Kafka
>  Issue Type: Bug
>Reporter: Colin McCabe
>    Assignee: Kevin Wu
>Priority: Major
> Fix For: 4.1.0
>
>
> KRaft servers require a valid static socketserver configuration to start. 
> However, it would be better if we could support invalid static 
> configurations, as long as there were dynamically set changes that made them 
> valid. This will require reworking startup somewhat so that we start the 
> socket server later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-19228) `kafka-storage format` should not support explicitly setting kraft.version feature level

2025-05-06 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-19228.
--
Resolution: Won't Fix

Closing with won't fix: although the current behavior is undesirable UX, the 
fix would be a breaking change.

I verified that setting `--feature kraft.version=X` alongside any of 
`--standalone, --initial-controllers, --no-initial-controllers` does not write 
anything to `bootstrap.checkpoint` or the `0-0.checkpoint` files, so the 
current behavior is consistent (i.e. the metadata log is always formatted 
correctly).
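
For reference, a sketch of the three dynamic-quorum formatting modes (the
cluster id, directory ids, and paths below are placeholders):

# Single bootstrap voter; writes the voter set and kraft.version records:
bin/kafka-storage.sh format --cluster-id <uuid> --standalone \
    --config config/controller.properties

# Full initial voter set, one id@host:port:directory-id entry per controller:
bin/kafka-storage.sh format --cluster-id <uuid> \
    --initial-controllers "1@ctrl1:9093:<dir-id-1>,2@ctrl2:9093:<dir-id-2>" \
    --config config/controller.properties

# No voter set written at format time; voters are discovered later:
bin/kafka-storage.sh format --cluster-id <uuid> --no-initial-controllers \
    --config config/controller.properties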

> `kafka-storage format` should not support explicitly setting kraft.version 
> feature level
> 
>
> Key: KAFKA-19228
> URL: https://issues.apache.org/jira/browse/KAFKA-19228
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kevin Wu
>Assignee: Kevin Wu
>Priority: Major
>
> When formatting, explicitly setting the kraft.version feature level with 
> --feature kraft.version=X should not be supported. Instead, this feature's 
> level should be inferred from the presence/absence of the following flags: 
> --standalone, --initial-controllers, --no-initial-controllers.
>  * When --standalone or --initial-controllers is specified, this node is 
> using kraft.version=1, and will write a bootstrap snapshot with the KRaft 
> version and voter set control records.
>  * When --no-initial-controllers is specified, the feature level will end up 
> unset because it is not used to write the bootstrap snapshot like with 
> --initial-controllers and --standalone. Instead, the node will default to 
> kraft.version 0 and potentially discover a higher feature level by fetching 
> the log.
>  * If none of these flags are specified, the static config 
> controller.quorum.voters must be defined.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19228) Formatting with `--no-initial-controllers` flag should not write kraft version control record

2025-05-01 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19228:


 Summary: Formatting with `--no-initial-controllers` flag should 
not write kraft version control record
 Key: KAFKA-19228
 URL: https://issues.apache.org/jira/browse/KAFKA-19228
 Project: Kafka
  Issue Type: Bug
Reporter: Kevin Wu
Assignee: Kevin Wu


The current implementation of this flag will write a kraft version control 
record with a value of 1 to the node's log. However, this is not exactly 
correct: KRaft version 1 means the voter set is discoverable from the log, yet 
here the node records KRaft version 1 without any voter set in its log.

The intention of this flag is to indicate that the cluster is already 
bootstrapped with a voter set, so formatting with it should essentially be a 
no-op (i.e. it should not write this version record). This is important 
because it means a cluster on kraft version 0 that discovers the voter set via 
the `controller.quorum.voters` static config can also be formatted with this 
flag without throwing an error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-18956) Enable junit tests to optionally use more than one KRaft controller

2025-03-11 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-18956:


 Summary: Enable junit tests to optionally use more than one KRaft 
controller
 Key: KAFKA-18956
 URL: https://issues.apache.org/jira/browse/KAFKA-18956
 Project: Kafka
  Issue Type: Task
Reporter: Kevin Wu


Currently, the junit tests create just one `controllerServer`. Enabling the 
test framework to support multiple `controllerServers` would allow junit tests 
to exercise behavior during KRaft leadership changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-18667) Add ducktape tests for simultaneous broker + controller failure

2025-04-03 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-18667.
--
Resolution: Fixed

> Add ducktape tests for simultaneous broker + controller failure
> ---
>
> Key: KAFKA-18667
> URL: https://issues.apache.org/jira/browse/KAFKA-18667
> Project: Kafka
>  Issue Type: Task
>        Reporter: Kevin Wu
>    Assignee: Kevin Wu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19254) Add generic feature level metric

2025-05-07 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19254:


 Summary: Add generic feature level metric
 Key: KAFKA-19254
 URL: https://issues.apache.org/jira/browse/KAFKA-19254
 Project: Kafka
  Issue Type: New Feature
Reporter: Kevin Wu
Assignee: Kevin Wu


KIP link: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180%3A+Add+a+generic+feature+level+metric



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19255) KRaft request manager should support one in-flight request per request type

2025-05-08 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19255:


 Summary: KRaft request manager should support one in-flight 
request per request type
 Key: KAFKA-19255
 URL: https://issues.apache.org/jira/browse/KAFKA-19255
 Project: Kafka
  Issue Type: Improvement
Reporter: Kevin Wu
Assignee: Kevin Wu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19400) Update AddRaftVoterRPC to support controller auto-joining

2025-06-11 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19400:


 Summary: Update AddRaftVoterRPC to support controller auto-joining
 Key: KAFKA-19400
 URL: https://issues.apache.org/jira/browse/KAFKA-19400
 Project: Kafka
  Issue Type: Improvement
Reporter: Kevin Wu
Assignee: Kevin Wu
 Fix For: 4.2.0


When AddRaftVoter RPCs are sent as part of auto-joining, the active controller 
should send a response once the new voter set has been appended to its own 
log, without waiting for it to be committed. This allows the auto-joining 
replica to fetch the new voter set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-18666) Controller-side monitoring for broker shutdown and startup

2025-06-11 Thread Kevin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wu resolved KAFKA-18666.
--
Resolution: Fixed

> Controller-side monitoring for broker shutdown and startup
> --
>
> Key: KAFKA-18666
> URL: https://issues.apache.org/jira/browse/KAFKA-18666
> Project: Kafka
>  Issue Type: New Feature
>        Reporter: Kevin Wu
>    Assignee: Kevin Wu
>Priority: Major
> Fix For: 4.1.0
>
>
> KIP link: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19489) storage tool should check controller.quorum.voters is not set alongside a dynamic quorum flag when formatting

2025-07-09 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19489:


 Summary: storage tool should check controller.quorum.voters is not 
set alongside a dynamic quorum flag when formatting
 Key: KAFKA-19489
 URL: https://issues.apache.org/jira/browse/KAFKA-19489
 Project: Kafka
  Issue Type: Bug
Reporter: Kevin Wu


The storage tool allows setting both the static voters config 
({{controller.quorum.voters}}) and one of {{--standalone, 
--initial-controllers, --no-initial-controllers}} when formatting a 
controller, but it should instead throw an exception. This is because setting 
{{controller.quorum.voters}} itself establishes the voter set.

Setting {{controller.quorum.voters}} while formatting with a mix of 
{{--standalone}} and {{--no-initial-controllers}} can result in 2 voter sets. 
For example, in a three node setup, the two nodes formatted with 
{{--no-initial-controllers}} could form a quorum with each other since they 
have the static voter set, while the {{--standalone}} node would ignore the 
config, read the voter set of itself from its log, and form its own quorum 
of 1.
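
As an illustration of the conflicting setup (hostnames and ports are
placeholders):

# Set on all three controllers, including the one formatted with --standalone:
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093

# Nodes 1 and 2, formatted with --no-initial-controllers, adopt the static
# voter set above and can form a quorum with each other. Node 3, formatted
# with --standalone, ignores the config, reads a voter set of {3} from its
# own log, and forms a second quorum of one.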



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-19497) Topic replay code does not handle creation and deletion properly if it occurs in the same batch

2025-07-11 Thread Kevin Wu (Jira)
Kevin Wu created KAFKA-19497:


 Summary: Topic replay code does not handle creation and deletion 
properly if it occurs in the same batch
 Key: KAFKA-19497
 URL: https://issues.apache.org/jira/browse/KAFKA-19497
 Project: Kafka
  Issue Type: Bug
Reporter: Kevin Wu
Assignee: Kevin Wu
 Fix For: 4.2.0


There is a small logic bug in topic replay. If a topic is created and then 
removed before the TopicsDelta is applied, we end up with the deleted topic in 
{{createdTopics}} on the delta but not in {{deletedTopicIds}}. I think we are 
extremely unlikely to see this since MetadataLoader applies the delta for each 
batch of records it receives. Since it's impossible to see a TopicRecord and 
RemoveTopicRecord in the same batch, the only way this could surface is if 
MetadataLoader did some buffering.

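A minimal sketch of the fix (field and method names are assumptions, not the
exact TopicsDelta code): a topic created and then removed within the same
delta should simply disappear from the delta, rather than lingering in
createdTopics.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

class TopicsDeltaSketch {
    private final Map<String, UUID> createdTopics = new HashMap<>();
    private final Set<UUID> deletedTopicIds = new HashSet<>();

    void replayTopicRecord(String name, UUID id) {
        createdTopics.put(name, id);
    }

    void replayRemoveTopicRecord(String name, UUID id) {
        if (createdTopics.remove(name) != null) {
            // Created and deleted within the same delta: the net effect
            // against the pre-delta image is nothing, so record no deletion.
            return;
        }
        deletedTopicIds.add(id);
    }
}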



--
This message was sent by Atlassian Jira
(v8.20.10#820010)