I think there are a few different cases here:

1a. We hit an ApiException PREPARING metadata records on the active controller. 
This is normal and expected. For example, someone tried to create a topic that 
already exists. We translate the ApiException to an ApiError and return the 
appropriate error code to the user. (NotControllerException is also included 
here in 1A)

1b. We hit a non-ApiException PREPARING metadata records on the active 
controller. This is not expected. It would be something like we get a 
NullPointerException in a createTopics RPC. In this case, we resign leadership 
and increment your ForceRenounceCount metric.

2. We hit any exception APPLYING metadata records on the active controller. In 
this case, we just want to exit the process. This prevents the (potentially) 
bad records from ever being commited to the log, since the active controller 
applies the records locally before sending them out to its peers.

3. We hit any exception APPLYING metadata records on a standby controller. In 
this case, we want to increment a metric so that we know that this happened. 
But we don't want to exit the process, in case all the controllers are hitting 
the same problem. Remember that the standby controller only ever applies 
committed records.

4a. We hit any exception constructing the MetadataDelta / new MetadataImage on 
the broker. Increment listener-batch-load-error-count.

4b. We hit any exception applying the MetadataDelta / new MetadataImage on the 
broker. Increment publisher-error-count

The KIP is solid on 4a and 4b, but we have no metric here that we can use for 
case #3. The standby controller can't resign because it's not active in the 
first place. So I would suggest rather than having a resignation metric, we 
just have a general controller metadata error metric.

 How do you feel about:

ForceRenounceCount => 
kafka.controller:type=KafkaController,name=MetadataErrorCount
publisher-error-count => metadata-load-error-count
listener-batch-load-error-count => metadata-apply-error-count

best,
Colin


On Tue, Aug 2, 2022, at 18:05, Niket Goel wrote:
> Thanks for taking the time to go over the KIP Colin.
>
> While I agree with both your points about error handling, I think this 
> KIP focuses on just exposing these errors via the proposed metrics and 
> does not alter the error handling behavior on either the brokers or the 
> controllers. The metrics (as proposed) are somewhat independent of that 
> and should just hook into whatever we are doing. If we change the error 
> handling then the metrics should be hooked into the new code as well.
>
> Maybe the point you are trying to make is that having the properties 
> which you mention below will generally make it so that errors are not 
> committed to the log and hence having the metrics for the is sufficient 
> (as the error will likely be transient and recoverable?).
>
> Apologies if I did not understand the intent of your comment.
>
> - Niket
>
>> On Aug 2, 2022, at 3:34 PM, Colin McCabe <cmcc...@apache.org> wrote:
>> 
>> Hi Niket,
>> 
>> Thanks for the KIP -- much appreciated! The new metrics look very useful.
>> 
>> I agree with the proposed error handling for errors on standby controllers 
>> and brokers. For active controllers, I think we should establish two points:
>> 
>> 1. the active controller replays metadata before submitting it to the Raft 
>> quorum
>> 2. metadata replay errors on the active cause the process to exit, prior to 
>> attempting to commit the record
>> 
>> This will allow most of these metadata replay errors to be noticed and NOT 
>> committed to the metadata log, which I think will make things much more 
>> robust. Since the controller process can be restarted very quickly, it 
>> shouldn't be an undue operational burden. (It's true that when in combined 
>> mode, restarts will take longer, but this kind of tradeoff is integral to 
>> combined mode -- you get reduced fault isolation in exchange for the lower 
>> overhead of one fewer JVM process).
>> 
>> best,
>> Colin
>> 
>> 
>> On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
>>> Thanks, Niket.
>>> 
>>> +1 binding from me
>>> 
>>> -David
>>> 
>>> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid> 
>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I would like to start a vote on KIP-859 which adds some new metrics to 
>>>> KRaft to allow for better visibility into log processing errors.
>>>> 
>>>> KIP 
>>>> —ttps://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>>>  
>>>> <https://www.google.com/url?q=https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%253A%2BAdd%2BMetadata%2BLog%2BProcessing%2BError%2BRelated%2BMetrics&source=gmail-imap&ust=1660084494000000&usg=AOvVaw3enbkssNlSaDhPrm6faYby>
>>>> Discussion Thread — 
>>>> https://www.google.com/url?q=https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt&source=gmail-imap&ust=1660084494000000&usg=AOvVaw2R3Sj0u0NlQOG9XHh-Wgzs
>>>> 
>>>> Thanks
>>>> Niket
>>>>

Reply via email to