Re: [DISCUSS] PIP-231: Add metric for topic load failed

Jiuming Tao Mon, 19 Dec 2022 07:18:54 -0800

Hi Asaf,

>   1. When a topic fails to load, what happens exactly at that stage? Does
>   it switch over to another broker? Is there a retry on the current broker?


There are two possibilities:
1. A new topic is created via the CLI tools, the Admin API, or producer/consumer 
auto-creation.
In this case, if the topic fails to load, the client receives an exception and 
decides whether to retry.
2. The topic is unloaded by the LoadManager and then reloaded on another broker.
In this case, there are again two possibilities:
    1. The user has closed all the producers/consumers and then re-creates them. 
This behaves like case 1 above. The LoadManager updates the new broker 
serviceURL, and the LookUpResult is updated too.
    2. The user didn't actively close all the producers/consumers, so the client 
tries to reconnect to the broker. The old broker sends a TopicMigratedCommand 
to the client, containing the new broker serviceURL and port. A timer task 
makes the client retry the connection to the broker, and the topic load is 
retried as well.
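
To illustrate the first case, where the client receives an exception and 
decides whether to retry, here is a minimal sketch in Python. It is not the 
Pulsar client API: `TopicLoadError`, `connect_with_retry`, and the backoff 
values are hypothetical names and numbers chosen only for illustration.

```python
import time


class TopicLoadError(Exception):
    """Hypothetical stand-in for the exception a client sees when a topic fails to load."""


def connect_with_retry(connect, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call `connect` until it succeeds or `max_attempts` is reached.

    `connect` is any zero-argument callable (assumed here to create a
    producer or consumer). The delay doubles after each failure, capped
    at `max_delay` seconds.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TopicLoadError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The caller stays in control of the retry policy: passing `max_attempts=1` 
means "do not retry", which matches the point that the client decides.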


>   2. Is there a way to provide a gauge showing how many topics are
>   unloaded due to non-recoverable error? Or maybe topics unloaded but are
>   currently retrying?


In general, a topic load failure is temporary. Once the network is restored, 
the topic will eventually load successfully, unless the metadata store or 
BookKeeper is completely unavailable.

>   3. I presume granularity is per broker. The scenario is: I setup an
>   alert on rate of this counter > 0 ? Could it be a transient error thus I’m
>   alerted on something less important? How do you see this in production I
>   guess is the question.



Yes, it is a broker-level metric, like `topic_load_times`. I think the alert 
threshold can be set a little higher than 0. 
An occasional failure or two is probably transient, but many topic load 
failures in a short time (say, 20 in 5 minutes) may indicate that the cluster 
has a relatively serious problem during that period (for example, the network 
is unavailable).
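
As a rough illustration of that threshold, the sketch below counts failures in 
a sliding window and only signals an alert once the count reaches the limit. 
The class name `FailureWindow` and its methods are hypothetical, and the 
defaults (20 failures, 300 seconds) are just the example numbers above. In 
production this check would more likely live in the alerting system than in 
application code.

```python
from collections import deque


class FailureWindow:
    """Sliding-window counter for topic load failures (illustrative only)."""

    def __init__(self, threshold=20, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps (in seconds) of recent failures

    def _evict(self, now):
        # Drop failures that are at least `window_seconds` old.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()

    def record_failure(self, now):
        self._events.append(now)
        self._evict(now)

    def should_alert(self, now):
        self._evict(now)
        return len(self._events) >= self.threshold
```

With these defaults, one or two transient failures never trigger an alert, 
while a burst of 20 within five minutes does.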


Thanks,
Tao Jiuming


> On 19 Dec 2022, at 17:17, Asaf Mesika <asaf.mes...@gmail.com> wrote:
> 
> I have several context-related questions:
> 
> 
>   1. When a topic fails to load, what happens exactly at that stage? Does
>   it switch over to another broker? Is there a retry on the current broker?
>   2. Is there a way to provide a gauge showing how many topics are
>   unloaded due to non-recoverable error? Or maybe topics unloaded but are
>   currently retrying?
>   3. I presume granularity is per broker. The scenario is: I setup an
>   alert on rate of this counter > 0 ? Could it be a transient error thus I’m
>   alerted on something less important? How do you see this in production I
>   guess is the question.
> 
> 
> Thanks!
> 
> Asaf
> 
> On 19 Dec 2022 at 10:19:39, Jiuming Tao <jm...@streamnative.io.invalid>
> wrote:
> 
>> Hello pulsar community,
>> 
>> I've opened `PIP-231: Add metric for topic load failed` to discuss.
>> 
>> Motivation:
>> Currently, we have the topic_load_times
>> <https://pulsar.apache.org/docs/next/reference-metrics/#broker-metrics>
>> metric
>> to track how long a successful topic load takes.
>> But when loading a topic, there is a chance that the load fails
>> due to the MetadataStore or BookKeeper, and we don't have related
>> metrics to track it.
>> 
>> For more details, please read the PIP at
>> https://github.com/apache/pulsar/issues/18979
>> I'm looking forward to hearing what you think.
>> 
>> Thanks,
>> Tao Jiuming
>>
