Hi Asaf,

> 1. When a topic fails to load, what happens exactly at that stage? Does
> it switch over to another broker? Is there a retry on the current broker?

There are 2 possibilities:

1. A new topic is created through the CLI tools, the Admin API, or
   producer/consumer auto-creation.
   In this case, if the topic fails to load, the client receives an
   exception and decides whether or not to retry.
2. The topic is unloaded by the LoadManager and then reloaded on another
   broker.
   In this case there are 2 possibilities as well:
   1. The user closes all the producers/consumers and then re-creates
      them. This behaves like case 1 above. The LoadManager updates the
      new broker serviceURL, and the LookUpResult is updated too.
   2. The user does not actively close all the producers/consumers, so
      the client tries to reconnect to the broker. The old broker sends a
      TopicMigratedCommand to the client, which contains the new broker
      serviceURL and port. A timer task makes the client retry the
      connection to the broker, and the topic load is retried as well.

> 2. Is there a way to provide a gauge showing how many topics are
> unloaded due to non-recoverable error? Or maybe topics unloaded but are
> currently retrying?

In general, a topic load failure is temporary. Once the network is
restored, the topic will eventually load successfully. The exception is
when the metadata store or BookKeeper is completely unavailable.

> 3. I presume granularity is per broker. The scenario is: I setup an
> alert on rate of this counter > 0 ? Could it be a transient error thus I’m
> alerted on something less important? How do you see this in production I
> guess is the question.

Yes, it is a broker metric, like `topic_load_times`.
I think the alert threshold can be set a little higher than 0. An
occasional failure or two may only be transient, but many topic load
failures in a short time (say, 20 in 5 minutes) may indicate that the
cluster has a fairly serious problem during that period (say, the network
is unavailable).

Thanks,
Tao Jiuming

> On 19 Dec 2022, at 17:17, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
> I have several context-related questions:
>
>
> 1. When a topic fails to load, what happens exactly at that stage? Does
> it switch over to another broker? Is there a retry on the current broker?
> 2. Is there a way to provide a gauge showing how many topics are
> unloaded due to non-recoverable error? Or maybe topics unloaded but are
> currently retrying?
> 3. I presume granularity is per broker. The scenario is: I setup an
> alert on rate of this counter > 0 ? Could it be a transient error thus I’m
> alerted on something less important? How do you see this in production I
> guess is the question.
>
>
> Thanks!
>
> Asaf
>
> On 19 Dec 2022 at 10:19:39, Jiuming Tao <jm...@streamnative.io.invalid>
> wrote:
>
>> Hello Pulsar community,
>>
>> I've opened `PIP-231: Add metric for topic load failed` for discussion.
>>
>> Motivation:
>> Currently, we have the topic_load_times
>> <https://pulsar.apache.org/docs/next/reference-metrics/#broker-metrics>
>> metric to track how long a successful topic load takes.
>> But when loading a topic, there is some chance that the load fails
>> because of the MetadataStore or BookKeeper, and we don't have any
>> metric to track that.
>>
>> For more details, please read the PIP at
>> https://github.com/apache/pulsar/issues/18979
>> I'm looking forward to hearing what you think.
>>
>> Thanks,
>> Tao Jiuming
>>
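
The alert threshold suggested in the reply (roughly 20 topic load failures within 5 minutes, rather than any rate above 0) amounts to a sliding-window count. Below is a minimal sketch in Python, assuming each failure event arrives with a timestamp in seconds. The class name `LoadFailureWindow` and its parameters are hypothetical illustrations and are not part of Pulsar or the PIP.

```python
from collections import deque


class LoadFailureWindow:
    """Counts topic load failures inside a sliding time window.

    Hypothetical helper for illustration only; the defaults mirror the
    "20 times in 5 mins" example from the reply.
    """

    def __init__(self, threshold=20, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = deque()  # failure timestamps, oldest first

    def _evict(self, now):
        # Drop failures that are no longer inside the window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()

    def record_failure(self, now):
        self._failures.append(now)
        self._evict(now)

    def should_alert(self, now):
        # Alert only when the window holds at least `threshold` failures.
        self._evict(now)
        return len(self._failures) >= self.threshold


if __name__ == "__main__":
    window = LoadFailureWindow()
    window.record_failure(0)
    window.record_failure(10)
    print(window.should_alert(10))  # two transient failures stay below the threshold
```

In production this check would more likely live in the monitoring system as a query over the per-broker failure counter than in application code; the sketch only shows why one or two transient failures do not fire the alert while a burst does.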