Hi Jiuming,

Let me restate the problem statement to make sure it is clear (at least to me):
There are cases where a very large number of topics (> 10k per broker) exists and is used in Pulsar. Those topics usually have multiple producers and multiple consumers. Some metrics are reported at topic granularity, and others at topic/producer and topic/consumer granularity. In such cases, the number of unique metrics becomes extremely high, which makes the response of the /metrics endpoint (the Prometheus Exposition Format endpoint) substantially large - 200MB to 500MB. Every time the metrics are scraped (every 30 sec or 1 min), network usage surges due to the /metrics response, causing latency for messages produced to or consumed from that broker.

The solution proposed is to throttle the /metrics response based on a pre-configured rate limit (a rough sketch of the idea is included at the end of this message).

Points to consider for this discussion from the participants:
1. Did you happen to experience such difficulties in your clusters?
2. When that happened, did you also experience a bottleneck on the TSDB, be it metrics ingestion or querying?

Thanks,
Asaf

On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid> wrote:

> bump
>
> Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
>
> > Hi Pulsar community,
> >
> > When metrics data of a very large size is exposed, it leads to:
> > 1. A sudden increase in network usage
> > 2. Rising pub/sub latency
> >
> > To address these problems, I have opened a PR:
> > https://github.com/apache/pulsar/pull/16452
> >
> > Please feel free to help review and discuss it.
> >
> > Thanks,
> > Tao Jiuming
> >
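
For readers who want a concrete picture of the proposal: below is a rough sketch (not the actual code from the PR) of one way the /metrics output could be throttled, assuming Guava's RateLimiter is available on the broker classpath. The class name and the bytes-per-second parameter are illustrative only.

import com.google.common.util.concurrent.RateLimiter;

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: wraps the response output stream and caps write
// throughput so a very large /metrics payload is sent at a bounded rate
// instead of in one burst that competes with pub/sub traffic.
public class RateLimitedMetricsOutputStream extends FilterOutputStream {

    // One permit corresponds to one byte, so the limiter is configured
    // in bytes per second.
    private final RateLimiter limiter;

    public RateLimitedMetricsOutputStream(OutputStream out, long bytesPerSecond) {
        super(out);
        this.limiter = RateLimiter.create(bytesPerSecond);
    }

    @Override
    public void write(int b) throws IOException {
        limiter.acquire(1);
        out.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        if (len > 0) {
            // Block until enough permits (bytes) are available for this chunk.
            limiter.acquire(len);
        }
        out.write(buf, off, len);
    }
}

Throttling at the output-stream level spreads a 200MB-500MB response over a longer window instead of a short burst, trading a slower scrape for steadier bandwidth usage; the Prometheus scrape_timeout would then need to be large enough to cover the longer transfer.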