Good reiteration of the problem and good points, Asaf. I'd like to add a new aspect to the proposal: there might be other solutions that would be useful when there is a large number of topics in a Pulsar cluster. Rate limiting on the /metrics endpoint doesn't sound like the correct approach.
When there's a huge amount of metrics, instead of scraping them, it could be
more useful to ingest the metrics into Prometheus using the "Remote write
API". There's a recording of a talk explaining remote write at
https://www.youtube.com/watch?v=vMeCyX3Y3HY . The specification is
https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM/edit# .
The benefit would be that the /metrics endpoint wouldn't be a bottleneck and
there wouldn't be a need for any hacks to support a high number of metrics
(a rough sketch of what such a push could look like is at the end of this
message). There might also be a need to route the metrics for different
namespaces/topics to different destinations. This could be handled in the
implementation that uses the Remote write API for pushing metrics.

Regards,

-Lari

On Mon, Aug 29, 2022 at 1:12 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Hi Jiuming,
>
> I would reiterate the problem statement to make it clear (at least for me):
>
> There are cases where a very large number of topics (> 10k per broker) is
> used in Pulsar. Those topics usually have multiple producers and multiple
> consumers.
> There are metrics at topic granularity and also at topic/producer and
> topic/consumer granularity.
> When that happens, the number of unique metrics is extremely high, which
> causes the response size of the /metrics endpoint (the Prometheus
> Exposition Format endpoint) to be substantial - 200MB - 500MB.
>
> Every time the metrics are scraped (every 30 sec or 1 min), network usage
> surges due to the /metrics response, thereby adding latency to messages
> produced or consumed from that broker.
>
> The solution proposed is to throttle the /metrics response based on a
> pre-configured rate limit.
>
> Points to consider for this discussion from the participants:
>
> 1. Did you happen to experience such difficulties in your clusters?
> 2. When that happened, did you also experience a bottleneck on the TSDB,
> be it metrics ingestion or querying?
>
> Thanks,
>
> Asaf
>
>
> On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid>
> wrote:
>
> > bump
> >
> > Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
> >
> > > Hi Pulsar community,
> > >
> > > When exposing metrics data of a very large size, it will lead to:
> > > 1. A sudden increase in network usage
> > > 2. Rising pub/sub latency
> > >
> > > To resolve these problems, I've opened a PR:
> > > https://github.com/apache/pulsar/pull/16452
> > >
> > > Please feel free to review/discuss it.
> > >
> > > Thanks,
> > > Tao Jiuming
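For reference, "pushing" here would mean that the broker (or a sidecar) sends
a snappy-compressed, protobuf-encoded prometheus.WriteRequest to the
receiver's /api/v1/write endpoint, as described in the remote write
specification. Below is a minimal, hypothetical Java sketch of such a push;
it assumes the prometheus.WriteRequest protobuf classes are generated
elsewhere, uses snappy-java for compression, and is only meant to illustrate
the protocol, not to propose a concrete implementation.

// Hypothetical sketch: pushing one batch of metrics to a Prometheus
// remote-write receiver instead of exposing them on /metrics for scraping.
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.xerial.snappy.Snappy; // snappy-java; remote write requires snappy compression

public class RemoteWritePushSketch {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /**
     * Sends one serialized prometheus.WriteRequest protobuf message to a
     * remote-write endpoint (e.g. http://prometheus:9090/api/v1/write when
     * Prometheus is started with --web.enable-remote-write-receiver).
     */
    static void push(String remoteWriteUrl, byte[] serializedWriteRequest)
            throws IOException, InterruptedException {
        // The remote write spec mandates snappy-compressed protobuf bodies.
        byte[] compressed = Snappy.compress(serializedWriteRequest);
        HttpRequest request = HttpRequest.newBuilder(URI.create(remoteWriteUrl))
                // Headers required by the remote write specification
                .header("Content-Encoding", "snappy")
                .header("Content-Type", "application/x-protobuf")
                .header("X-Prometheus-Remote-Write-Version", "0.1.0")
                .POST(HttpRequest.BodyPublishers.ofByteArray(compressed))
                .build();
        HttpResponse<Void> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() >= 300) {
            throw new IOException("Remote write failed with HTTP " + response.statusCode());
        }
    }
}

Routing the metrics for different namespaces/topics to different destinations
would then be just a matter of choosing a different remoteWriteUrl per batch.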