Hi Jiuming,

Let me reiterate the problem statement to make sure it is clear (at least to me):

There are cases where a very large number of topics (more than 10k per
broker) exist and are used in Pulsar. Those topics usually have multiple
producers and multiple consumers.
Some metrics are reported at topic granularity, and others at
topic/producer and topic/consumer granularity.
In that situation, the number of unique metrics becomes extremely high,
which causes the response of the /metrics endpoint (the Prometheus
Exposition Format endpoint) to be very large - 200 MB to 500 MB.
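
To illustrate the cardinality involved, here is a rough back-of-the-envelope
calculation (the per-topic counts and per-entity metric count below are my
own illustrative assumptions, not measurements from a specific cluster):

  10,000 topics x (1 topic + 5 producers + 5 consumers) x ~20 metrics each
  ≈ 2,200,000 unique time series per scrape

At roughly 100-200 bytes per exposition-format line, that is already in the
range of a few hundred MB per response, which matches the sizes above.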

Every time the metrics are scraped (every 30 seconds or 1 minute), network
usage surges due to the /metrics response, which adds latency to messages
produced to or consumed from that broker.

The proposed solution is to throttle the /metrics response based on a
pre-configured rate limit.
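
To make the idea concrete, here is a minimal sketch of what throttling the
response could look like. This is not the actual implementation in the PR;
the class name, the bytes-per-second parameter, and the use of a Guava
RateLimiter are assumptions of mine, purely for illustration:

import com.google.common.util.concurrent.RateLimiter;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Illustrative sketch only: wraps the /metrics response stream and
 * throttles writes to a configured bytes-per-second budget.
 */
public class ThrottledMetricsOutputStream extends OutputStream {

    private final OutputStream delegate;
    // Permits are bytes; e.g. 8 MB/s would spread a 240 MB response over ~30s.
    private final RateLimiter bytesPerSecond;

    public ThrottledMetricsOutputStream(OutputStream delegate, long maxBytesPerSecond) {
        this.delegate = delegate;
        this.bytesPerSecond = RateLimiter.create(maxBytesPerSecond);
    }

    @Override
    public void write(int b) throws IOException {
        bytesPerSecond.acquire(1);
        delegate.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        if (len > 0) {
            // Block until the byte budget allows this chunk to be sent.
            bytesPerSecond.acquire(len);
        }
        delegate.write(buf, off, len);
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

The trade-off, of course, is that smoothing the transfer over a longer
window keeps the scrape request open longer, which is part of what I would
like us to discuss.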

Points for participants to consider in this discussion:

1. Have you experienced such difficulties in your clusters?
2. When that happened, did you also experience a bottleneck on the TSDB
side, be it in metrics ingestion or querying?

Thanks,

Asaf


On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid>
wrote:

> bump
>
> Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
>
> > Hi Pulsar community,
> >
> > When exposing metrics data of a very large size, it will lead to:
> > 1. A sudden increase in network usage
> > 2. Rising pub/sub latency
> >
> > To resolve these problems, I've opened a PR:
> > https://github.com/apache/pulsar/pull/16452
> >
> > Please feel free to help review and discuss it.
> >
> > Thanks,
> > Tao Jiuming
> >
>
