Hi Lari,

I had considered the `remote write API`, but there are a couple of problems with it:

1. Do all metrics systems support the Prometheus remote write protocol?
2. How do we handle extra labels (podName etc.)? I don't know whether the extra labels will be added automatically when the metrics are written to the TSDB (see the sketch below).
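For illustration only, here is a rough Java sketch of what a remote-write sender has to do. The generated protobuf classes (Remote/Types, compiled from the Prometheus remote write protos), the endpoint URL and the label values are all assumptions, not existing Pulsar or Prometheus artifacts. In the Prometheus protocol the labels travel with each series inside the WriteRequest, so a sender would typically have to attach podName itself, unless the receiving system adds it through its own relabelling:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.xerial.snappy.Snappy;
    // Hypothetical stubs compiled from the Prometheus remote write protos
    // (remote.proto / types.proto); the package and class names are assumptions.
    import prometheus.Remote;
    import prometheus.Types;

    public class RemoteWriteSketch {
        public static void main(String[] args) throws Exception {
            // Labels live inside each time series of the WriteRequest, so an
            // extra label such as podName has to be attached here by the sender.
            Types.TimeSeries series = Types.TimeSeries.newBuilder()
                    .addLabels(Types.Label.newBuilder().setName("__name__").setValue("pulsar_rate_in"))
                    .addLabels(Types.Label.newBuilder().setName("topic").setValue("persistent://tenant/ns/topic"))
                    .addLabels(Types.Label.newBuilder().setName("podName").setValue("pulsar-broker-0"))
                    .addSamples(Types.Sample.newBuilder()
                            .setValue(42.0)
                            .setTimestamp(System.currentTimeMillis()))
                    .build();

            Remote.WriteRequest writeRequest = Remote.WriteRequest.newBuilder()
                    .addTimeseries(series)
                    .build();

            // The wire format is snappy-compressed protobuf over HTTP POST.
            byte[] body = Snappy.compress(writeRequest.toByteArray());

            HttpRequest post = HttpRequest.newBuilder(URI.create("http://tsdb.example.com/api/v1/write"))
                    .header("Content-Encoding", "snappy")
                    .header("Content-Type", "application/x-protobuf")
                    .header("X-Prometheus-Remote-Write-Version", "0.1.0")
                    .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                    .build();
            HttpClient.newHttpClient().send(post, HttpResponse.BodyHandlers.discarding());
        }
    }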
Thanks,
Tao Jiuming

> Begin forwarded message:
>
> From: Lari Hotari <lhot...@apache.org>
> Subject: Re: [DISCUSS] Introduce FlowControl to metrics endpoint
> Date: August 29, 2022 at 18:40:38 GMT+8
> To: dev@pulsar.apache.org
> Reply-To: dev@pulsar.apache.org
>
> Good reiteration of the problem and good points, Asaf.
>
> I'd like to add a new aspect to the proposal: there might be other solutions that would be useful in the case of a large number of topics in a Pulsar cluster. Rate limiting on the /metrics endpoint doesn't sound like the correct approach.
>
> When there's a huge amount of metrics, instead of scraping the metrics, it could be more useful to ingest the metrics into Prometheus using the "Remote write API".
> There's a recording of a talk explaining remote write at https://www.youtube.com/watch?v=vMeCyX3Y3HY .
> The specification is https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM/edit# .
> The benefit of this could be that the /metrics endpoint wouldn't be a bottleneck and there wouldn't be a need to do any hacks to support a high number of metrics.
> There might be a need to route the metrics for different namespaces/topics to different destinations. This could be handled in the implementation that uses the Remote write API for pushing metrics.
>
> Regards,
>
> -Lari
>
>
> On Mon, Aug 29, 2022 at 1:12 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
>> Hi Jiuming,
>>
>> I would reiterate the problem statement to make it clear (at least for me):
>>
>> There are cases where a very large number of topics (> 10k per broker) exists and is used in Pulsar. Those topics usually have multiple producers and multiple consumers.
>> There are metrics at the granularity of topics and also at topic/producer and topic/consumer granularity.
>> When that happens, the number of unique metrics is severely high, which causes the response size of the /metrics endpoint (the Prometheus Exposition Format endpoint) to be substantially large - 200MB - 500MB.
>>
>> Every time the metrics are scraped (30 sec, or 1 min), network usage surges due to the /metrics response, thereby causing latency to messages produced or consumed from that broker.
>>
>> The solution proposed is to throttle the /metrics response based on a pre-configured rate limit.
>>
>> Points to consider for this discussion from the participants:
>>
>> 1. Did you happen to experience such difficulties in your clusters?
>> 2. When that happened, did you experience the bottleneck also on the TSDB, be it metrics ingestion or querying?
>>
>> Thanks,
>>
>> Asaf
>>
>>
>> On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid> wrote:
>>
>>> bump
>>>
>>> Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
>>>
>>>> Hi Pulsar community,
>>>>
>>>> Exposing metrics data of a very large size leads to:
>>>> 1. A sudden increase in network usage
>>>> 2. Rising pub/sub latency
>>>>
>>>> To resolve these problems, I've opened a PR: https://github.com/apache/pulsar/pull/16452
>>>>
>>>> Please feel free to help review and discuss it.
>>>>
>>>> Thanks,
>>>> Tao Jiuming
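To make the proposal quoted above concrete: a minimal sketch of throttling the /metrics response at a pre-configured byte rate could look like the following. It uses Guava's RateLimiter and is only an illustration of the general idea, not the actual change in https://github.com/apache/pulsar/pull/16452; the class name and wiring are made up for this example.

    import com.google.common.util.concurrent.RateLimiter;
    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    /**
     * Sketch of byte-rate throttling for the /metrics response: every chunk
     * written to the wrapped stream first acquires "byte permits", so a
     * multi-hundred-megabyte scrape is spread out instead of bursting.
     */
    public class ThrottledMetricsOutputStream extends FilterOutputStream {
        private final RateLimiter bytesPerSecond;

        public ThrottledMetricsOutputStream(OutputStream out, long maxBytesPerSecond) {
            super(out);
            this.bytesPerSecond = RateLimiter.create(maxBytesPerSecond);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            if (len > 0) {
                // Blocks until the configured byte budget allows this chunk.
                bytesPerSecond.acquire(len);
            }
            out.write(buf, off, len);
        }

        @Override
        public void write(int b) throws IOException {
            bytesPerSecond.acquire(1);
            out.write(b);
        }
    }

Wrapping the stream that serves /metrics in such a filter would spread a 200MB-500MB response over a longer window instead of a single burst, which is what the flow-control proposal aims for; the trade-off is that scrapes take longer and may run into the scraper's timeout.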