Hi Lari,

I had considered the `remote write API`, but there are a couple of problems with it:

1. Do all metrics systems support the Prometheus remote write protocol?
2. How do we handle extra labels (podName etc.)? I don't know whether the extra labels will be added automatically when the metrics are written to the TSDB (see the sketch below).
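For illustration only, here is a rough Java sketch of what a remote-write sender has to do. The generated protobuf classes (Remote/Types, compiled from the Prometheus remote write protos), the endpoint URL and the label values are all assumptions, not existing Pulsar or Prometheus artifacts. In the Prometheus protocol the labels travel with each series inside the WriteRequest, so a sender would typically have to attach podName itself, unless the receiving system adds it through its own relabelling:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.xerial.snappy.Snappy;
    // Hypothetical stubs compiled from the Prometheus remote write protos
    // (remote.proto / types.proto); the package and class names are assumptions.
    import prometheus.Remote;
    import prometheus.Types;

    public class RemoteWriteSketch {
        public static void main(String[] args) throws Exception {
            // Labels live inside each time series of the WriteRequest, so an
            // extra label such as podName has to be attached here by the sender.
            Types.TimeSeries series = Types.TimeSeries.newBuilder()
                    .addLabels(Types.Label.newBuilder().setName("__name__").setValue("pulsar_rate_in"))
                    .addLabels(Types.Label.newBuilder().setName("topic").setValue("persistent://tenant/ns/topic"))
                    .addLabels(Types.Label.newBuilder().setName("podName").setValue("pulsar-broker-0"))
                    .addSamples(Types.Sample.newBuilder()
                            .setValue(42.0)
                            .setTimestamp(System.currentTimeMillis()))
                    .build();

            Remote.WriteRequest writeRequest = Remote.WriteRequest.newBuilder()
                    .addTimeseries(series)
                    .build();

            // The wire format is snappy-compressed protobuf over HTTP POST.
            byte[] body = Snappy.compress(writeRequest.toByteArray());

            HttpRequest post = HttpRequest.newBuilder(URI.create("http://tsdb.example.com/api/v1/write"))
                    .header("Content-Encoding", "snappy")
                    .header("Content-Type", "application/x-protobuf")
                    .header("X-Prometheus-Remote-Write-Version", "0.1.0")
                    .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                    .build();
            HttpClient.newHttpClient().send(post, HttpResponse.BodyHandlers.discarding());
        }
    }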
Thanks,
Tao Jiuming

> Begin forwarded message:
>
> From: Lari Hotari <lhot...@apache.org>
> Subject: Re: [DISCUSS] Introduce FlowControl to metrics endpoint
> Date: August 29, 2022 at 18:40:38 GMT+8
> To: dev@pulsar.apache.org
> Reply-To: dev@pulsar.apache.org
>
> Good reiteration of the problem and good points, Asaf.
>
> I'd like to add a new aspect to the proposal: there might be other solutions that would be useful in the case of a large number of topics in a Pulsar cluster. Rate limiting on the /metrics endpoint doesn't sound like the correct approach.
>
> When there's a huge amount of metrics, instead of scraping the metrics, it could be more useful to ingest the metrics into Prometheus using the "Remote write API".
> There's a recording of a talk explaining remote write at https://www.youtube.com/watch?v=vMeCyX3Y3HY .
> The specification is https://docs.google.com/document/d/1LPhVRSFkGNSuU1fBd81ulhsCPR4hkSZyyBj1SZ8fWOM/edit# .
> The benefit of this could be that the /metrics endpoint wouldn't be a bottleneck and there wouldn't be a need to do any hacks to support a high number of metrics.
> There might be a need to route the metrics for different namespaces/topics to different destinations. This could be handled in the implementation that uses the Remote write API for pushing metrics.
>
> Regards,
>
> -Lari
>
>
> On Mon, Aug 29, 2022 at 1:12 PM Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
>> Hi Jiuming,
>>
>> I would reiterate the problem statement to make it clear (at least for me):
>>
>> There are cases where a very large number of topics (> 10k per broker) exists and is used in Pulsar. Those topics usually have multiple producers and multiple consumers.
>> There are metrics at the granularity of topics and also at topic/producer and topic/consumer granularity.
>> When that happens, the number of unique metrics is severely high, which causes the response size of the /metrics endpoint (the Prometheus Exposition Format endpoint) to be substantially large - 200MB - 500MB.
>>
>> Every time the metrics are scraped (30 sec, or 1 min), network usage surges due to the /metrics response, thereby causing latency to messages produced or consumed from that broker.
>>
>> The solution proposed is to throttle the /metrics response based on a pre-configured rate limit.
>>
>> Points to consider for this discussion from the participants:
>>
>> 1. Did you happen to experience such difficulties in your clusters?
>> 2. When that happened, did you experience the bottleneck also on the TSDB, be it metrics ingestion or querying?
>>
>> Thanks,
>>
>> Asaf
>>
>>
>> On Thu, Aug 18, 2022 at 7:40 PM Jiuming Tao <jm...@streamnative.io.invalid> wrote:
>>
>>> bump
>>>
>>> Jiuming Tao <jm...@streamnative.io> wrote on Mon, Aug 8, 2022 at 18:19:
>>>
>>>> Hi Pulsar community,
>>>>
>>>> Exposing metrics data of a very large size leads to:
>>>> 1. A sudden increase in network usage
>>>> 2. Rising pub/sub latency
>>>>
>>>> To resolve these problems, I've opened a PR: https://github.com/apache/pulsar/pull/16452
>>>>
>>>> Please feel free to help review and discuss it.
>>>>
>>>> Thanks,
>>>> Tao Jiuming
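To make the proposal quoted above concrete: a minimal sketch of throttling the /metrics response at a pre-configured byte rate could look like the following. It uses Guava's RateLimiter and is only an illustration of the general idea, not the actual change in https://github.com/apache/pulsar/pull/16452; the class name and wiring are made up for this example.

    import com.google.common.util.concurrent.RateLimiter;
    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    /**
     * Sketch of byte-rate throttling for the /metrics response: every chunk
     * written to the wrapped stream first acquires "byte permits", so a
     * multi-hundred-megabyte scrape is spread out instead of bursting.
     */
    public class ThrottledMetricsOutputStream extends FilterOutputStream {
        private final RateLimiter bytesPerSecond;

        public ThrottledMetricsOutputStream(OutputStream out, long maxBytesPerSecond) {
            super(out);
            this.bytesPerSecond = RateLimiter.create(maxBytesPerSecond);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            if (len > 0) {
                // Blocks until the configured byte budget allows this chunk.
                bytesPerSecond.acquire(len);
            }
            out.write(buf, off, len);
        }

        @Override
        public void write(int b) throws IOException {
            bytesPerSecond.acquire(1);
            out.write(b);
        }
    }

Wrapping the stream that serves /metrics in such a filter would spread a 200MB-500MB response over a longer window instead of a single burst, which is what the flow-control proposal aims for; the trade-off is that scrapes take longer and may run into the scraper's timeout.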