Re: [prometheus-users] Prometheus RAM usage investigation

Omero Saienni Sun, 19 Feb 2023 16:45:40 -0800

I upgraded Prometheus from 2.37.0 to 2.37.5 and I see a negligible difference 
in memory consumption: roughly 400Mi to 460Mi lower at both trough and peak.


Constants: 

Number of label pairs in prometheus-prometheus-my-namespace-0: 455
Number of Targets in prometheus-prometheus-my-namespace-0: 392

What do you suggest we do?

# Analysis

## Version: v2.37.0

### Version: v2.37.0 - Trough

```sh
$ kubectl top pod prometheus-prometheus-my-namespace-0
NAME                                   CPU(cores)   MEMORY(bytes)
prometheus-prometheus-my-namespace-0   31m          8748Mi
```

### Version: v2.37.0 - Peak

```sh
$ kubectl top pod prometheus-prometheus-my-namespace-0
NAME                                   CPU(cores)   MEMORY(bytes)
prometheus-prometheus-my-namespace-0   31m          12160Mi
```

## Version: v2.37.5

### Version: v2.37.5 - Trough

```sh
$ kubectl top pod prometheus-prometheus-my-namespace-0
NAME                                   CPU(cores)   MEMORY(bytes)
prometheus-prometheus-my-namespace-0   31m          8338Mi
```

### Version: v2.37.5 - Peak

```sh
$ kubectl top pod prometheus-prometheus-my-namespace-0
NAME                                   CPU(cores)   MEMORY(bytes)
prometheus-prometheus-my-namespace-0   241m         11698Mi
```
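
The trough and peak readings above can be compared with a small script. This is only a minimal sketch, not part of any Prometheus or Kubernetes tooling: the `readings_mi` dictionary and the helper names `percent_change` and `compare` are hypothetical, and the values are copied by hand from the `kubectl top pod` output.

```python
# Memory readings copied by hand from the `kubectl top pod` output, in Mi.
# The structure and names here are illustrative assumptions.
readings_mi = {
    "v2.37.0": {"trough": 8748, "peak": 12160},
    "v2.37.5": {"trough": 8338, "peak": 11698},
}


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100.0


def compare(before: dict, after: dict) -> dict:
    """Absolute (Mi) and relative (%) change for each phase."""
    return {
        phase: (
            after[phase] - before[phase],
            percent_change(before[phase], after[phase]),
        )
        for phase in before
    }


deltas = compare(readings_mi["v2.37.0"], readings_mi["v2.37.5"])
for phase, (delta_mi, pct) in deltas.items():
    print(f"{phase}: {delta_mi:+d} Mi ({pct:+.1f}%)")
```

With these readings the script reports a drop of 410 Mi (4.7%) at the trough and 462 Mi (3.8%) at the peak, which is small next to the roughly 3,400 Mi swing between trough and peak on either version.
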
On Friday, 17 February 2023 at 14:28:59 UTC+13 Omero Saienni wrote:

> I will upgrade to the LTS. 
>
> I did upgrade to the latest helm chart and did see very little difference 
> but I will send you all some metrics and see how we can proceed.  
>
> Thanks
>
> On Thursday, 2 February 2023 at 00:07:29 UTC+13 Brian Candler wrote:
>
>> That makes sense.  Hopefully the LTS support for 2.37 can be extended in 
>> the meantime.
>>
>> On Wednesday, 1 February 2023 at 10:45:34 UTC Julien Pivotto wrote:
>>
>>> On 01 Feb 02:00, Brian Candler wrote: 
>>> > Aside: is 2.42.0 going to be an LTS version? 
>>>
>>> Hello, 
>>>
>>> I have not updated the website yet, but 2.42 will not be an LTS version. 
>>>
>>> My feeling is that we still need a few releases so that the native 
>>> histogram and OOO ingestion "stabilizes". It is not about waiting for 
>>> them to be stable, but more making sure that the eventual bugs 
>>> introduced in the codebase by those two major features are noticed and 
>>> fixed. 
>>>
>>>
>>> > 
>>> > On Wednesday, 1 February 2023 at 09:35:00 UTC [email protected] wrote: 
>>> > 
>>> > > Or upgrade to 2.42.0. :) 
>>> > > 
>>> > > On Wed, Feb 1, 2023 at 9:48 AM Julien Pivotto <
>>> [email protected]> 
>>> > > wrote: 
>>> > > 
>>> > >> On 24 Jan 21:43, Victor Hadianto wrote: 
>>> > >> > > Also, what version(s) of prometheus are these two instances? 
>>> > >> > 
>>> > >> > They are both the same: 
>>> > >> > prometheus, version 2.37.0 (branch: HEAD, revision: 
>>> > >> > b41e0750abf5cc18d8233161560731de05199330) 
>>> > >> 
>>> > >> Please update to 2.37.5. A memory leak was fixed in 2.37.3. 
>>> > >> 
>>> > >> 
>>> > >> 
>>> > >> > 
>>> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
>>> > >> > > calculator embedded in this article, but it's pretty old now:
>>> > >> > >
>>> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>>> > >> >
>>> > >> > Thanks for this, I'll read & play around with that calculator for our
>>> > >> > Prometheus instances (we have 9 in various clusters now).
>>> > >> > 
>>> > >> > Regards, 
>>> > >> > Victor 
>>> > >> > 
>>> > >> > 
>>> > >> > On Tue, 24 Jan 2023 at 21:03, Brian Candler <[email protected]> wrote:
>>> > >> >
>>> > >> > > Also, what version(s) of prometheus are these two instances? Different
>>> > >> > > versions of Prometheus are compiled using different versions of Go, which
>>> > >> > > in turn have different degrees of aggressiveness in returning unused RAM to
>>> > >> > > the operating system. Also remember Go is a garbage-collected language.
>>> > >> > >
>>> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
>>> > >> > > calculator embedded in this article, but it's pretty old now:
>>> > >> > >
>>> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>>> > >> > > 
>>> > >> > > On Tuesday, 24 January 2023 at 09:29:47 UTC [email protected] wrote:
>>> > >> > >
>>> > >> > >> When you say "measured by Kubernetes", what metric specifically?
>>> > >> > >>
>>> > >> > >> There are several misleading metrics. What matters is
>>> > >> > >> `container_memory_rss` or `container_memory_working_set_bytes`. The
>>> > >> > >> `container_memory_usage_bytes` is misleading because it includes page
>>> > >> > >> cache values.
>>> > >> > >>
>>> > >> > >> On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:
>>> > >> > >> 
>>> > >> > >>> Hi, 
>>> > >> > >>> 
>>> > >> > >>> We are running multiple Prometheus instances in Kubernetes (deployed
>>> > >> > >>> using Prometheus Operator) and hope that someone can help us understand
>>> > >> > >>> why the RAM usage in a few of our instances is unexpectedly high (we think
>>> > >> > >>> it's cardinality but we are not sure where to look).
>>> > >> > >>>
>>> > >> > >>> In Prometheus A, we have the following stats:
>>> > >> > >>> 
>>> > >> > >>> Number of Series: 56486 
>>> > >> > >>> Number of Chunks: 56684 
>>> > >> > >>> Number of Label Pairs: 678 
>>> > >> > >>> 
>>> > >> > >>> tsdb analyze has the following result: 
>>> > >> > >>> 
>>> > >> > >>> /bin $ ./promtool tsdb analyze /prometheus/ 
>>> > >> > >>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW 
>>> > >> > >>> Duration: 1h59m59.368s 
>>> > >> > >>> Series: 56470 
>>> > >> > >>> Label names: 26 
>>> > >> > >>> Postings (unique label pairs): 678 
>>> > >> > >>> Postings entries (total label pairs): 338705 
>>> > >> > >>> 
>>> > >> > >>> This instance uses roughly 4GB to 5GB of RAM (measured by
>>> > >> > >>> Kubernetes).
>>> > >> > >>>
>>> > >> > >>> From our reading, each time series should use around 8KB of RAM, so
>>> > >> > >>> 56k series should use a mere 500MB.
>>> > >> > >>>
>>> > >> > >>> On a different Prometheus instance (let's call it Prometheus Central) we
>>> > >> > >>> have 1.1M series and it's using 9GB to 10GB, which is roughly what is
>>> > >> > >>> expected.
>>> > >> > >>> 
>>> > >> > >>> We're curious about this instance and we believe it's cardinality. We
>>> > >> > >>> have a lot more targets in Prometheus A. I also note that the Postings
>>> > >> > >>> entries (total label pairs) is 338k, but I'm not sure where to look for this.
>>> > >> > >>>
>>> > >> > >>> The top entries from tsdb analyze are right at the bottom of this post.
>>> > >> > >>> The "most common label pairs" entries have alarmingly high counts; I wonder
>>> > >> > >>> if this contributes to the high "total label pairs" and consequently
>>> > >> > >>> higher-than-expected RAM usage.
>>> > >> > >>>
>>> > >> > >>> When calculating the expected RAM usage, is "total label pairs" the
>>> > >> > >>> number we need to use rather than "total series"?
>>> > >> > >>> 
>>> > >> > >>> Thanks, 
>>> > >> > >>> Victor 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> Label pairs most involved in churning: 
>>> > >> > >>> 296 activity_type=none 
>>> > >> > >>> 258 workflow_type=PodUpdateWorkflow 
>>> > >> > >>> 163 __name__=temporal_request_latency_bucket 
>>> > >> > >>> 104 workflow_type=GenerateSPVarsWorkflow 
>>> > >> > >>> 95 operation=RespondActivityTaskCompleted 
>>> > >> > >>> 89 __name__=temporal_activity_execution_latency_bucket 
>>> > >> > >>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket 
>>> > >> > >>> 65 workflow_type=PodInitWorkflow 
>>> > >> > >>> 53 operation=RespondWorkflowTaskCompleted 
>>> > >> > >>> 49 __name__=temporal_workflow_endtoend_latency_bucket 
>>> > >> > >>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket 
>>> > >> > >>> 49 __name__=temporal_workflow_task_execution_latency_bucket 
>>> > >> > >>> 49 __name__=temporal_workflow_task_replay_latency_bucket 
>>> > >> > >>> 39 activity_type=UpdatePodConnectionsActivity 
>>> > >> > >>> 38 le=+Inf 
>>> > >> > >>> 38 le=0.02 
>>> > >> > >>> 38 le=0.1 
>>> > >> > >>> 38 le=0.001 
>>> > >> > >>> 38 activity_type=GenerateSPVarsActivity 
>>> > >> > >>> 38 le=5 
>>> > >> > >>> 
>>> > >> > >>> Label names most involved in churning: 
>>> > >> > >>> 734 __name__ 
>>> > >> > >>> 734 job 
>>> > >> > >>> 724 instance 
>>> > >> > >>> 577 activity_type 
>>> > >> > >>> 577 workflow_type 
>>> > >> > >>> 541 le 
>>> > >> > >>> 177 operation 
>>> > >> > >>> 95 datname 
>>> > >> > >>> 53 datid 
>>> > >> > >>> 31 mode 
>>> > >> > >>> 29 namespace 
>>> > >> > >>> 21 state 
>>> > >> > >>> 12 quantile 
>>> > >> > >>> 11 container 
>>> > >> > >>> 11 service 
>>> > >> > >>> 11 pod 
>>> > >> > >>> 11 endpoint 
>>> > >> > >>> 10 scrape_job 
>>> > >> > >>> 4 alertname 
>>> > >> > >>> 4 severity 
>>> > >> > >>> 
>>> > >> > >>> Most common label pairs: 
>>> > >> > >>> 23012 activity_type=none 
>>> > >> > >>> 20060 workflow_type=PodUpdateWorkflow 
>>> > >> > >>> 12712 __name__=temporal_request_latency_bucket 
>>> > >> > >>> 8092 workflow_type=GenerateSPVarsWorkflow 
>>> > >> > >>> 7440 operation=RespondActivityTaskCompleted 
>>> > >> > >>> 6944 __name__=temporal_activity_execution_latency_bucket 
>>> > >> > >>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket 
>>> > >> > >>> 5100 workflow_type=PodInitWorkflow 
>>> > >> > >>> 4140 operation=RespondWorkflowTaskCompleted 
>>> > >> > >>> 3864 __name__=temporal_workflow_task_replay_latency_bucket 
>>> > >> > >>> 3864 __name__=temporal_workflow_endtoend_latency_bucket 
>>> > >> > >>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket 
>>> > >> > >>> 3864 __name__=temporal_workflow_task_execution_latency_bucket 
>>> > >> > >>> 3080 activity_type=UpdatePodConnectionsActivity 
>>> > >> > >>> 3004 le=0.5 
>>> > >> > >>> 3004 le=0.01 
>>> > >> > >>> 3004 le=0.1 
>>> > >> > >>> 3004 le=1 
>>> > >> > >>> 3004 le=0.001 
>>> > >> > >>> 3004 le=0.002 
>>> > >> > >>> 
>>> > >> > >>> Label names with highest cumulative label value length: 
>>> > >> > >>> 8312 scrape_job 
>>> > >> > >>> 4279 workflow_type 
>>> > >> > >>> 3994 rule_group 
>>> > >> > >>> 2614 __name__ 
>>> > >> > >>> 2478 instance 
>>> > >> > >>> 1564 job 
>>> > >> > >>> 434 datname 
>>> > >> > >>> 248 activity_type 
>>> > >> > >>> 139 mode 
>>> > >> > >>> 128 operation 
>>> > >> > >>> 109 version 
>>> > >> > >>> 97 pod 
>>> > >> > >>> 88 state 
>>> > >> > >>> 68 service 
>>> > >> > >>> 45 le 
>>> > >> > >>> 44 namespace 
>>> > >> > >>> 43 slice 
>>> > >> > >>> 31 container 
>>> > >> > >>> 28 quantile 
>>> > >> > >>> 18 alertname 
>>> > >> > >>> 
>>> > >> > >>> Highest cardinality labels: 
>>> > >> > >>> 138 instance 
>>> > >> > >>> 138 scrape_job 
>>> > >> > >>> 84 __name__ 
>>> > >> > >>> 75 workflow_type 
>>> > >> > >>> 71 datname 
>>> > >> > >>> 70 job 
>>> > >> > >>> 19 rule_group 
>>> > >> > >>> 14 le 
>>> > >> > >>> 10 activity_type 
>>> > >> > >>> 9 mode 
>>> > >> > >>> 9 quantile 
>>> > >> > >>> 6 state 
>>> > >> > >>> 6 operation 
>>> > >> > >>> 5 datid 
>>> > >> > >>> 4 slice 
>>> > >> > >>> 2 container 
>>> > >> > >>> 2 pod 
>>> > >> > >>> 2 alertname 
>>> > >> > >>> 2 version 
>>> > >> > >>> 2 service 
>>> > >> > >>> 
>>> > >> > >>> Highest cardinality metric names: 
>>> > >> > >>> 12712 temporal_request_latency_bucket 
>>> > >> > >>> 6944 temporal_activity_execution_latency_bucket 
>>> > >> > >>> 6944 temporal_activity_schedule_to_start_latency_bucket 
>>> > >> > >>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket 
>>> > >> > >>> 3864 temporal_workflow_task_replay_latency_bucket 
>>> > >> > >>> 3864 temporal_workflow_task_execution_latency_bucket 
>>> > >> > >>> 3864 temporal_workflow_endtoend_latency_bucket 
>>> > >> > >>> 2448 pg_locks_count 
>>> > >> > >>> 1632 pg_stat_activity_count 
>>> > >> > >>> 908 temporal_request 
>>> > >> > >>> 690 prometheus_target_sync_length_seconds 
>>> > >> > >>> 496 temporal_activity_execution_latency_count 
>>> > >> > >>> 350 go_gc_duration_seconds 
>>> > >> > >>> 340 pg_stat_database_tup_inserted 
>>> > >> > >>> 340 pg_stat_database_temp_bytes 
>>> > >> > >>> 340 pg_stat_database_xact_commit 
>>> > >> > >>> 340 pg_stat_database_xact_rollback 
>>> > >> > >>> 340 pg_stat_database_tup_updated 
>>> > >> > >>> 340 pg_stat_database_deadlocks 
>>> > >> > >>> 340 pg_stat_database_tup_returned 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> 
>>> > >> > >>> -- 
>>> > >> > >>> You received this message because you are subscribed to the 
>>> Google 
>>> > >> > >>> Groups "Prometheus Users" group. 
>>> > >> > >>> To unsubscribe from this group and stop receiving emails from 
>>> it, 
>>> > >> send 
>>> > >> > >>> an email to [email protected]. 
>>> > >> > >>> To view this discussion on the web visit 
>>> > >> > >>> 
>>> > >> 
>>> https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com
>>>  
>>> > >> > >>> < 
>>> > >> 
>>> https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com?utm_medium=email&utm_source=footer
>>>  
>>> > >> > 
>>> > >> > >>> . 
>>> > >> > >>> 
>>> > >> 
>>> > >> -- 
>>> > >> Julien Pivotto 
>>> > >> @roidelapluie 
>>> > >> 
>>> > >> 
>>> > > 
>>> > 
>>>
>>>
>>>
>>> -- 
>>> Julien Pivotto 
>>> @roidelapluie 
>>>
>>
