> Also, what version(s) of prometheus are these two instances?

They are both the same: prometheus, version 2.37.0 (branch: HEAD,
revision: b41e0750abf5cc18d8233161560731de05199330)
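
(Side note, in case it helps for comparing the other instances: each
Prometheus exports its own build details as `prometheus_build_info`,
including the Go version it was compiled with. Assuming each instance
scrapes itself, something along these lines should list them in one place;
just a sketch, the labels to group by may need adjusting:)

  # One result per Prometheus server, with its version and the Go version
  # it was built with carried as labels (assumes a self-scrape)
  group by (instance, version, goversion) (prometheus_build_info)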

> The RAM usage of Prometheus depends on a number of factors. There's a
> calculator embedded in this article, but it's pretty old now:
> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion

Thanks for this, I'll read & play around with that calculator for our
Prometheus instances (we have 9 in various clusters now).
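
In the meantime, a rough sanity check we can run against each instance
(keeping in mind that resident memory also includes heap the Go runtime
has not yet returned) is the actual resident bytes per in-memory series.
A sketch, assuming the instances scrape themselves under a
`job="prometheus"` label:

  # Resident memory divided by head (in-memory) series, per instance.
  # The job label is an assumption; adjust to however the instances are scraped.
  process_resident_memory_bytes{job="prometheus"}
    / prometheus_tsdb_head_series{job="prometheus"}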

Regards,
Victor

On Tue, 24 Jan 2023 at 21:03, Brian Candler <[email protected]> wrote:

> Also, what version(s) of prometheus are these two instances? Different
> versions of Prometheus are compiled using different versions of Go, which
> in turn have different degrees of aggressiveness in returning unused RAM
> to the operating system. Also remember Go is a garbage-collected language.
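
Good point about the garbage collector. To see how much of the resident
memory is live Go heap versus memory the runtime simply hasn't handed back
yet, the instances' own runtime metrics should show the gap (again just a
sketch, assuming a `job="prometheus"` self-scrape):

  # Heap currently in use by the Go runtime inside Prometheus
  go_memstats_heap_inuse_bytes{job="prometheus"}

  # Resident memory as the OS sees it (includes idle heap not yet returned)
  process_resident_memory_bytes{job="prometheus"}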

> The RAM usage of Prometheus depends on a number of factors. There's a
> calculator embedded in this article, but it's pretty old now:
>
> https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>
> On Tuesday, 24 January 2023 at 09:29:47 UTC [email protected] wrote:
>
>> When you say "measured by Kubernetes", what metric specifically?
>>
>> There are several misleading metrics. What matters is
>> `container_memory_rss` or `container_memory_working_set_bytes`. The
>> `container_memory_usage_bytes` metric is misleading because it includes
>> page cache values.
>>
>> On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We are running multiple Prometheus instances in Kubernetes (deployed
>>> using Prometheus Operator) and hope that someone can help us understand
>>> why the RAM usage of a few of our instances is unexpectedly high (we
>>> think it's cardinality, but we're not sure where to look).
>>>
>>> In Prometheus A, we have the following stats:
>>>
>>> Number of Series: 56486
>>> Number of Chunks: 56684
>>> Number of Label Pairs: 678
>>>
>>> tsdb analyze gives the following result:
>>>
>>> /bin $ ./promtool tsdb analyze /prometheus/
>>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
>>> Duration: 1h59m59.368s
>>> Series: 56470
>>> Label names: 26
>>> Postings (unique label pairs): 678
>>> Postings entries (total label pairs): 338705
>>>
>>> This instance uses roughly between 4 GB and 5 GB of RAM (measured by
>>> Kubernetes).
>>>
>>> From our reading, each time series should use around 8 KB of RAM, so
>>> 56k series should be using a mere 500 MB.
>>>
>>> On a different Prometheus instance (let's call it Prometheus Central)
>>> we have 1.1m series and it's using 9 GB - 10 GB, which is roughly what
>>> is expected.
>>>
>>> We're curious about this instance, and we believe it's cardinality. We
>>> have a lot more targets in Prometheus A. I also note that the Postings
>>> entries (total label pairs) figure is 338k, but I'm not sure where to
>>> look for this.
>>>
>>> The top entries from tsdb analyze are right at the bottom of this post.
>>> The "most common label pairs" entries have alarmingly high counts; I
>>> wonder if this contributes to the high "total label pairs" and
>>> consequently the higher than expected RAM usage.
>>>
>>> When calculating the expected RAM usage, is the "total label pairs" the
>>> number we need to use rather than the "total series"?
>>>
>>> Thanks,
>>> Victor
>>>
>>>
>>> Label pairs most involved in churning:
>>> 296 activity_type=none
>>> 258 workflow_type=PodUpdateWorkflow
>>> 163 __name__=temporal_request_latency_bucket
>>> 104 workflow_type=GenerateSPVarsWorkflow
>>> 95 operation=RespondActivityTaskCompleted
>>> 89 __name__=temporal_activity_execution_latency_bucket
>>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
>>> 65 workflow_type=PodInitWorkflow
>>> 53 operation=RespondWorkflowTaskCompleted
>>> 49 __name__=temporal_workflow_endtoend_latency_bucket
>>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>>> 49 __name__=temporal_workflow_task_execution_latency_bucket
>>> 49 __name__=temporal_workflow_task_replay_latency_bucket
>>> 39 activity_type=UpdatePodConnectionsActivity
>>> 38 le=+Inf
>>> 38 le=0.02
>>> 38 le=0.1
>>> 38 le=0.001
>>> 38 activity_type=GenerateSPVarsActivity
>>> 38 le=5
>>>
>>> Label names most involved in churning:
>>> 734 __name__
>>> 734 job
>>> 724 instance
>>> 577 activity_type
>>> 577 workflow_type
>>> 541 le
>>> 177 operation
>>> 95 datname
>>> 53 datid
>>> 31 mode
>>> 29 namespace
>>> 21 state
>>> 12 quantile
>>> 11 container
>>> 11 service
>>> 11 pod
>>> 11 endpoint
>>> 10 scrape_job
>>> 4 alertname
>>> 4 severity
>>>
>>> Most common label pairs:
>>> 23012 activity_type=none
>>> 20060 workflow_type=PodUpdateWorkflow
>>> 12712 __name__=temporal_request_latency_bucket
>>> 8092 workflow_type=GenerateSPVarsWorkflow
>>> 7440 operation=RespondActivityTaskCompleted
>>> 6944 __name__=temporal_activity_execution_latency_bucket
>>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
>>> 5100 workflow_type=PodInitWorkflow
>>> 4140 operation=RespondWorkflowTaskCompleted
>>> 3864 __name__=temporal_workflow_task_replay_latency_bucket
>>> 3864 __name__=temporal_workflow_endtoend_latency_bucket
>>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>>> 3864 __name__=temporal_workflow_task_execution_latency_bucket
>>> 3080 activity_type=UpdatePodConnectionsActivity
>>> 3004 le=0.5
>>> 3004 le=0.01
>>> 3004 le=0.1
>>> 3004 le=1
>>> 3004 le=0.001
>>> 3004 le=0.002
>>>
>>> Label names with highest cumulative label value length:
>>> 8312 scrape_job
>>> 4279 workflow_type
>>> 3994 rule_group
>>> 2614 __name__
>>> 2478 instance
>>> 1564 job
>>> 434 datname
>>> 248 activity_type
>>> 139 mode
>>> 128 operation
>>> 109 version
>>> 97 pod
>>> 88 state
>>> 68 service
>>> 45 le
>>> 44 namespace
>>> 43 slice
>>> 31 container
>>> 28 quantile
>>> 18 alertname
>>>
>>> Highest cardinality labels:
>>> 138 instance
>>> 138 scrape_job
>>> 84 __name__
>>> 75 workflow_type
>>> 71 datname
>>> 70 job
>>> 19 rule_group
>>> 14 le
>>> 10 activity_type
>>> 9 mode
>>> 9 quantile
>>> 6 state
>>> 6 operation
>>> 5 datid
>>> 4 slice
>>> 2 container
>>> 2 pod
>>> 2 alertname
>>> 2 version
>>> 2 service
>>>
>>> Highest cardinality metric names:
>>> 12712 temporal_request_latency_bucket
>>> 6944 temporal_activity_execution_latency_bucket
>>> 6944 temporal_activity_schedule_to_start_latency_bucket
>>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
>>> 3864 temporal_workflow_task_replay_latency_bucket
>>> 3864 temporal_workflow_task_execution_latency_bucket
>>> 3864 temporal_workflow_endtoend_latency_bucket
>>> 2448 pg_locks_count
>>> 1632 pg_stat_activity_count
>>> 908 temporal_request
>>> 690 prometheus_target_sync_length_seconds
>>> 496 temporal_activity_execution_latency_count
>>> 350 go_gc_duration_seconds
>>> 340 pg_stat_database_tup_inserted
>>> 340 pg_stat_database_temp_bytes
>>> 340 pg_stat_database_xact_commit
>>> 340 pg_stat_database_xact_rollback
>>> 340 pg_stat_database_tup_updated
>>> 340 pg_stat_database_deadlocks
>>> 340 pg_stat_database_tup_returned
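
Coming back to the cardinality question above: the analyze output is for a
single ~2h block, so it can also be worth checking the live head block
directly. A sketch of the kind of queries I mean, run against Prometheus A
itself; the results should roughly line up with the "Highest cardinality
metric names" list:

  # The ten metric names holding the most series in the head block right now
  topk(10, count by (__name__) ({__name__=~".+"}))

  # Total head series (should match prometheus_tsdb_head_series)
  count({__name__=~".+"})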

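
And for reference, on the "what metric specifically" question above, the
three cadvisor metrics to compare for the Prometheus A pod would be
something like the following; the namespace/pod/container values are
placeholders, assuming cadvisor metrics are scraped via the kubelet:

  # working_set is roughly usage minus inactive page cache; usage includes page cache
  container_memory_working_set_bytes{namespace="monitoring",pod=~"prometheus-a.*",container="prometheus"}
  container_memory_rss{namespace="monitoring",pod=~"prometheus-a.*",container="prometheus"}
  container_memory_usage_bytes{namespace="monitoring",pod=~"prometheus-a.*",container="prometheus"}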
