Aside: is 2.42.0 going to be an LTS version?

On Wednesday, 1 February 2023 at 09:35:00 UTC [email protected] wrote:

> Or upgrade to 2.42.0. :)
>
> On Wed, Feb 1, 2023 at 9:48 AM Julien Pivotto <[email protected]> 
> wrote:
>
>> On 24 Jan 21:43, Victor Hadianto wrote:
>> > > Also, what version(s) of prometheus are these two instances?
>> > 
>> > They are both the same:
>> > prometheus, version 2.37.0 (branch: HEAD, revision:
>> > b41e0750abf5cc18d8233161560731de05199330)
>>
>> Please update to 2.37.5. There has been a memory leak fixed in 2.37.3.
>>
>>
>>
>> > 
>> > > The RAM usage of Prometheus depends on a number of factors. There's a
>> > > calculator embedded in this article, but it's pretty old now:
>> > >
>> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>> > 
>> > Thanks for this, I'll read & play around with that calculator for our
>> > Prometheus instances (we have 9 in various clusters now).
>> > 
>> > Regards,
>> > Victor
>> > 
>> > 
>> > On Tue, 24 Jan 2023 at 21:03, Brian Candler <[email protected]> wrote:
>> > 
>> > > Also, what version(s) of Prometheus are these two instances? Different
>> > > versions of Prometheus are compiled using different versions of Go,
>> > > which in turn have different degrees of aggressiveness in returning
>> > > unused RAM to the operating system. Also remember Go is a
>> > > garbage-collected language.
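>> > >
>> > > As a rough check (untested, and assuming you scrape each Prometheus
>> > > server's own /metrics under job="prometheus"), you can compare the RSS
>> > > the OS sees with what the Go runtime actually has live:
>> > >
>> > >   process_resident_memory_bytes{job="prometheus"}
>> > >   go_memstats_heap_inuse_bytes{job="prometheus"}
>> > >
>> > > A large gap between the two is typically memory the Go runtime is
>> > > holding on to (or hasn't yet returned to the OS) rather than live data.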
>> > >
>> > > The RAM usage of Prometheus depends on a number of factors. There's a
>> > > calculator embedded in this article, but it's pretty old now:
>> > >
>> > >
>> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>> > >
>> > > On Tuesday, 24 January 2023 at 09:29:47 UTC [email protected] wrote:
>> > >
>> > >> When you say "measured by Kubernetes", what metric specifically?
>> > >>
>> > >> There are several misleading metrics. What matters is
>> > >> `container_memory_rss` or `container_memory_working_set_bytes`. The
>> > >> `container_memory_usage_bytes` metric is misleading because it
>> > >> includes page cache values.
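>> > >>
>> > >> For example (a rough sketch; adjust the pod regex and container name
>> > >> to match your setup), you can put the three side by side for one
>> > >> Prometheus pod:
>> > >>
>> > >>   container_memory_working_set_bytes{pod=~"prometheus-.*", container="prometheus"}
>> > >>   container_memory_rss{pod=~"prometheus-.*", container="prometheus"}
>> > >>   container_memory_usage_bytes{pod=~"prometheus-.*", container="prometheus"}
>> > >>
>> > >> If usage_bytes is much higher than the working set or RSS, the
>> > >> difference is mostly page cache rather than a real leak.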
>> > >>
>> > >> On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> We are running multiple Prometheus instances in Kubernetes (deployed
>> > >>> using Prometheus Operator) and hope that someone can help us understand
>> > >>> why the RAM usage in a few of our instances is unexpectedly high (we
>> > >>> think it's cardinality, but we're not sure where to look).
>> > >>>
>> > >>> In Prometheus A, we have the following stat:
>> > >>>
>> > >>> Number of Series: 56486
>> > >>> Number of Chunks: 56684
>> > >>> Number of Label Pairs: 678
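>> > >>>
>> > >>> (For reference, these are the head stats from the Status -> TSDB page;
>> > >>> I believe the prometheus_tsdb_head_series metric shows the same series
>> > >>> count over time, if that's useful for graphing.)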
>> > >>>
>> > >>> tsdb analyze has the following result:
>> > >>>
>> > >>> /bin $ ./promtool tsdb analyze /prometheus/
>> > >>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
>> > >>> Duration: 1h59m59.368s
>> > >>> Series: 56470
>> > >>> Label names: 26
>> > >>> Postings (unique label pairs): 678
>> > >>> Postings entries (total label pairs): 338705
>> > >>>
>> > >>> This instance uses roughly 4-5 GB of RAM (as measured by
>> > >>> Kubernetes).
>> > >>>
>> > >>> From our reading, each time series should use around 8 KB of RAM, so
>> > >>> 56k series should be using a mere ~500 MB.
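>> > >>> (Working that through: 56,486 series x 8 KB is roughly 450 MB, which
>> > >>> is where our ~500 MB expectation comes from.)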
>> > >>>
>> > >>> On a different Prometheus instance (let's call it Prometheus Central)
>> > >>> we have 1.1M series and it's using 9-10 GB, which is roughly what is
>> > >>> expected.
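>> > >>> (By the same estimate, 1.1M series x 8 KB comes to roughly 8.8 GB,
>> > >>> which is close to what we actually see there.)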
>> > >>>
>> > >>> We're curious about Prometheus A, and we believe the cause is
>> > >>> cardinality: we have a lot more targets in Prometheus A. I also note
>> > >>> that the "Postings entries (total label pairs)" figure is 338k, but
>> > >>> I'm not sure where to look for this.
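>> > >>>
>> > >>> (I'm guessing a query like topk(10, count by (__name__)({__name__=~".+"}))
>> > >>> on Prometheus A would show which metric names hold the most series,
>> > >>> but I'm not sure that's the right way to check.)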
>> > >>>
>> > >>> The top entries from tsdb analyze are right at the bottom of this
>> > >>> post. The "most common label pairs" entries have alarmingly high
>> > >>> counts; I wonder if this contributes to the high "total label pairs"
>> > >>> and consequently the higher than expected RAM usage.
>> > >>>
>> > >>> When calculating the expected RAM usage, is the "total label pairs"
>> > >>> the number we need to use rather than the "total series"?
>> > >>>
>> > >>> Thanks,
>> > >>> Victor
>> > >>>
>> > >>>
>> > >>> Label pairs most involved in churning:
>> > >>> 296 activity_type=none
>> > >>> 258 workflow_type=PodUpdateWorkflow
>> > >>> 163 __name__=temporal_request_latency_bucket
>> > >>> 104 workflow_type=GenerateSPVarsWorkflow
>> > >>> 95 operation=RespondActivityTaskCompleted
>> > >>> 89 __name__=temporal_activity_execution_latency_bucket
>> > >>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
>> > >>> 65 workflow_type=PodInitWorkflow
>> > >>> 53 operation=RespondWorkflowTaskCompleted
>> > >>> 49 __name__=temporal_workflow_endtoend_latency_bucket
>> > >>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>> > >>> 49 __name__=temporal_workflow_task_execution_latency_bucket
>> > >>> 49 __name__=temporal_workflow_task_replay_latency_bucket
>> > >>> 39 activity_type=UpdatePodConnectionsActivity
>> > >>> 38 le=+Inf
>> > >>> 38 le=0.02
>> > >>> 38 le=0.1
>> > >>> 38 le=0.001
>> > >>> 38 activity_type=GenerateSPVarsActivity
>> > >>> 38 le=5
>> > >>>
>> > >>> Label names most involved in churning:
>> > >>> 734 __name__
>> > >>> 734 job
>> > >>> 724 instance
>> > >>> 577 activity_type
>> > >>> 577 workflow_type
>> > >>> 541 le
>> > >>> 177 operation
>> > >>> 95 datname
>> > >>> 53 datid
>> > >>> 31 mode
>> > >>> 29 namespace
>> > >>> 21 state
>> > >>> 12 quantile
>> > >>> 11 container
>> > >>> 11 service
>> > >>> 11 pod
>> > >>> 11 endpoint
>> > >>> 10 scrape_job
>> > >>> 4 alertname
>> > >>> 4 severity
>> > >>>
>> > >>> Most common label pairs:
>> > >>> 23012 activity_type=none
>> > >>> 20060 workflow_type=PodUpdateWorkflow
>> > >>> 12712 __name__=temporal_request_latency_bucket
>> > >>> 8092 workflow_type=GenerateSPVarsWorkflow
>> > >>> 7440 operation=RespondActivityTaskCompleted
>> > >>> 6944 __name__=temporal_activity_execution_latency_bucket
>> > >>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
>> > >>> 5100 workflow_type=PodInitWorkflow
>> > >>> 4140 operation=RespondWorkflowTaskCompleted
>> > >>> 3864 __name__=temporal_workflow_task_replay_latency_bucket
>> > >>> 3864 __name__=temporal_workflow_endtoend_latency_bucket
>> > >>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>> > >>> 3864 __name__=temporal_workflow_task_execution_latency_bucket
>> > >>> 3080 activity_type=UpdatePodConnectionsActivity
>> > >>> 3004 le=0.5
>> > >>> 3004 le=0.01
>> > >>> 3004 le=0.1
>> > >>> 3004 le=1
>> > >>> 3004 le=0.001
>> > >>> 3004 le=0.002
>> > >>>
>> > >>> Label names with highest cumulative label value length:
>> > >>> 8312 scrape_job
>> > >>> 4279 workflow_type
>> > >>> 3994 rule_group
>> > >>> 2614 __name__
>> > >>> 2478 instance
>> > >>> 1564 job
>> > >>> 434 datname
>> > >>> 248 activity_type
>> > >>> 139 mode
>> > >>> 128 operation
>> > >>> 109 version
>> > >>> 97 pod
>> > >>> 88 state
>> > >>> 68 service
>> > >>> 45 le
>> > >>> 44 namespace
>> > >>> 43 slice
>> > >>> 31 container
>> > >>> 28 quantile
>> > >>> 18 alertname
>> > >>>
>> > >>> Highest cardinality labels:
>> > >>> 138 instance
>> > >>> 138 scrape_job
>> > >>> 84 __name__
>> > >>> 75 workflow_type
>> > >>> 71 datname
>> > >>> 70 job
>> > >>> 19 rule_group
>> > >>> 14 le
>> > >>> 10 activity_type
>> > >>> 9 mode
>> > >>> 9 quantile
>> > >>> 6 state
>> > >>> 6 operation
>> > >>> 5 datid
>> > >>> 4 slice
>> > >>> 2 container
>> > >>> 2 pod
>> > >>> 2 alertname
>> > >>> 2 version
>> > >>> 2 service
>> > >>>
>> > >>> Highest cardinality metric names:
>> > >>> 12712 temporal_request_latency_bucket
>> > >>> 6944 temporal_activity_execution_latency_bucket
>> > >>> 6944 temporal_activity_schedule_to_start_latency_bucket
>> > >>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
>> > >>> 3864 temporal_workflow_task_replay_latency_bucket
>> > >>> 3864 temporal_workflow_task_execution_latency_bucket
>> > >>> 3864 temporal_workflow_endtoend_latency_bucket
>> > >>> 2448 pg_locks_count
>> > >>> 1632 pg_stat_activity_count
>> > >>> 908 temporal_request
>> > >>> 690 prometheus_target_sync_length_seconds
>> > >>> 496 temporal_activity_execution_latency_count
>> > >>> 350 go_gc_duration_seconds
>> > >>> 340 pg_stat_database_tup_inserted
>> > >>> 340 pg_stat_database_temp_bytes
>> > >>> 340 pg_stat_database_xact_commit
>> > >>> 340 pg_stat_database_xact_rollback
>> > >>> 340 pg_stat_database_tup_updated
>> > >>> 340 pg_stat_database_deadlocks
>> > >>> 340 pg_stat_database_tup_returned
>> > >>>
>> > 
>>
>> -- 
>> Julien Pivotto
>> @roidelapluie
>>
>
