Re: [prometheus-users] Prometheus RAM usage investigation

Julien Pivotto Wed, 01 Feb 2023 00:48:49 -0800

On 24 Jan 21:43, Victor Hadianto wrote:
> > Also, what version(s) of prometheus are these two instances?
> 
> They are both the same:
> prometheus, version 2.37.0 (branch: HEAD, revision:
> b41e0750abf5cc18d8233161560731de05199330)


Please update to 2.37.5. A memory leak was fixed in 2.37.3.



> 
> > The RAM usage of Prometheus depends on a number of factors. There's a
> > calculator embedded in this article, but it's pretty old now:
> > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> 
> Thanks for this, I'll read & play around with that calculator for our
> Prometheus instances (we have 9 in various clusters now).
> 
> Regards,
> Victor
> 
> 
> On Tue, 24 Jan 2023 at 21:03, Brian Candler <[email protected]> wrote:
> 
> > Also, what version(s) of prometheus are these two instances? Different
> > versions of Prometheus are compiled using different versions of Go, which
> > in turn have different degrees of aggressiveness in returning unused RAM to
> > the operating system. Also remember Go is a garbage-collected language.
> >
> > The RAM usage of Prometheus depends on a number of factors. There's a
> > calculator embedded in this article, but it's pretty old now:
> >
> > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
> >
> > On Tuesday, 24 January 2023 at 09:29:47 UTC [email protected] wrote:
> >
> >> When you say "measured by Kubernetes", what metric specifically?
> >>
> >> Several of the memory metrics are misleading. What matters is
> >> `container_memory_rss` or `container_memory_working_set_bytes`.
> >> `container_memory_usage_bytes` is misleading because it includes the
> >> page cache.
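
To illustrate why the usage metric reads high: as far as I know, cAdvisor derives the working set by subtracting the inactive file cache from total usage. A minimal Python sketch of that relationship follows. The function name and the byte values are made up for illustration and are not part of any cAdvisor API.

```python
def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Approximate the working set: total usage minus inactive file cache, floored at 0."""
    return max(usage_bytes - inactive_file_bytes, 0)


GIB = 1024 ** 3

# Hypothetical container: 5 GiB reported usage, of which 3 GiB is inactive page cache.
usage = 5 * GIB
inactive_file = 3 * GIB

ws = working_set_bytes(usage, inactive_file)
print(f"usage = {usage / GIB:.1f} GiB, working set = {ws / GIB:.1f} GiB")
```

If the gap between the two metrics is large, most of the "usage" is reclaimable cache rather than memory held by the Prometheus process.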
> >>
> >> On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> We are running multiple Prometheus instances in Kubernetes (deployed
> >>> using Prometheus Operator) and hope that someone can help us understand
> >>> why the RAM usage in a few of our instances is unexpectedly high (we
> >>> think it's cardinality, but we're not sure where to look).
> >>>
> >>> In Prometheus A, we have the following stat:
> >>>
> >>> Number of Series: 56486
> >>> Number of Chunks: 56684
> >>> Number of Label Pairs: 678
> >>>
> >>> tsdb analyze has the following result:
> >>>
> >>> /bin $ ./promtool tsdb analyze /prometheus/
> >>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
> >>> Duration: 1h59m59.368s
> >>> Series: 56470
> >>> Label names: 26
> >>> Postings (unique label pairs): 678
> >>> Postings entries (total label pairs): 338705
> >>>
> >>> This instance uses roughly 4 GB to 5 GB of RAM (measured by
> >>> Kubernetes).
> >>>
> >>> From our reading, each time series should use around 8 KB of RAM, so
> >>> 56k series should be using a mere 500 MB.
> >>>
> >>> On a different Prometheus instance (let's call it Prometheus Central) we
> >>> have 1.1M series and it's using 9 GB to 10 GB, which is roughly what is
> >>> expected.
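
As a rough check of the figures quoted above, here is the per-series arithmetic as a small Python sketch. It assumes the 8 KB per series rule of thumb from the message and decimal units. The function name is hypothetical, and this is a back-of-envelope estimate, not a model of how Prometheus allocates memory.

```python
KB = 1000  # decimal units, matching the rough figures in the thread


def estimated_ram_bytes(series: int, bytes_per_series: int = 8 * KB) -> int:
    """Back-of-envelope estimate: number of series times an assumed per-series cost."""
    return series * bytes_per_series


instances = [("Prometheus A", 56_486), ("Prometheus Central", 1_100_000)]

for name, series in instances:
    est = estimated_ram_bytes(series)
    print(f"{name}: {series} series -> ~{est / 1e9:.2f} GB")
```

This gives about 0.45 GB for Prometheus A and about 8.8 GB for Prometheus Central, so only the first instance is far from the rule of thumb.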
> >>>
> >>> We're curious about this instance and we believe it's cardinality. We
> >>> have a lot more targets in Prometheus A. I also note that the postings
> >>> entries (total label pairs) figure is 338k, but I'm not sure where to
> >>> look to explain this.
> >>>
> >>> The top entries from tsdb analyze are at the bottom of this post.
> >>> The "most common label pairs" entries have alarmingly high counts. I wonder
> >>> if this contributes to the high "total label pairs" and consequently to
> >>> the higher than expected RAM usage.
> >>>
> >>> When calculating the expected RAM usage, is "total label pairs" the
> >>> number we need to use rather than "total series"?
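
One way to read the "total label pairs" figure, assuming it counts one postings entry per (series, label pair), is as series count times the average number of labels per series. A small Python sketch using the numbers from the tsdb analyze output in this message follows. The function names are illustrative only.

```python
def avg_labels_per_series(postings_entries: int, series: int) -> float:
    """Average number of label pairs attached to each series."""
    return postings_entries / series


def share_of_series(pair_count: int, series: int) -> float:
    """Fraction of all series that carry a given label pair."""
    return pair_count / series


series = 56_470            # "Series" from tsdb analyze
postings_entries = 338_705  # "Postings entries (total label pairs)"

print(f"average labels per series: {avg_labels_per_series(postings_entries, series):.2f}")
print(f"series with activity_type=none: {share_of_series(23_012, series):.1%}")
```

Under that assumption the 338k figure works out to about six labels per series, and a high count under "most common label pairs" only means that many series share that pair.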
> >>>
> >>> Thanks,
> >>> Victor
> >>>
> >>>
> >>> Label pairs most involved in churning:
> >>> 296 activity_type=none
> >>> 258 workflow_type=PodUpdateWorkflow
> >>> 163 __name__=temporal_request_latency_bucket
> >>> 104 workflow_type=GenerateSPVarsWorkflow
> >>> 95 operation=RespondActivityTaskCompleted
> >>> 89 __name__=temporal_activity_execution_latency_bucket
> >>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
> >>> 65 workflow_type=PodInitWorkflow
> >>> 53 operation=RespondWorkflowTaskCompleted
> >>> 49 __name__=temporal_workflow_endtoend_latency_bucket
> >>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> >>> 49 __name__=temporal_workflow_task_execution_latency_bucket
> >>> 49 __name__=temporal_workflow_task_replay_latency_bucket
> >>> 39 activity_type=UpdatePodConnectionsActivity
> >>> 38 le=+Inf
> >>> 38 le=0.02
> >>> 38 le=0.1
> >>> 38 le=0.001
> >>> 38 activity_type=GenerateSPVarsActivity
> >>> 38 le=5
> >>>
> >>> Label names most involved in churning:
> >>> 734 __name__
> >>> 734 job
> >>> 724 instance
> >>> 577 activity_type
> >>> 577 workflow_type
> >>> 541 le
> >>> 177 operation
> >>> 95 datname
> >>> 53 datid
> >>> 31 mode
> >>> 29 namespace
> >>> 21 state
> >>> 12 quantile
> >>> 11 container
> >>> 11 service
> >>> 11 pod
> >>> 11 endpoint
> >>> 10 scrape_job
> >>> 4 alertname
> >>> 4 severity
> >>>
> >>> Most common label pairs:
> >>> 23012 activity_type=none
> >>> 20060 workflow_type=PodUpdateWorkflow
> >>> 12712 __name__=temporal_request_latency_bucket
> >>> 8092 workflow_type=GenerateSPVarsWorkflow
> >>> 7440 operation=RespondActivityTaskCompleted
> >>> 6944 __name__=temporal_activity_execution_latency_bucket
> >>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
> >>> 5100 workflow_type=PodInitWorkflow
> >>> 4140 operation=RespondWorkflowTaskCompleted
> >>> 3864 __name__=temporal_workflow_task_replay_latency_bucket
> >>> 3864 __name__=temporal_workflow_endtoend_latency_bucket
> >>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> >>> 3864 __name__=temporal_workflow_task_execution_latency_bucket
> >>> 3080 activity_type=UpdatePodConnectionsActivity
> >>> 3004 le=0.5
> >>> 3004 le=0.01
> >>> 3004 le=0.1
> >>> 3004 le=1
> >>> 3004 le=0.001
> >>> 3004 le=0.002
> >>>
> >>> Label names with highest cumulative label value length:
> >>> 8312 scrape_job
> >>> 4279 workflow_type
> >>> 3994 rule_group
> >>> 2614 __name__
> >>> 2478 instance
> >>> 1564 job
> >>> 434 datname
> >>> 248 activity_type
> >>> 139 mode
> >>> 128 operation
> >>> 109 version
> >>> 97 pod
> >>> 88 state
> >>> 68 service
> >>> 45 le
> >>> 44 namespace
> >>> 43 slice
> >>> 31 container
> >>> 28 quantile
> >>> 18 alertname
> >>>
> >>> Highest cardinality labels:
> >>> 138 instance
> >>> 138 scrape_job
> >>> 84 __name__
> >>> 75 workflow_type
> >>> 71 datname
> >>> 70 job
> >>> 19 rule_group
> >>> 14 le
> >>> 10 activity_type
> >>> 9 mode
> >>> 9 quantile
> >>> 6 state
> >>> 6 operation
> >>> 5 datid
> >>> 4 slice
> >>> 2 container
> >>> 2 pod
> >>> 2 alertname
> >>> 2 version
> >>> 2 service
> >>>
> >>> Highest cardinality metric names:
> >>> 12712 temporal_request_latency_bucket
> >>> 6944 temporal_activity_execution_latency_bucket
> >>> 6944 temporal_activity_schedule_to_start_latency_bucket
> >>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
> >>> 3864 temporal_workflow_task_replay_latency_bucket
> >>> 3864 temporal_workflow_task_execution_latency_bucket
> >>> 3864 temporal_workflow_endtoend_latency_bucket
> >>> 2448 pg_locks_count
> >>> 1632 pg_stat_activity_count
> >>> 908 temporal_request
> >>> 690 prometheus_target_sync_length_seconds
> >>> 496 temporal_activity_execution_latency_count
> >>> 350 go_gc_duration_seconds
> >>> 340 pg_stat_database_tup_inserted
> >>> 340 pg_stat_database_temp_bytes
> >>> 340 pg_stat_database_xact_commit
> >>> 340 pg_stat_database_xact_rollback
> >>> 340 pg_stat_database_tup_updated
> >>> 340 pg_stat_database_deadlocks
> >>> 340 pg_stat_database_tup_returned
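
To see how much of the block the histogram buckets account for, the counts listed above can be summed. A minimal Python sketch follows, using only the `_bucket` metrics from the "Highest cardinality metric names" list and the block's series count. The function name is hypothetical, and the result is a lower bound, since metrics outside the top entries are not included.

```python
def bucket_share(metric_counts: dict, total_series: int) -> float:
    """Fraction of series whose metric name ends in '_bucket' (histogram buckets)."""
    bucket_series = sum(n for name, n in metric_counts.items() if name.endswith("_bucket"))
    return bucket_series / total_series


# Counts copied from the tsdb analyze output in this message.
top_metrics = {
    "temporal_request_latency_bucket": 12712,
    "temporal_activity_execution_latency_bucket": 6944,
    "temporal_activity_schedule_to_start_latency_bucket": 6944,
    "temporal_workflow_task_schedule_to_start_latency_bucket": 3864,
    "temporal_workflow_task_replay_latency_bucket": 3864,
    "temporal_workflow_task_execution_latency_bucket": 3864,
    "temporal_workflow_endtoend_latency_bucket": 3864,
    "pg_locks_count": 2448,
    "pg_stat_activity_count": 1632,
}

share = bucket_share(top_metrics, total_series=56_470)
print(f"histogram bucket series: {share:.1%} of the block")
```

Roughly three quarters of the series in this block are Temporal histogram buckets.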
> >>>
> >>>
> >>> --
> >>> You received this message because you are subscribed to the Google
> >>> Groups "Prometheus Users" group.
> >>> To unsubscribe from this group and stop receiving emails from it, send
> >>> an email to [email protected].
> >>> To view this discussion on the web visit
> >>> https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com
> >>> <https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com?utm_medium=email&utm_source=footer>
> >>> .
> >>>

-- 
Julien Pivotto
@roidelapluie
