Re: [prometheus-users] Prometheus RAM usage investigation

Ben Kochie Tue, 24 Jan 2023 01:29:46 -0800

When you say "measured by Kubernetes", what metric specifically?

Several of the container memory metrics are misleading. What matters is
`container_memory_rss` or `container_memory_working_set_bytes`.
`container_memory_usage_bytes` is misleading because it includes the page
cache.
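
To compare the three metrics for the same container, something like the
following rough sketch may help. The namespace, pod, and container values are
hypothetical placeholders, and the label names assume the usual
kubelet/cAdvisor labelling. The second helper reflects my understanding that
cAdvisor derives the working set as usage minus inactive file-backed pages,
floored at zero; treat it as an approximation, not the exact implementation.

```python
# Sketch only: the label values used below are hypothetical placeholders.
METRICS = (
    "container_memory_rss",
    "container_memory_working_set_bytes",
    "container_memory_usage_bytes",
)


def memory_queries(namespace: str, pod: str, container: str = "prometheus") -> dict[str, str]:
    """Build one PromQL selector per metric so they can be graphed side by side."""
    selector = f'namespace="{namespace}", pod="{pod}", container="{container}"'
    return {metric: f"{metric}{{{selector}}}" for metric in METRICS}


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Approximate the working set: usage minus inactive page cache, never negative."""
    return max(usage_bytes - inactive_file_bytes, 0)


# Hypothetical pod name; substitute the real one.
queries = memory_queries("monitoring", "prometheus-a-0")
for query in queries.values():
    print(query)

# Illustrative numbers only: 5 GiB reported usage, 3 GiB of it inactive page cache.
print(working_set_bytes(5 * 1024**3, 3 * 1024**3) / 1024**3, "GiB working set")
```

If `container_memory_usage_bytes` is far above the other two, most of the
difference is likely page cache rather than memory Prometheus needs.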


On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:

> Hi,
>
> We are running multiple Prometheus instances in Kubernetes (deployed using
> Prometheus Operator) and hope that someone can help us understand why
> the RAM usage in a few of our instances is unexpectedly high (we think
> it's cardinality but are not sure where to look).
>
> In Prometheus A, we have the following stats:
>
> Number of Series: 56486
> Number of Chunks: 56684
> Number of Label Pairs: 678
>
> tsdb analyze has the following result:
>
> /bin $ ./promtool tsdb analyze /prometheus/
> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
> Duration: 1h59m59.368s
> Series: 56470
> Label names: 26
> Postings (unique label pairs): 678
> Postings entries (total label pairs): 338705
>
> This instance uses roughly 4 GB to 5 GB of RAM (measured by
> Kubernetes).
>
> From our reading, each time series should use around 8 KB of RAM, so 56k
> series should use a mere 500 MB.
>
> On a different Prometheus instance (let's call it Prometheus Central) we
> have 1.1M series and it's using 9 GB to 10 GB, which is roughly what is
> expected.
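
The arithmetic behind those two estimates can be checked with a few lines.
This is a minimal sketch that takes the roughly 8 KB per series figure quoted
above at face value (treated here as 8 KiB); it is a rule of thumb from the
thread, not a measured value.

```python
# Rule of thumb quoted in the thread; an assumption, not a measurement.
BYTES_PER_SERIES = 8 * 1024


def expected_ram_bytes(series: int, bytes_per_series: int = BYTES_PER_SERIES) -> int:
    """Estimate RAM as number of series times a fixed per-series cost."""
    return series * bytes_per_series


prometheus_a = expected_ram_bytes(56_486)      # "Number of Series" for Prometheus A
central = expected_ram_bytes(1_100_000)        # about 1.1M series on Prometheus Central

print(f"Prometheus A: {prometheus_a / 1024**2:.0f} MiB expected")
print(f"Central:      {central / 1024**3:.1f} GiB expected")
```

On these numbers Central lands close to what is observed, while Prometheus A
uses roughly ten times its estimate, so series count alone does not explain it.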
>
> We're curious about this instance and we believe it's cardinality. We have
> a lot more targets in Prometheus A. I also note that the Postings entries
> (total label pairs) figure is 338k, but I'm not sure where to look for this.
>
> The top entries from tsdb analyze are at the bottom of this post. The
> "most common label pairs" entries have alarmingly high counts; I wonder if
> this contributes to the high "total label pairs" and consequently to the
> higher than expected RAM usage.
>
> When calculating the expected RAM usage, is "total label pairs" the
> number we need to use rather than the "total series"?
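
One way to put the "total label pairs" figure in context is to divide it by
the series count. This is a rough check, assuming postings entries count one
entry for each label pair carried by each series (my reading of the tsdb
analyze output, not something confirmed in this thread).

```python
# Figures copied from the tsdb analyze output quoted above.
series = 56_470
postings_entries = 338_705   # "Postings entries (total label pairs)"
unique_label_pairs = 678     # "Postings (unique label pairs)"

# Under the assumption above, this is the average number of labels per series.
avg_labels_per_series = postings_entries / series

print(f"average labels per series: {avg_labels_per_series:.1f}")
print(f"average series per unique label pair: {postings_entries / unique_label_pairs:.0f}")
```

On that reading the total is simply series count times labels per series, so
it grows with the number of series rather than measuring a separate kind of
cardinality.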
>
> Thanks,
> Victor
>
>
> Label pairs most involved in churning:
> 296 activity_type=none
> 258 workflow_type=PodUpdateWorkflow
> 163 __name__=temporal_request_latency_bucket
> 104 workflow_type=GenerateSPVarsWorkflow
> 95 operation=RespondActivityTaskCompleted
> 89 __name__=temporal_activity_execution_latency_bucket
> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
> 65 workflow_type=PodInitWorkflow
> 53 operation=RespondWorkflowTaskCompleted
> 49 __name__=temporal_workflow_endtoend_latency_bucket
> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> 49 __name__=temporal_workflow_task_execution_latency_bucket
> 49 __name__=temporal_workflow_task_replay_latency_bucket
> 39 activity_type=UpdatePodConnectionsActivity
> 38 le=+Inf
> 38 le=0.02
> 38 le=0.1
> 38 le=0.001
> 38 activity_type=GenerateSPVarsActivity
> 38 le=5
>
> Label names most involved in churning:
> 734 __name__
> 734 job
> 724 instance
> 577 activity_type
> 577 workflow_type
> 541 le
> 177 operation
> 95 datname
> 53 datid
> 31 mode
> 29 namespace
> 21 state
> 12 quantile
> 11 container
> 11 service
> 11 pod
> 11 endpoint
> 10 scrape_job
> 4 alertname
> 4 severity
>
> Most common label pairs:
> 23012 activity_type=none
> 20060 workflow_type=PodUpdateWorkflow
> 12712 __name__=temporal_request_latency_bucket
> 8092 workflow_type=GenerateSPVarsWorkflow
> 7440 operation=RespondActivityTaskCompleted
> 6944 __name__=temporal_activity_execution_latency_bucket
> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
> 5100 workflow_type=PodInitWorkflow
> 4140 operation=RespondWorkflowTaskCompleted
> 3864 __name__=temporal_workflow_task_replay_latency_bucket
> 3864 __name__=temporal_workflow_endtoend_latency_bucket
> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
> 3864 __name__=temporal_workflow_task_execution_latency_bucket
> 3080 activity_type=UpdatePodConnectionsActivity
> 3004 le=0.5
> 3004 le=0.01
> 3004 le=0.1
> 3004 le=1
> 3004 le=0.001
> 3004 le=0.002
>
> Label names with highest cumulative label value length:
> 8312 scrape_job
> 4279 workflow_type
> 3994 rule_group
> 2614 __name__
> 2478 instance
> 1564 job
> 434 datname
> 248 activity_type
> 139 mode
> 128 operation
> 109 version
> 97 pod
> 88 state
> 68 service
> 45 le
> 44 namespace
> 43 slice
> 31 container
> 28 quantile
> 18 alertname
>
> Highest cardinality labels:
> 138 instance
> 138 scrape_job
> 84 __name__
> 75 workflow_type
> 71 datname
> 70 job
> 19 rule_group
> 14 le
> 10 activity_type
> 9 mode
> 9 quantile
> 6 state
> 6 operation
> 5 datid
> 4 slice
> 2 container
> 2 pod
> 2 alertname
> 2 version
> 2 service
>
> Highest cardinality metric names:
> 12712 temporal_request_latency_bucket
> 6944 temporal_activity_execution_latency_bucket
> 6944 temporal_activity_schedule_to_start_latency_bucket
> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
> 3864 temporal_workflow_task_replay_latency_bucket
> 3864 temporal_workflow_task_execution_latency_bucket
> 3864 temporal_workflow_endtoend_latency_bucket
> 2448 pg_locks_count
> 1632 pg_stat_activity_count
> 908 temporal_request
> 690 prometheus_target_sync_length_seconds
> 496 temporal_activity_execution_latency_count
> 350 go_gc_duration_seconds
> 340 pg_stat_database_tup_inserted
> 340 pg_stat_database_temp_bytes
> 340 pg_stat_database_xact_commit
> 340 pg_stat_database_xact_rollback
> 340 pg_stat_database_tup_updated
> 340 pg_stat_database_deadlocks
> 340 pg_stat_database_tup_returned
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/59f74cb9-3135-4fc3-a7e7-9bec02a3143an%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

