I will upgrade to the LTS release. I already upgraded to the latest Helm chart and saw very little difference, but I will send you all some metrics so we can see how to proceed.
Thanks

On Thursday, 2 February 2023 at 00:07:29 UTC+13 Brian Candler wrote:

> That makes sense. Hopefully the LTS support for 2.37 can be extended in
> the meantime.
>
> On Wednesday, 1 February 2023 at 10:45:34 UTC Julien Pivotto wrote:
>
>> On 01 Feb 02:00, Brian Candler wrote:
>> > Aside: is 2.42.0 going to be an LTS version?
>>
>> Hello,
>>
>> I have not updated the website yet, but 2.42 will not be an LTS version.
>>
>> My feeling is that we still need a few releases so that the native
>> histogram and OOO ingestion "stabilize". It is not about waiting for
>> them to be stable, but more about making sure that any bugs
>> introduced in the codebase by those two major features are noticed and
>> fixed.
>>
>> >
>> > On Wednesday, 1 February 2023 at 09:35:00 UTC [email protected] wrote:
>> >
>> > > Or upgrade to 2.42.0. :)
>> > >
>> > > On Wed, Feb 1, 2023 at 9:48 AM Julien Pivotto <[email protected]>
>> > > wrote:
>> > >
>> > >> On 24 Jan 21:43, Victor Hadianto wrote:
>> > >> > > Also, what version(s) of prometheus are these two instances?
>> > >> >
>> > >> > They are both the same:
>> > >> > prometheus, version 2.37.0 (branch: HEAD, revision:
>> > >> > b41e0750abf5cc18d8233161560731de05199330)
>> > >>
>> > >> Please update to 2.37.5. There has been a memory leak fixed in 2.37.3.
>> > >>
>> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
>> > >> > > calculator embedded in this article, but it's pretty old now:
>> > >> > >
>> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>> > >> >
>> > >> > Thanks for this, I'll read & play around with that calculator for our
>> > >> > Prometheus instances (we have 9 in various clusters now).
>> > >> >
>> > >> > Regards,
>> > >> > Victor
>> > >> >
>> > >> >
>> > >> > On Tue, 24 Jan 2023 at 21:03, Brian Candler <[email protected]> wrote:
>> > >> >
>> > >> > > Also, what version(s) of prometheus are these two instances? Different
>> > >> > > versions of Prometheus are compiled using different versions of Go,
>> > >> > > which in turn have different degrees of aggressiveness in returning
>> > >> > > unused RAM to the operating system. Also remember Go is a
>> > >> > > garbage-collected language.
>> > >> > >
>> > >> > > The RAM usage of Prometheus depends on a number of factors. There's a
>> > >> > > calculator embedded in this article, but it's pretty old now:
>> > >> > >
>> > >> > > https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
>> > >> > >
>> > >> > > On Tuesday, 24 January 2023 at 09:29:47 UTC [email protected] wrote:
>> > >> > >
>> > >> > >> When you say "measured by Kubernetes", what metric specifically?
>> > >> > >>
>> > >> > >> There are several misleading metrics. What matters is
>> > >> > >> `container_memory_rss` or `container_memory_working_set_bytes`. The
>> > >> > >> `container_memory_usage_bytes` metric is misleading because it
>> > >> > >> includes page cache values.
>> > >> > >>
>> > >> > >> On Tue, Jan 24, 2023 at 10:20 AM Victor H <[email protected]> wrote:
>> > >> > >>
>> > >> > >>> Hi,
>> > >> > >>>
>> > >> > >>> We are running multiple Prometheus instances in Kubernetes (deployed
>> > >> > >>> using Prometheus Operator) and hope that someone can help us
>> > >> > >>> understand why the RAM usage in a few of our instances is
>> > >> > >>> unexpectedly high (we think it's cardinality but are not sure where
>> > >> > >>> to look).
>> > >> > >>>
>> > >> > >>> In Prometheus A, we have the following stats:
>> > >> > >>>
>> > >> > >>> Number of Series: 56486
>> > >> > >>> Number of Chunks: 56684
>> > >> > >>> Number of Label Pairs: 678
>> > >> > >>>
>> > >> > >>> tsdb analyze has the following result:
>> > >> > >>>
>> > >> > >>> /bin $ ./promtool tsdb analyze /prometheus/
>> > >> > >>> Block ID: 01GQGMKZAF548DPE2DFZTF1TRW
>> > >> > >>> Duration: 1h59m59.368s
>> > >> > >>> Series: 56470
>> > >> > >>> Label names: 26
>> > >> > >>> Postings (unique label pairs): 678
>> > >> > >>> Postings entries (total label pairs): 338705
>> > >> > >>>
>> > >> > >>> This instance uses roughly 4 GB to 5 GB of RAM (measured by
>> > >> > >>> Kubernetes).
>> > >> > >>>
>> > >> > >>> From our reading, each time series should use around 8 KB of RAM, so
>> > >> > >>> 56k series should be using a mere 500 MB.
>> > >> > >>>
>> > >> > >>> On a different Prometheus instance (let's call it Prometheus Central)
>> > >> > >>> we have 1.1M series and it's using 9 GB to 10 GB, which is roughly
>> > >> > >>> what is expected.
>> > >> > >>>
>> > >> > >>> We're curious about this instance and we believe it's cardinality. We
>> > >> > >>> have a lot more targets in Prometheus A. I also note that the Postings
>> > >> > >>> entries (total label pairs) is 338k but I'm not sure where to look
>> > >> > >>> for this.
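
The arithmetic in the original post can be checked with a short sketch. It assumes the poster's rule of thumb of about 8 KiB per series, which is a rough figure and not a measured value. It also assumes that "Postings entries (total label pairs)" counts one entry per label on each series, so dividing it by the series count gives an average number of labels per series. The variable and function names are illustrative only.

```python
# Rule-of-thumb RAM estimate for the series counts quoted in the post.
# ASSUMPTION: 8 KiB per series is the poster's figure, not a measured value.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
BYTES_PER_SERIES = 8 * KIB  # assumed rule of thumb


def estimated_ram_bytes(series: int, per_series: int = BYTES_PER_SERIES) -> int:
    """Return series * per_series, the naive per-series RAM estimate."""
    return series * per_series


# Figures copied from the post.
series_a = 56_486            # "Number of Series" on Prometheus A
series_central = 1_100_000   # about 1.1 million series on Prometheus Central
block_series = 56_470        # "Series" in the analysed block
postings_entries = 338_705   # "Postings entries (total label pairs)"

est_a_mib = estimated_ram_bytes(series_a) / MIB
est_central_gib = estimated_ram_bytes(series_central) / GIB

# ASSUMPTION: each series contributes one postings entry per label it carries.
labels_per_series = postings_entries / block_series

print(f"Prometheus A estimate:       {est_a_mib:.0f} MiB")
print(f"Prometheus Central estimate: {est_central_gib:.1f} GiB")
print(f"Average label pairs/series:  {labels_per_series:.1f}")
```

Under these assumptions the estimate for Prometheus A comes to roughly 441 MiB, far below the 4 GB to 5 GB reported, while the estimate for Prometheus Central (about 8.4 GiB) is close to what was observed. The average of about six label pairs per series is unremarkable, which suggests the 338k figure mostly reflects the series count and is not a separate cardinality measure.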
>> > >> > >>>
>> > >> > >>> The top entries from tsdb analyze are right at the bottom of this
>> > >> > >>> post. The "most common label pairs" entries have alarmingly high
>> > >> > >>> counts. I wonder if this contributes to the high "total label pairs"
>> > >> > >>> and consequently higher than expected RAM usage.
>> > >> > >>>
>> > >> > >>> When calculating the expected RAM usage, is "total label pairs" the
>> > >> > >>> number we need to use rather than "total series"?
>> > >> > >>>
>> > >> > >>> Thanks,
>> > >> > >>> Victor
>> > >> > >>>
>> > >> > >>>
>> > >> > >>> Label pairs most involved in churning:
>> > >> > >>> 296 activity_type=none
>> > >> > >>> 258 workflow_type=PodUpdateWorkflow
>> > >> > >>> 163 __name__=temporal_request_latency_bucket
>> > >> > >>> 104 workflow_type=GenerateSPVarsWorkflow
>> > >> > >>> 95 operation=RespondActivityTaskCompleted
>> > >> > >>> 89 __name__=temporal_activity_execution_latency_bucket
>> > >> > >>> 89 __name__=temporal_activity_schedule_to_start_latency_bucket
>> > >> > >>> 65 workflow_type=PodInitWorkflow
>> > >> > >>> 53 operation=RespondWorkflowTaskCompleted
>> > >> > >>> 49 __name__=temporal_workflow_endtoend_latency_bucket
>> > >> > >>> 49 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>> > >> > >>> 49 __name__=temporal_workflow_task_execution_latency_bucket
>> > >> > >>> 49 __name__=temporal_workflow_task_replay_latency_bucket
>> > >> > >>> 39 activity_type=UpdatePodConnectionsActivity
>> > >> > >>> 38 le=+Inf
>> > >> > >>> 38 le=0.02
>> > >> > >>> 38 le=0.1
>> > >> > >>> 38 le=0.001
>> > >> > >>> 38 activity_type=GenerateSPVarsActivity
>> > >> > >>> 38 le=5
>> > >> > >>>
>> > >> > >>> Label names most involved in churning:
>> > >> > >>> 734 __name__
>> > >> > >>> 734 job
>> > >> > >>> 724 instance
>> > >> > >>> 577 activity_type
>> > >> > >>> 577 workflow_type
>> > >> > >>> 541 le
>> > >> > >>> 177 operation
>> > >> > >>> 95 datname
>> > >> > >>> 53 datid
>> > >> > >>> 31 mode
>> > >> > >>> 29 namespace
>> > >> > >>> 21 state
>> > >> > >>> 12 quantile
>> > >> > >>> 11 container
>> > >> > >>> 11 service
>> > >> > >>> 11 pod
>> > >> > >>> 11 endpoint
>> > >> > >>> 10 scrape_job
>> > >> > >>> 4 alertname
>> > >> > >>> 4 severity
>> > >> > >>>
>> > >> > >>> Most common label pairs:
>> > >> > >>> 23012 activity_type=none
>> > >> > >>> 20060 workflow_type=PodUpdateWorkflow
>> > >> > >>> 12712 __name__=temporal_request_latency_bucket
>> > >> > >>> 8092 workflow_type=GenerateSPVarsWorkflow
>> > >> > >>> 7440 operation=RespondActivityTaskCompleted
>> > >> > >>> 6944 __name__=temporal_activity_execution_latency_bucket
>> > >> > >>> 6944 __name__=temporal_activity_schedule_to_start_latency_bucket
>> > >> > >>> 5100 workflow_type=PodInitWorkflow
>> > >> > >>> 4140 operation=RespondWorkflowTaskCompleted
>> > >> > >>> 3864 __name__=temporal_workflow_task_replay_latency_bucket
>> > >> > >>> 3864 __name__=temporal_workflow_endtoend_latency_bucket
>> > >> > >>> 3864 __name__=temporal_workflow_task_schedule_to_start_latency_bucket
>> > >> > >>> 3864 __name__=temporal_workflow_task_execution_latency_bucket
>> > >> > >>> 3080 activity_type=UpdatePodConnectionsActivity
>> > >> > >>> 3004 le=0.5
>> > >> > >>> 3004 le=0.01
>> > >> > >>> 3004 le=0.1
>> > >> > >>> 3004 le=1
>> > >> > >>> 3004 le=0.001
>> > >> > >>> 3004 le=0.002
>> > >> > >>>
>> > >> > >>> Label names with highest cumulative label value length:
>> > >> > >>> 8312 scrape_job
>> > >> > >>> 4279 workflow_type
>> > >> > >>> 3994 rule_group
>> > >> > >>> 2614 __name__
>> > >> > >>> 2478 instance
>> > >> > >>> 1564 job
>> > >> > >>> 434 datname
>> > >> > >>> 248 activity_type
>> > >> > >>> 139 mode
>> > >> > >>> 128 operation
>> > >> > >>> 109 version
>> > >> > >>> 97 pod
>> > >> > >>> 88 state
>> > >> > >>> 68 service
>> > >> > >>> 45 le
>> > >> > >>> 44 namespace
>> > >> > >>> 43 slice
>> > >> > >>> 31 container
>> > >> > >>> 28 quantile
>> > >> > >>> 18 alertname
>> > >> > >>>
>> > >> > >>> Highest cardinality labels:
>> > >> > >>> 138 instance
>> > >> > >>> 138 scrape_job
>> > >> > >>> 84 __name__
>> > >> > >>> 75 workflow_type
>> > >> > >>> 71 datname
>> > >> > >>> 70 job
>> > >> > >>> 19 rule_group
>> > >> > >>> 14 le
>> > >> > >>> 10 activity_type
>> > >> > >>> 9 mode
>> > >> > >>> 9 quantile
>> > >> > >>> 6 state
>> > >> > >>> 6 operation
>> > >> > >>> 5 datid
>> > >> > >>> 4 slice
>> > >> > >>> 2 container
>> > >> > >>> 2 pod
>> > >> > >>> 2 alertname
>> > >> > >>> 2 version
>> > >> > >>> 2 service
>> > >> > >>>
>> > >> > >>> Highest cardinality metric names:
>> > >> > >>> 12712 temporal_request_latency_bucket
>> > >> > >>> 6944 temporal_activity_execution_latency_bucket
>> > >> > >>> 6944 temporal_activity_schedule_to_start_latency_bucket
>> > >> > >>> 3864 temporal_workflow_task_schedule_to_start_latency_bucket
>> > >> > >>> 3864 temporal_workflow_task_replay_latency_bucket
>> > >> > >>> 3864 temporal_workflow_task_execution_latency_bucket
>> > >> > >>> 3864 temporal_workflow_endtoend_latency_bucket
>> > >> > >>> 2448 pg_locks_count
>> > >> > >>> 1632 pg_stat_activity_count
>> > >> > >>> 908 temporal_request
>> > >> > >>> 690 prometheus_target_sync_length_seconds
>> > >> > >>> 496 temporal_activity_execution_latency_count
>> > >> > >>> 350 go_gc_duration_seconds
>> > >> > >>> 340 pg_stat_database_tup_inserted
>> > >> > >>> 340 pg_stat_database_temp_bytes
>> > >> > >>> 340 pg_stat_database_xact_commit
>> > >> > >>> 340 pg_stat_database_xact_rollback
>> > >> > >>> 340 pg_stat_database_tup_updated
>> > >> > >>> 340 pg_stat_database_deadlocks
>> > >> > >>> 340 pg_stat_database_tup_returned

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/1ed39f61-0c8d-4926-af6a-84adcec8f35bn%40googlegroups.com.
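
The "Highest cardinality metric names" counts in the tsdb analyze output quoted in this thread can be summed to see how much of the analysed block the Temporal histogram buckets account for. This is only a sketch over the numbers copied from the post. The dictionary and variable names are illustrative, and the result says nothing by itself about what causes the RAM usage.

```python
# Series counts for the temporal_*_bucket metrics, copied from the
# "Highest cardinality metric names" section of the tsdb analyze output.
bucket_series = {
    "temporal_request_latency_bucket": 12_712,
    "temporal_activity_execution_latency_bucket": 6_944,
    "temporal_activity_schedule_to_start_latency_bucket": 6_944,
    "temporal_workflow_task_schedule_to_start_latency_bucket": 3_864,
    "temporal_workflow_task_replay_latency_bucket": 3_864,
    "temporal_workflow_task_execution_latency_bucket": 3_864,
    "temporal_workflow_endtoend_latency_bucket": 3_864,
}

block_series = 56_470  # "Series" reported for the analysed block

# Share of all series in the block that are Temporal histogram buckets.
total_bucket_series = sum(bucket_series.values())
bucket_share = total_bucket_series / block_series

print(
    f"{total_bucket_series} of {block_series} series "
    f"({bucket_share:.0%}) are temporal_*_bucket series"
)
```

With the figures above, roughly three quarters of the series in the block are histogram buckets from seven Temporal metrics, so these may be a reasonable first place to look when reducing cardinality, although the thread itself does not settle the cause of the high memory usage.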

