Hi all,
I have been using Prometheus for a few years with almost no issues, but
recently I have been seeing performance problems when running queries from
a Grafana dashboard.
The Prometheus container serves 12 hours of metrics without any issues,
but if I extend the query range to 24 hours it gets restarted with the
following output (sorry for the long trail...).
Please see the few manually run query metrics at the end of the logs for
reference.
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:851
fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:981 msg="Loading
configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:1012 msg="Completed
loading of configuration file" filename=/etc/prometheus/prometheus.yml
totalDuration=5.955772ms remote_storage=1.608µs web_handler=397ns
query_engine=1.065µs scrape=569.244µs scrape_sd=637.764µs notify=24.31µs
notify_sd=18.627µs rules=1.984671ms
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:796 msg="Server is
ready to receive web requests."
level=warn ts=2022-04-13T06:12:38.025Z caller=main.go:378
deprecation_notice="'storage.tsdb.retention' flag is deprecated use
'storage.tsdb.retention.time' instead."
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:443 msg="Starting
Prometheus" version="(version=2.28.1, branch=HEAD,
revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:448
build_context="(go=go1.16.5, user=xxxxx, date=20210701-15:20:10)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:449
host_details="(Linux 4.14.262-200.489.amzn2.x86_64 #1 SMP Fri Feb 4
20:34:30 UTC 2022 x86_64 (none))"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:450
fd_limits="(soft=32768, hard=65536)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:451
vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-04-13T06:12:38.026Z caller=query_logger.go:79
component=activeQueryTracker msg="These queries didn't finish in
prometheus' last run:"
queries="[{\"query\":\"count(proftpd_cpu_usage{process=~\\\".*proftpd.*\\\",job=\\\"sftp_top_prod\\\",app!=\\\"-c\\\",app!=\\\"\\u003cdefunct\\u003e\\\",app!=\\\"connected:\\\"})
by (app) \\u003e
5\",\"timestamp_sec\":1649830310},{\"query\":\"count(proftpd_cpu_usage{job=\\\"sftp_top_prod\\\",process=~\\\".*proftpd.*\\\",clientip!=\\\"-nd\\\",
app!=\\\"\\u003cdefunct\\u003e\\\"}) by (clientip,app) \\u003e
5\",\"timestamp_sec\":1649830316},{\"query\":\"count(proftpd_cpu_usage{process=~\\\".*proftpd.*\\\",job=\\\"sftp_top_prod\\\"})\",\"timestamp_sec\":1649830310},{\"query\":\"count(proftpd_cpu_usage{process=~\\\".*proftpd.*\\\",job=\\\"sftp_top_prod\\\"})
by
(instance)\",\"timestamp_sec\":1649830310},{\"query\":\"count(proftpd_cpu_usage{job=\\\"sftp_top_prod\\\",
process=~\\\".*proftpd.*\\\"}) by (clientip) \\u003e
5\",\"timestamp_sec\":1649830311}]"
level=info ts=2022-04-13T06:12:38.028Z caller=web.go:541 component=web
msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-04-13T06:12:38.029Z caller=main.go:824 msg="Starting
TSDB ..."
level=info ts=2022-04-13T06:12:38.031Z caller=tls_config.go:191
component=web msg="TLS is disabled." http2=false
level=info ts=2022-04-13T06:12:38.031Z caller=repair.go:57 component=tsdb
msg="Found healthy block" mint=1641708000000 maxt=1642291200000
ulid=01FSGDEHH869s5PYJ3HQZ8GEVNx5
level=info ts=2022-04-13T06:12:38.032Z caller=repair.go:57 component=tsdb
msg="Found healthy block" mint=1642291200000 maxt=1642874400000
ulid=01FT1SMMQZNXxAAEJS6NPDD32Sx3
.....
level=warn ts=2022-04-13T06:12:38.048Z caller=db.go:676 component=tsdb
msg="A TSDB lockfile from a previous execution already existed. It was
replaced" file=/prometheus/lock
level=info ts=2022-04-13T06:12:40.121Z caller=head.go:780 component=tsdb
msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-04-13T06:12:43.525Z caller=head.go:794 component=tsdb
msg="On-disk memory mappable chunks replay completed" duration=3.4045956s
level=info ts=2022-04-13T06:12:43.526Z caller=head.go:800 component=tsdb
msg="Replaying WAL, this may take a while"
level=info ts=2022-04-13T06:12:48.512Z caller=head.go:826 component=tsdb
msg="WAL checkpoint loaded"
level=info ts=2022-04-13T06:12:49.978Z caller=head.go:854 component=tsdb
msg="WAL segment loaded" segment=133826 maxSegment=133854
........
level=info ts=2022-04-13T06:13:33.511Z caller=head.go:860 component=tsdb
msg="WAL replay completed" checkpoint_replay_duration=4.986518678s
wal_replay_duration=44.999189293s total_replay_duration=53.390369759s
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:851
fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:981 msg="Loading
configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:1012 msg="Completed
loading of configuration file" filename=/etc/prometheus/prometheus.yml
totalDuration=9.159204ms remote_storage=2.044µs web_handler=395ns
query_engine=1.047µs scrape=1.011248ms scrape_sd=2.974467ms notify=16.737µs
notify_sd=9.337µs rules=2.452176ms
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:796 msg="Server is
ready to receive web requests."
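As an aside, the warning near the top of the log flags my retention setting
as deprecated. The rename it suggests would look like this (the 15d value is
only an example, not my actual setting):

```shell
# Deprecated form flagged in the log above:
#   --storage.tsdb.retention=15d
# Suggested replacement (retention value is illustrative):
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d
```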
----
Query references:
prometheus_tsdb_data_replay_duration_seconds increases from about 25s to
around 50s during the failure.
prometheus_engine_queries goes from 0 to 5 during the search, and
prometheus_engine_queries_concurrent_max shows 20.
prometheus_engine_query_duration_seconds{instance="abc:80",
job="prometheus", quantile="0.99", slice="inner_eval"} 7.31
prometheus_engine_query_duration_seconds{instance="abc:80",
job="prometheus", quantile="0.99", slice="prepare_time"} 0.013609492
prometheus_engine_query_duration_seconds{instance="abc:80",
job="prometheus", quantile="0.99", slice="queue_time"} 0.000030745
prometheus_engine_query_duration_seconds{instance="prometheus.prod.datadelivery.info:80",
job="prometheus", quantile="0.99", slice="result_sort"} 0.000040619
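For reference, the duration quantiles above can also be pulled straight from
the HTTP API (the localhost address is a placeholder for my server):

```shell
# Illustrative: fetch the 0.99 engine-duration quantiles via the query API.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_engine_query_duration_seconds{quantile="0.99"}'
```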
I can see RAM running at full capacity, and the container gets restarted
after that.
The EBS volume I am using is formatted ext4; I presume this could be one
reason for the slow I/O, so I can try changing the volume type to gp2.
Increasing the RAM is not a feasible solution, as its utilization during
normal operation does not even reach half of its total capacity (30 GB).
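For what it's worth, I have not tuned any of the engine's query limits, so I
believe they are at their defaults; the invocation below is illustrative
(not my actual command line) with the documented default values:

```shell
# Illustrative: query-cost limits that bound how much memory a single
# heavy dashboard query can consume (values shown are the defaults).
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --query.timeout=2m \
  --query.max-concurrency=20 \
  --query.max-samples=50000000
```

The concurrency default of 20 matches the prometheus_engine_queries_concurrent_max
value I see above.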
Appreciate your quick advice.
Let me know if any further details are needed.
Regards,
Naveen.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/b0943227-71e2-49dc-a39f-89ac70f206edn%40googlegroups.com.