Hello Julius,

Thank you very much for your reply. When I previously read about the lookback delta, I assumed that the graph would also reflect missing results, i.e. that a missing datapoint would appear as a gap in the graph. Since the graph showed an uninterrupted line, I did not think to check this further. I see now that I could be wrong. I also had not considered using the "min_over_time" expression before, but it looks useful.
I will definitely try the suggested changes. Thank you again. Have a great day.

On Tuesday, June 20, 2023 at 1:28:54 PM UTC+3 Julius Volz wrote:

> Hi Lena,
>
> One thing I see is that your scrape interval is very long: 300s, which is exactly 5 minutes. The lookback delta of an instant vector selector is also exactly 5 minutes (see https://www.youtube.com/watch?v=xIAEEQwUBXQ&t=272s), which means that the selector will stop returning a result if there is ever a case where there is no datapoint at least 5 minutes prior to the current rule evaluation timestamp. That would reset the "for" duration again. With a 5-minute scrape interval, that can indeed happen to you at times (either just a bit of a delay in scraping or in ingesting scraped samples, or even an occasional failed scrape). I'd recommend setting the interval short enough that you can tolerate an occasional failed scrape (like 2m). Does the problem go away with a shorter interval?
>
> By the way: 24h is quite a long "for" duration. If the series is ever absent for an even longer period during those 24h (like if the exporter is down for a couple of minutes), your alerts will always reset again. An alternative could be to alert on an expression like "min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024" with a much shorter "for" duration. But some "for" duration is still a good idea, in the case of a fresh Prometheus server that doesn't have 24h of data yet. That way, the alert would become less reliant on perfect scrape / exporter behavior over a full 24h window.
>
> Regards,
> Julius
>
> On Tue, Jun 20, 2023 at 10:24 AM Lena <[email protected]> wrote:
>
>> Hello,
>> I hope you can help me with an issue I am facing.
>> I use disk_usage_exporter <https://github.com/dundee/disk_usage_exporter/> to get metrics about database sizes. The metrics are gathered by Prometheus every 5 minutes.
>> The servicemonitor configuration is:
>>
>>   - interval: 300s
>>     metricRelabelings:
>>     - action: replace
>>       regex: node_disk_usage_bytes
>>       replacement: database_disk_usage_bytes
>>       sourceLabels:
>>       - __name__
>>       targetLabel: __name__
>>     path: /metrics
>>     port: disk-exporter
>>     relabelings:
>>     - action: replace
>>       regex: (.+)-mysql-slave
>>       replacement: $1
>>       sourceLabels:
>>       - service
>>       targetLabel: cluster
>>     scrapeTimeout: 120s
>>
>> Then I have an alert to notify if a database has had a size of more than 15GB for 24 hours:
>>
>>   - alert: MySQLDatabaseSize
>>     expr: database_disk_usage_bytes > 15 * 1024 * 1024 * 1024
>>     for: 24h
>>     labels:
>>       severity: warning
>>     annotations:
>>       dashboard: database-disk-usage?var-cluster={{ $labels.cluster }}
>>       description: MySQL database `{{ $labels.path | reReplaceAll "/var/lib/mysql/" "" }}` takes `{{ $value | humanize }}` of disk space on pod `{{ $labels.pod }}`
>>       summary: MySQL database has grown too big.
>>
>> On the testing environment the alert fires properly. However, on the production environment it never fires and stays stuck in the Pending state, as the `Active Since` time is updated every ~5 min.
>> The only difference between the environments is the number of databases in the cluster.
>> Below are screenshots of the `Active Since` time; you can see that the time changes:
>> [image: active_since1.png] [image: active_since2.png]
>> The metric labels are not changing. The graph is stable, so there are no missed metrics or gaps where the database size is not defined.
>> [image: graph.png]
>>
>> A scrape takes ~20-40 sec, which is still within scrapeTimeout: 120 sec.
>> Rule evaluation takes 1-2 sec with evaluation_interval: 30 sec.
>> Prometheus version is 2.22.1.
>> I see no related errors in the Prometheus logs and have no clue what the reason for this issue could be.
>> Thank you for any advice.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/4a53fee8-f73f-452e-af2c-f903d6fb8215n%40googlegroups.com.

> --
> Julius Volz
> PromLabs - promlabs.com
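For readers following the thread, Julius's two suggestions could be combined roughly as below. This is only a sketch: the 2m interval comes from his email, but the 90s timeout, the 10m "for" duration, and the rule layout are illustrative assumptions, not values from the thread.

```yaml
# ServiceMonitor endpoint (sketch): scrape often enough that one delayed or
# failed scrape still leaves a sample within the 5m lookback delta.
- interval: 2m
  scrapeTimeout: 90s   # assumption; the timeout must not exceed the interval
  path: /metrics
  port: disk-exporter

# Alerting rule (sketch): min_over_time only exceeds the threshold if the
# size stayed above 15GB for the whole 24h window, so a short gap in the
# series no longer resets the pending state. A short "for" still guards
# against a fresh Prometheus server that lacks 24h of data.
- alert: MySQLDatabaseSize
  expr: min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024
  for: 10m   # assumption; "much shorter" than the original 24h
  labels:
    severity: warning
```

Note that with the original settings, scrapeTimeout: 120s was valid against a 300s interval, but would have to shrink alongside any shorter interval, since Prometheus rejects a scrape timeout greater than the scrape interval.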

