On Saturday, 30 July 2022 at 04:21:09 UTC+1 [email protected] wrote:
> Using ranged vector I could derive something like this to calculate uptime
> for an individual node.
>
> sum_over_time((platform_uptime_state{node_id ="101"})[1h:15s]) /
> count_over_time((platform_uptime_state{node_id ="101"})[1h:15s])
>
Aside: avg_over_time is simpler.
Also in this particular case, there's no need for a subquery. A native
range vector will work:
avg_over_time(platform_uptime_state{node_id ="101"}[1h])
There is a subtle difference: this doesn't resample the metric at 15 second
intervals, but just takes all the existing data points in the timeseries
over that range, with whatever timestamps they were recorded at.
But you would need a subquery if the expression is any more complex than
just a plain metric, as you'll see shortly.
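To make that difference concrete, here is a rough Python sketch (not how Prometheus is implemented, just the arithmetic). The sample timestamps and values are made up, and the resampling ignores Prometheus' lookback and staleness handling: it simply takes the most recent sample at or before each step.

```python
from bisect import bisect_right

def avg_raw(samples, start, end):
    """Mean of the raw samples with start < t <= end (like a native range vector)."""
    vals = [v for t, v in samples if start < t <= end]
    return sum(vals) / len(vals)

def avg_resampled(samples, start, end, step):
    """Mean of the series re-evaluated every `step` seconds (like a subquery)."""
    times = [t for t, _ in samples]
    vals = []
    t = start + step
    while t <= end:
        i = bisect_right(times, t)  # index just past the latest sample at or before t
        if i:
            vals.append(samples[i - 1][1])
        t += step
    return sum(vals) / len(vals)

# Hypothetical series: (timestamp in seconds, 0/1 state), irregularly spaced.
samples = [(0, 1), (10, 1), (20, 0), (60, 1)]

print(avg_raw(samples, 0, 60))            # averages the 3 raw points in the window
print(avg_resampled(samples, 0, 60, 15))  # averages 4 evenly spaced evaluations
```

With irregular timestamps the two averages differ, because the raw version weights each stored point equally while the resampled version weights each 15-second step equally.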
> But here's the formula, I'm trying to implement:
>
> *count of time series when the cluster has at least 1 node up over 1d(eg.
> sum by (cluster_id) (platform_uptime_state) == 0)/count of total cluster ts
> over 1d*
> But this doesn't work,
>
In what way does it "not work"? What output do you get? What happens if you
try graphing the numerator and denominator separately? Or is the problem
that you don't know what to put for "count of total cluster ts over 1d"?
I suggest you build the query up in stages, testing the query so far in the
Prometheus web interface at each stage. If you're trying to detect when the
cluster has *at least* one node up, then I'd start with a query like this:
max(platform_uptime_state)
That gives a single value across all nodes. If you have multiple clusters,
then it would be:
max by (cluster_id) (platform_uptime_state)
Graph that. Check that it gives one value per cluster, and the value is
0 when all nodes in that cluster are down and 1 when at least 1 node is up.
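If it helps to see what that aggregation is doing, here is a small Python sketch of the same idea. The cluster and node IDs are invented for illustration, and it assumes the metric only ever takes the values 0 and 1.

```python
from collections import defaultdict

def max_by_cluster(node_states):
    """Collapse per-node 0/1 states into one value per cluster, taking the max."""
    result = defaultdict(int)  # starts at 0, which is fine for 0/1 states
    for (cluster_id, node_id), state in node_states.items():
        result[cluster_id] = max(result[cluster_id], state)
    return dict(result)

# Hypothetical snapshot: cluster "a" has one node up, cluster "b" has none.
states = {
    ("a", "101"): 0,
    ("a", "102"): 1,
    ("b", "201"): 0,
    ("b", "202"): 0,
}

print(max_by_cluster(states))  # {'a': 1, 'b': 0}
```

The max is 1 as soon as any node in the cluster reports 1, which is exactly the "at least one node up" condition.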
Once you're happy with that, try a subquery to evaluate this expression
multiple times over the previous hour:
(max by (cluster_id) (platform_uptime_state))[1h:15s]
Does that work? (Note: the result is a range vector and the "graph" view in
the web interface can't show this, but the "table" view will show the data
points)
Now try adding up those points over the hour:
sum_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])
Is that correct? The result is an instant vector so you should be able to
graph this one. At each point in the graph, it shows the result for the
time from T-1h to T.
Now you know that 3600/15 = 240, so you could divide by 240, but it's
simpler to change to
avg_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])
If that doesn't produce what you're looking for, then you can still follow
the same logical process to end up with an expression which does.
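As a sanity check on the arithmetic, here is a short Python sketch comparing "sum divided by 240" with an average over the points actually present. The point lists are made up. The second case assumes some evaluation steps returned no data, in which case the average divides by the number of points that exist rather than by 240.

```python
def avg_over_time(points):
    """Mean of the points that are actually present in the window."""
    return sum(points) / len(points)

EXPECTED_POINTS = 3600 // 15  # 240 steps of 15s in one hour

# Hypothetical hour with no gaps: up for 45 minutes, down for 15.
full = [1] * 180 + [0] * 60
print(sum(full) / EXPECTED_POINTS, avg_over_time(full))

# Hypothetical hour where 40 steps returned no data at all.
gappy = [1] * 180 + [0] * 20
print(sum(gappy) / EXPECTED_POINTS, avg_over_time(gappy))
```

With a complete window the two agree. With gaps they diverge, so which one you want depends on whether missing data should count as "down" or be ignored.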