[prometheus-users] Re: Calculating Cluster uptime % for two node cluster

Brian Candler Sat, 30 Jul 2022 00:37:27 -0700

On Saturday, 30 July 2022 at 04:21:09 UTC+1 [email protected] wrote:

> Using ranged vector I could derive something like this to calculate uptime 
> for an individual node.
>
> sum_over_time((platform_uptime_state{node_id ="101"})[1h:15s]) / 
> count_over_time((platform_uptime_state{node_id ="101"})[1h:15s])
>


Aside: avg_over_time is simpler.

Also in this particular case, there's no need for a subquery. A native 
range vector will work:

avg_over_time(platform_uptime_state{node_id ="101"}[1h])

There is a subtle difference: this doesn't resample the metric at 15 second 
intervals, but just takes all the existing data points in the timeseries 
over that range, with whatever timestamps they were recorded at.
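
One way to see the difference is to count the points each form feeds into 
the function. This is only a sketch reusing the selector from above; the 
first number depends on your scrape interval, which I'm not assuming 
anything about:

```promql
# Raw samples actually stored in the last hour (depends on scrape interval)
count_over_time(platform_uptime_state{node_id="101"}[1h])

# Points produced by re-evaluating the selector every 15s over the last hour
# (about 3600/15 = 240, as long as the series is present throughout)
count_over_time(platform_uptime_state{node_id="101"}[1h:15s])
```

If the two counts differ a lot, the two averages are weighting the hour 
differently.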

But you would need a subquery if the expression is any more complex than 
just a plain metric, as you'll see shortly.


> But here's the formula, I'm trying to implement: 
>
> *count of time series when the cluster has at least 1 node up over 1d(eg. 
> sum by (cluster_id) (platform_uptime_state) == 0)/count of total cluster ts 
> over 1d*
> But this doesn't work, 
>

In what way "doesn't it work"? What output do you get? What happens if you 
try graphing the numerator and denominator separately? Or is the problem 
that you don't know what to put for "count of total cluster ts over 1d"?

I suggest you build the query up in stages, testing the query so far in the 
Prometheus web interface at each stage. If you're trying to detect when the 
cluster has *at least* one node up, then I'd start with a query like this:

max(platform_uptime_state)

That gives a single value across all nodes.  If you have multiple clusters, 
then it would be:

max by (cluster_id) (platform_uptime_state)

Graph that. Check that it gives one value per cluster, and the value is 
0 when all nodes in that cluster are down and 1 when at least 1 node is up.

Once you're happy with that, try a subquery to evaluate this expression 
multiple times over the previous hour:

(max by (cluster_id) (platform_uptime_state))[1h:15s]

Does that work? (Note: the result is a range vector and the "graph" view in 
the web interface can't show this, but the "table" view will show the data 
points)

Now try adding up those points over the hour:

sum_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])

Is that correct? The result is an instant vector so you should be able to 
graph this one.  At each point in the graph, it shows the result for the 
time from T-1h to T.

Now you know that 3600/15 = 240, so you could divide by 240, but it's 
simpler to change to

avg_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])

If that doesn't produce what you're looking for, then you can still follow 
the same logical process to end up with an expression which does.
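
For the one-day percentage in the original question, the same pattern might 
be stretched as below. Treat it as an untested sketch: the 1d range comes 
from the question, the factor of 100 is my addition, and it relies on 
platform_uptime_state being 1 for up and 0 for down as checked earlier.

```promql
# Percentage of 15s evaluation steps in the last day at which
# at least one node in the cluster reported up (value 1)
100 * avg_over_time((max by (cluster_id) (platform_uptime_state))[1d:15s])
```

A 1d:15s subquery evaluates the inner expression 5760 times, so a coarser 
step such as 1m may be worth trying if the query turns out to be slow.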

