Hello Brian,

Sorry for the late response.
I already have plenty of dashboards in Grafana for various parts of our 
infrastructure; alerts and thresholds work well, and having an actual 
value helps us find the source of our problems, as you say. However, the 
particular dashboard I'm crafting is aimed at executives and other 
partners who demand an availability counter for our infrastructure as a 
whole.
So for this particular dashboard, the question is not "is something broken 
and why is it broken?" but just "is everything working and if not, what 
broke and when?".
I should have made it a bit clearer, sorry.

The few queries you gave me helped me a lot actually! I had never used the 
bool modifier in my queries until you mentioned it.
So now I use home-made recording rules for the various parts of the 
infrastructure, mainly containing min/max/max_over_time/bool and a few 
conditions. I get a nice load of 0s and 1s everywhere and it's very easy 
now to get a global % of availability for a period of time.
The state timeline panel in Grafana is also very useful.
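
For illustration, here is a minimal sketch of what one such rule and the 
availability query could look like. The job label "myservice" and the rule 
name service:cluster_up:bool are made-up placeholders, not my real ones, 
and the "at most one failed server" condition is just the cluster example 
from my earlier message.

```promql
# Expression for a hypothetical recording rule named service:cluster_up:bool.
# count() returns nothing when no server is down, so "or vector(0)" fills in 0.
# Result: 1 while at most one server is down, 0 otherwise.
(count(up{job="myservice"} == 0) or vector(0)) <= bool 1

# Global availability in percent over the last 30 days,
# i.e. the average of the recorded 0s and 1s times 100.
avg_over_time(service:cluster_up:bool[30d]) * 100
```

The same 0/1 series can feed the state timeline panel directly, while the 
avg_over_time query gives the single percentage for the period.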

Thanks for your help Brian :)
On Thursday, 18 November 2021 at 09:57:37 UTC+1, Brian Candler wrote:

> You're probably looking at it the wrong way, and I expect you should 
> configure Grafana to visualise correctly the response you have.
>
> You can display or not display something in Grafana based on 
> presence/absence of any value.  However usually it's more useful to *see* the 
> actual failing value, because an indication of just "not healthy" doesn't 
> give you any clue to help debug the problem.  One thing you can do in 
> Grafana is to set thresholds and colours: e.g. display green if the value 
> is between 0 and 5, amber if 5 to 10, red if 10 or higher.  That's often 
> much more useful (except for users with colour blindness who may need 
> additional cues).
>
> However, you *can* also frig the queries in PromQL if required.  Since 
> you don't give the actual queries, I can only talk in general terms.
>
> foo < 1
> # gives you the value of foo if it's less than 1, and no value if
> # foo >= 1.
>
> (foo < 1) * 0
> # will always give you a value of 0 if foo < 1, or no value if foo >= 1
>
> foo < bool 1
> # will always give you a value: 1 if foo < 1, 0 if foo >= 1
>
> > For example, I might have a cluster where one of the servers can fail 
> > and still display an available service (and a result of 1 for my query), 
> > but having 2 failed servers would get me a result of "0" for my query.
>
> I would be inclined to make a query to count "number of failed servers", and 
> set a display threshold on this.  Then the dashboard won't say "too many 
> failed servers!", it will say "2 failed servers!"
>
