Re: [prometheus-users] Extracting long queries from multiple histograms

Julius Volz Fri, 22 Apr 2022 02:23:27 -0700

On Fri, Apr 22, 2022 at 12:50 AM Victor Sudakov <[email protected]> wrote:


> Julius Volz wrote:
>
> [dd]
> > >
> > > The query `app1_response_duration_bucket{le="0.75"}` will return a
> > > list of endpoints which have responded faster than 0.75s.
> > >
> >
> > This is not quite correct - this query gives you the le="0.75" bucket
> > counter for *all* endpoints,
>
> OK, I stand corrected.
>
> > and the value of each bucket counter tells you
> > how many requests that endpoint has handled that completed within 0.75s
> > since the exposing process started tracking things.
>
> What if I want to see how many requests each endpoint has handled that
> DID NOT complete within 0.75s since the exposing process started
> tracking things?
>

Then you could subtract the le="0.75" bucket from the total count (which is
available both in the _bucket{le="+Inf"} series and in the _count series of
the histogram):

----------
  app1_response_duration_bucket{le="+Inf"}
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}
----------

The "ignoring(le)" tells the binary operator to ignore the "le" label for
vector matching, since it will always be different on both sides.
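
As a sketch of the _count variant mentioned above (this assumes the
histogram exposes an app1_response_duration_count series, as the standard
client libraries do; that series has no "le" label at all, so "ignoring(le)"
is still needed for the two sides to match):

----------
  # Total requests per endpoint minus those that completed within 0.75s.
  app1_response_duration_count
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}
----------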

And then you could also add a filter to only show outputs with >0 requests:

----------
  app1_response_duration_bucket{le="+Inf"}
- ignoring(le)
  app1_response_duration_bucket{le="0.75"}
 > 0
----------

BUT: It's important to note that operating on a raw histogram counter like
this is not recommended, as the counts will totally depend on when the
process started handling & tracking requests (e.g. 5 minutes ago vs. 2
months ago). You most likely will want to at least wrap rate() or
increase() around the histogram counters to only consider the behavior of
the histogram counters within a defined time range like the last 5 minutes,
last 1h, etc.:

----------
  rate(app1_response_duration_bucket{le="+Inf"}[5m])
- ignoring(le)
  rate(app1_response_duration_bucket{le="0.75"}[5m])
 > 0
----------

The above would give you the per-second rate of slow requests for any
endpoints that received any slow requests within the last 5m. Use
increase() instead of rate() if you want absolute vs. per-second numbers.
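
For the absolute variant, a sketch with increase() would look like the
following. Note that increase() extrapolates to the edges of the window, so
the results can be fractional rather than whole request counts:

----------
  # Approximate number of requests slower than 0.75s over the last 5m.
  increase(app1_response_duration_bucket{le="+Inf"}[5m])
- ignoring(le)
  increase(app1_response_duration_bucket{le="0.75"}[5m])
 > 0
----------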

>
> >
> > > How do I invert the "le" and find the endpoints slower than "le"?
> > >
> >
> > Hmm, histograms are usually used to tell you about the *distribution* of
> > request latencies to a given endpoint (or other label combination). So
> > it's unclear what you mean with an endpoint being slower than some "le"
> > value.
>
> Please see above.
>
> > Do you want to find out whether some endpoint has handled any requests
> > *at all* that took longer than some duration? Or only if that happened
> > in the last X amount of time?
>
> Yes, I think I can put it like this. I would like to be informed if any
> endpoint has become "slow" and the details may vary.
>
>
> > Or only if a certain percentage of requests were too
> > slow?
> >
> > One thing people frequently do is to calculate percentiles / quantiles
> > from a histogram, for example:
> >
> >     histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))
> >
> > ...would tell you the approximated 90th percentile latency in seconds as
> > averaged over a moving 5-minute window for a given label combination,
> > which you can then combine with a filter operator to find slow endpoints
> > (e.g. "... > 10" would give you those endpoints that have a 90th
> > percentile latency above 10s).
>
> I've tried to graph "histogram_quantile(0.9,
> rate(app1_response_duration_bucket[5m])) > 3"
> but the result is very hard to interpret visually. It almost makes no
> sense.
>
> It's slightly more understandable as a table/list.
>

Yes, queries that include filtering constructs usually look weird in graphs
because the filter criterion might be true at some time steps in the graph,
but not in others, so you can get graphs with many short intermittent
series. Filters are more commonly used for alerting / table queries.
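
For the "certain percentage of requests were too slow" case, a filter like
the following is a better fit for an alerting rule or a table than for a
graph. This is only a sketch: the 5% threshold is an arbitrary example
value, and it assumes the _count series is exposed alongside the buckets:

----------
  # Fraction of requests over the last 5m that took longer than 0.75s,
  # kept only for endpoints where that fraction exceeds 5% (example value).
  (
      1
    -
      (
          rate(app1_response_duration_bucket{le="0.75"}[5m])
        / ignoring(le)
          rate(app1_response_duration_count[5m])
      )
  )
> 0.05
----------

Endpoints that received no requests in the window produce NaN from the
division and are dropped by the comparison.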


> --
> Victor Sudakov VAS4-RIPE
> http://vas.tomsk.ru/
> 2:5005/49@fidonet
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/YmHfziweOcQGpIjh%40admin.sibptus.ru
> .
>


-- 
Julius Volz
PromLabs - promlabs.com

