I also note that you're only sampling periodically whether something is in 
state 0, 1 or 2: you don't really know what happened in between those 
samples, so you're never going to get a truly accurate value for how long 
it was in each state. For example, it could flip from 1 to 2 and back to 1 
between scrapes.

If you want *really* accurate answers for how long your application has 
been in state 0, 1 or 2 then you need to reinstrument it with metrics which 
accumulate time in each state:

application_state_seconds_total{state="0"} xxx
application_state_seconds_total{state="1"} xxx
application_state_seconds_total{state="2"} xxx

Every time your application changes state, you add the number of seconds it 
spent in that state to the total.  In case it stays in the same state for a 
long time, you also do this periodically: e.g. if the application has 
remained in the same state for 1 second, you add 1 second to the 
metric and subtract 1 second from the time you're accumulating.
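In application code that bookkeeping might look like the following sketch (the `StateTimer` class, its `flush()` method and the plain dict counter are illustrative names of my own; a real exporter would publish `seconds_total` through the Counter type of your Prometheus client library):

```python
import time
from collections import defaultdict

class StateTimer:
    """Accumulate wall-clock seconds spent in each state (sketch)."""

    def __init__(self, initial_state, clock=time.monotonic):
        self._clock = clock
        self._state = initial_state
        self._entered = clock()
        # What you'd export as application_state_seconds_total{state=...}
        self.seconds_total = defaultdict(float)

    def flush(self):
        """Move time accumulated so far into the counter.

        Call this periodically (and just before each scrape) so a long
        stay in one state still shows up in the metric.
        """
        now = self._clock()
        self.seconds_total[self._state] += now - self._entered
        self._entered = now

    def set_state(self, new_state):
        """Record a state change, crediting the old state first."""
        self.flush()
        self._state = new_state
```

Because the counter only ever goes up, `increase()` and `rate()` work on it in the usual way, and restarts are handled like any other counter reset.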

This is basically how many of the disk I/O metrics 
<https://brian-candler.medium.com/interpreting-prometheus-metrics-for-linux-disk-i-o-utilization-4db53dfedcfc> 
work.

Then:
    increase(application_state_seconds_total{state="1"}[24h])
will give you a *very* accurate estimate of how long the application was in 
that state, even if it switches many times within the same second.
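For the exact-answer approach in the quoted message below (sending foo[24h] to the instant-query endpoint and filtering client-side), the counting might look like this sketch. The JSON shape matches the HTTP API's "matrix" result; the `seconds_in_state` helper and its assumption that each sample holds until the next sample's timestamp are mine:

```python
import json

def seconds_in_state(values, target="1"):
    """Sum time spent at `target` over raw [timestamp, value] pairs
    from a range-vector query (resultType "matrix").

    Assumption: each sample holds from its own timestamp until the
    next sample's timestamp; the final sample contributes nothing,
    since we don't know how long it persisted.
    """
    total = 0.0
    for (ts, val), (next_ts, _) in zip(values, values[1:]):
        if val == target:
            total += next_ts - ts
    return total

# Shape of the response to GET /api/v1/query?query=foo[24h]
response = json.loads("""
{"status":"success","data":{"resultType":"matrix","result":[
  {"metric":{"__name__":"foo"},
   "values":[[1000,"0"],[1060,"1"],[1120,"1"],[1180,"2"]]}
]}}
""")
values = response["data"]["result"][0]["values"]
print(seconds_in_state(values))  # 120.0: foo was 1 from t=1060 to t=1180
```

Note the values come back as strings, which is why the comparison is against "1" and not 1.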

On Wednesday, 22 March 2023 at 08:06:23 UTC Brian Candler wrote:

> It's not easy to do exactly.
>
> To get a rough answer, you can do a subquery 
> <https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery>:  
> (foo == 1)[24h:1m] will resample the timeseries at 1 minute intervals, then 
> you can wrap that with count_over_time, giving:
>     count_over_time((foo == 1)[24h:1m])
>
> But if you weren't scraping at exactly 1 minute intervals, the count may 
> not be accurate. Also if there are missed samples, the value of foo at time 
> T will look back for the previous value (up to 5 minutes by default), which 
> means in that situation some samples may be double-counted (in effect, 
> assuming the metric value remained constant over that time, when you don't 
> actually know what value it had).
>
> The only way I know to get an exact answer is to send the range vector 
> query "foo[24h]" to the *instant* query endpoint 
> <https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries>, 
> then filter and count the samples client-side.  A range vector like that 
> gives the raw values with their raw timestamps as stored in the TSDB.
>
> For this use case it would be nice if Prometheus were to allow certain 
> operators to work directly on range vectors, so you could write
>     foo[24h] == 1
> But that would add quite a lot of complexity into the semantics of the 
> query language, which already has to consider argument combinations for 
> (scalar, scalar), (scalar, instant vector) and (instant vector, instant 
> vector).
>
> On Wednesday, 22 March 2023 at 05:19:01 UTC BHARATH KUMAR wrote:
>
>> Hello All,
>>
>> I have a Prometheus metric that will give output as 0, 1 or 2. It can 
>> be anything (0, 1 or 2). Could you tell me the number of 1's that occurred 
>> in the last 24 hours?
>>
>> I tried count_over_time, but I am getting errors. I tried 
>> sum_over_time, but it is not working for a few test cases.
>>
>> Any lead?
>>
>> I really appreciate any help you can provide.
>>
>> Thanks & regards,
>> Bharath Kumar
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c5acfb65-8214-4b01-9ee8-ce7aa96e7c83n%40googlegroups.com.