Hello,

I have just submitted a proposal for a design (built on Alin's excellent 
work) that addresses the concerns discussed here. We have been running a 
fork that implements that design for about 6 months now with excellent 
results.

Issue: https://github.com/prometheus/prometheus/issues/12967
Design Proposal: 
https://docs.google.com/document/d/1CF5jhyxSD437c2aU2wHcvg88i8CjSPO3kMHsEaDRe2w/edit#heading=h.hzsa87ps5uhr

-Colin

On Monday, April 2, 2018 at 1:51:54 PM UTC-7 Alin Sînpălean wrote:

> [I'll give this a try, even though it is likely going to be marked as spam 
> and left as such.]
>
>
> On Saturday, March 31, 2018 at 8:27:00 AM UTC+2, Brian Brazil wrote:
>>
>> On 26 March 2018 at 14:38, Alin Sînpălean <[email protected]> wrote:
>>
>>> You can do one of two things. I am also of the opinion that 
>>> rate()/increase() should not extrapolate, but it doesn't look like that 
>>> will change anytime soon, so both of these are workarounds to current 
>>> Prometheus limitations.
>>>
>>>
>>>    1. Use foo - foo offset 1m instead of increase(foo[1m]). It will not 
>>>    take counter resets into account (you could handle those by evaluating 
>>>    at every collection interval and accounting for them, if you actually 
>>>    care about that) and will take twice as much CPU (two series lookups 
>>>    instead of one), but it will give you an accurate increase, with no 
>>>    extrapolation.
>>>
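The reset caveat in option 1 can be sketched with hypothetical counter samples (all numbers below are made up for illustration):

```python
# Hypothetical counter values, one sample per minute; the counter
# resets to zero between t=2 and t=3 (e.g. the process restarted).
samples = {0: 100, 1: 160, 2: 220, 3: 30, 4: 90}

def naive_increase(t):
    # Equivalent of `foo - foo offset 1m`: a plain subtraction.
    return samples[t] - samples[t - 1]

def reset_aware_increase(t):
    # Roughly what Prometheus' counter-reset handling does: a drop in
    # value is treated as a reset from zero.
    delta = samples[t] - samples[t - 1]
    return samples[t] if delta < 0 else delta

print(naive_increase(3))        # -190: the reset shows up as a negative spike
print(reset_aware_increase(3))  # 30
```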
>> This is incorrect; it'll not be accurate, as metrics can't be accurate. 
>> It's just a different, not 100% accurate, approximation.
>>
>
> Yes, it is not going to be perfectly accurate but, as the OP states -- 
> "able to prove this by [preventing] the extrapolation code from running" -- 
> it will do the job for them. "Accurate" was the term used by the OP, BTW, 
> to describe the results they got without increase() extrapolation.
>
>>
>>>    2. If you want to take advantage of Prometheus' counter reset 
>>>    handling, use increase(foo[70s]) * 60 / 70 wherever you would 
>>>    normally use increase(foo[60s]) (assuming a collection interval of 
>>>    10s). It basically computes the increase over 6 successive collections 
>>>    (7 successive points), then undoes the extrapolation. Ugly, and it 
>>>    requires you to take into account both collection and evaluation 
>>>    intervals (and hope they never change), but it works.
>>>
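The arithmetic behind the trick above, under the stated assumptions (10s scrapes, and window edges close enough to the outermost samples that extrapolation scales 60s of data up to the 70s window):

```python
# 7 hypothetical samples, 10s apart, counter rising by 1 per scrape.
scrape_interval = 10
samples = [(i * scrape_interval, float(i)) for i in range(7)]  # spans 60s

raw = samples[-1][1] - samples[0][1]          # 6.0, the real 60s increase
data_span = samples[-1][0] - samples[0][0]    # 60s of actual data
window = 70                                   # increase(foo[70s])

# Extrapolation scales the observed increase up to the full window...
extrapolated = raw * window / data_span       # 7.0
# ...and the trailing `* 60 / 70` factor undoes it.
corrected = extrapolated * 60 / 70            # back to 6.0
```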
>> This is not resilient to jitter, and is not a good approach. Generally 
>> this will overestimate by 16%, as you're multiplying by 1.16.
>>
>
> No, Prometheus in general is not resilient to jitter. The exception is 
> /range_query, which actually is resilient under the right conditions, 
> i.e. no rate()/increase() extrapolation.
>
> Prometheus could be resilient to (eval) jitter if it wanted to, e.g. by 
> delaying evaluation until all scrapes in progress were complete and then 
> running the evaluation similar to the way /range_query does it, at 
> exactly spaced intervals. But no one is asking for that here, AFAICT.
>
>>
>> As I said, if the OP wants an accurate result they need logs.
>>
>
> Umm, no. As the OP said, they want to prevent extrapolation to get 
> "accurate enough" results for their needs. They never said they need 
> perfect results.
>
> The only material difference between logs and metrics is that logs have 
> (in theory) infinite resolution, whereas metrics (in the Prometheus world) 
> have some fixed time resolution, decided ahead of time, plus scrape jitter. 
> But as long as you don't fail a large number of successive scrapes (which 
> is in many respects similar to a logs collector losing lots of log records 
> on the way) you are still able to compute an increase over some interval. 
> It may not be the exact interval you want (either because of scrape 
> resolution or because of missed scrapes) but an exact increase over some 
> interval it is. (In the logs case, if some log records go missing you can't 
> even get that.)
>
> In particular, if you do foo - foo offset 5m exactly every 5 minutes (the 
> way /range_query does) and you have at least one successful scrape every 5 
> minutes, you will get a perfectly accurate increase, which you can then 
> aggregate over time to get an accurate increase over e.g. 24 hours. It 
> won't handle task restarts perfectly, but neither will logs.
>
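The aggregation step works because non-overlapping, endpoint-aligned deltas telescope; a minimal sketch with made-up values (counter resets ignored for brevity):

```python
# One hypothetical counter sample per 5-minute evaluation point.
counter = [0, 3, 7, 7, 12, 20, 26]

# foo - foo offset 5m at each evaluation point:
deltas = [counter[i] - counter[i - 1] for i in range(1, len(counter))]

# Summing the exact 5-minute increases gives the exact total increase,
# because intermediate endpoints cancel out.
total = sum(deltas)
print(total)                     # 26, same as counter[-1] - counter[0]
```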
> If someone thinks that extrapolation is a problem then metrics cannot meet 
>> their use case, as scrapes won't be perfectly aligned with the data window 
>> of interest.
>>
>
> I am someone who thinks extrapolation is a problem while being sure 
> metrics can meet my use case, because it has nothing to do with perfectly 
> aligned scrapes. I wouldn't mind if (because of scrape interval jitter) I 
> ended up with a timeseries [0, 1, 2, 2, 4, 5] (instead of an ideal [0, 1, 
> 2, 3, 4, 5]) and a total increase of 5. I do mind that from this imperfect 
> timeseries Prometheus guesstimates an increase of 6, though. Or, to be more 
> precise, some random fraction between 5.0 and 6.0 (extrapolation to the 
> right but not to the left, due to the 0), depending solely on when the 
> kernel scheduler decides to schedule the evaluation.
>
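A simplified sketch of the extrapolation being objected to; this is not the exact upstream code, just the half-interval cap and zero-point limit as described above, on hypothetical samples:

```python
def extrapolated_increase(samples, range_start, range_end):
    """samples: (timestamp, value) pairs inside [range_start, range_end]."""
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    raw = last_v - first_v                  # counter resets ignored here
    span = last_t - first_t
    avg_interval = span / (len(samples) - 1)

    # Extend towards each window edge, but no further than half an
    # average scrape interval past the outermost sample.
    start_gap = min(first_t - range_start, avg_interval / 2)
    end_gap = min(range_end - last_t, avg_interval / 2)

    # Don't extrapolate left past the point where the counter would
    # have been zero -- the "not to the left, due to the 0" above.
    if raw > 0:
        start_gap = min(start_gap, span * first_v / raw)

    return raw * (span + start_gap + end_gap) / span

# Ideal samples [0..5] at 10s spacing inside a 60s window:
samples = [(5 + 10 * i, float(i)) for i in range(6)]
print(extrapolated_increase(samples, 0, 60))   # 5.5, not the real 5
```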
> Cheers,
> Alin.
>
>>
>> Brian
>>  
>>
>>> Cheers,
>>> Alin.
>>>
>>> On Monday, March 5, 2018 at 6:14:37 PM UTC+1, [email protected] wrote:
>>>>
>>>> We have a requirement to calculate accurate availability figures for 
>>>> our applications. We have found that the metrics we need to make the 
>>>> calculations are already contained in the Prometheus databases that our 
>>>> components use. However, we are only able to get the results we need if 
>>>> we use the 'increase' function without the extrapolation. We were able 
>>>> to prove this by manipulating the data to make sure the time range 
>>>> boundary was far enough away from the first and last sample to prevent 
>>>> the extrapolation code from running.
>>>>
>>>> So we are considering options to export the data from Prometheus and 
>>>> replicate the increase function but without the extrapolation.
>>>>
>>>> This raises the question: would you accept a PR to add a new increase 
>>>> function that does 'rate' instead of 'extrapolatedRate'? The user would 
>>>> be able to decide which one to use for their needs.
>>>>
>>
>> -- 
>> Brian Brazil
>> www.robustperception.io
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/62842035-eee7-4eec-8082-f58dd0110df2n%40googlegroups.com.
