Hello, I have just submitted a proposal for a design (built on Alin's excellent work) that addresses the concerns discussed here. We have been running a fork that implements that design for about 6 months now with excellent results.
Issue: https://github.com/prometheus/prometheus/issues/12967
Design Proposal: https://docs.google.com/document/d/1CF5jhyxSD437c2aU2wHcvg88i8CjSPO3kMHsEaDRe2w/edit#heading=h.hzsa87ps5uhr

-Colin

On Monday, April 2, 2018 at 1:51:54 PM UTC-7 Alin Sînpălean wrote:
> [I'll give this a try, even though it is likely going to be marked as spam
> and left as such.]
>
> On Saturday, March 31, 2018 at 8:27:00 AM UTC+2, Brian Brazil wrote:
>> On 26 March 2018 at 14:38, Alin Sînpălean <[email protected]> wrote:
>>> You can do one of two things. I am also of the opinion that
>>> rate()/increase() should not extrapolate, but it doesn't look like that
>>> will change anytime soon, so both of these are workarounds to current
>>> Prometheus limitations.
>>>
>>> 1. Use foo - foo offset 1m instead of increase(foo[1m]). It will not
>>> take counter resets into account (you could handle that by evaluating
>>> every collection interval and adjusting for resets, if you actually
>>> care about that) and it will take twice as much CPU (2 series lookups
>>> instead of one), but it will give you an accurate increase, with no
>>> extrapolation.
>>
>> This is incorrect: it'll not be accurate, as metrics can't be accurate.
>> It's just a different, not 100% accurate, approximation.
>
> Yes, it is not going to be perfectly accurate but, as the OP states --
> "able to prove this by [preventing] the extrapolation code from running" --
> it will do the job for them. "Accurate" was the term used by the OP, BTW,
> to describe the results they got without increase() extrapolation.
>
>>> 2. If you want to take advantage of Prometheus' counter reset handling,
>>> use increase(foo[70s]) * 60 / 70 wherever you would normally use
>>> increase(foo[60s]) (assuming a collection interval of 10s). It
>>> basically computes the increase over 6 successive collections (7
>>> successive points), then undoes the extrapolation.
>>> Ugly, and requires you to take into account both collection and
>>> evaluation intervals (and hope they never change), but it works.
>>
>> This is not resilient to jitter, and is not a good approach. Generally
>> this will overestimate by 16% as you're multiplying by 1.16.
>
> No, Prometheus in general is not resilient to jitter. Outside of
> /query_range, which actually is, under the right conditions, i.e. no
> rate()/increase() extrapolation.
>
> Prometheus could be resilient to (eval) jitter if it wanted to, e.g. by
> delaying evaluation until all scrapes in progress were complete and then
> running the evaluation the way /query_range does, at exactly spaced
> intervals. But no one is asking for that here, AFAICT.
>
>> As I said, if the OP wants an accurate result they need logs.
>
> Umm, no. As the OP said, they want to prevent extrapolation to get
> "accurate enough" results for their needs. They never said they need
> perfect results.
>
> The only material difference between logs and metrics is that logs have
> (in theory) infinite resolution, whereas metrics (in the Prometheus world)
> have some fixed time resolution, decided ahead of time, plus scrape
> jitter. But as long as you don't fail a large number of successive scrapes
> (which is in many respects similar to a logs collector losing lots of log
> records on the way), you are still able to compute an increase over some
> interval. It may not be the exact interval you want (either because of
> scrape resolution or because of missed scrapes), but an exact increase
> over some interval it is. (In the logs case, if some log records go
> missing, you can't even get that.)
>
> In particular, if you do foo - foo offset 5m exactly every 5 minutes (the
> way /query_range does) and you have at least one successful scrape every
> 5 minutes, you will get a perfectly accurate increase, which you can then
> aggregate over time to get an accurate increase over e.g. 24 hours.
> It won't handle task restarts perfectly, but neither will logs.
>
>> If someone thinks that extrapolation is a problem then metrics cannot
>> meet their use case, as scrapes won't be perfectly aligned with the
>> data window of interest.
>
> I am someone who thinks extrapolation is a problem while being sure
> metrics can meet my use case, because it has nothing to do with perfectly
> aligned scrapes. I wouldn't mind if (because of scrape interval jitter) I
> ended up with a timeseries [0, 1, 2, 2, 4, 5] (instead of an ideal
> [0, 1, 2, 3, 4, 5]) and a total increase of 5. I do mind that from this
> imperfect timeseries Prometheus guesstimates an increase of 6, though.
> Or, to be more precise, some random fraction between 5.0 and 6.0
> (extrapolation to the right but not to the left, due to the 0), depending
> solely on when the kernel scheduler decides to schedule the evaluation.
>
> Cheers,
> Alin.
>
>> Brian
>
>>> Cheers,
>>> Alin.
>>>
>>> On Monday, March 5, 2018 at 6:14:37 PM UTC+1, [email protected] wrote:
>>>> We have a requirement to calculate accurate availability figures for
>>>> our applications. We have found that the metrics we need to make the
>>>> calculations are already contained in the Prometheus databases that
>>>> our components use. However, we are only able to get the results we
>>>> need if we use the 'increase' function without the extrapolation. We
>>>> were able to prove this by manipulating the data to make sure the time
>>>> range boundary was far enough away from the first and last sample to
>>>> prevent the extrapolation code from running.
>>>>
>>>> So we are considering options to export the data from Prometheus and
>>>> replicate the increase function, but without the extrapolation.
>>>>
>>>> This begs the question: would you accept a PR to add a new increase
>>>> function that does 'rate' instead of 'extrapolatedRate'? The user
>>>> would be able to decide which one to use for their needs.
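[Editor's note: the extrapolation being debated can be made concrete with a short sketch. The Python below mimics the logic of Prometheus's extrapolatedRate (promql/functions.go) as discussed in this thread; the function name and structure are mine, an approximation for illustration rather than the canonical implementation.]

```python
def extrapolated_increase(samples, range_start, range_end, is_counter=True):
    """Approximate Prometheus's extrapolated increase() over a range.

    samples: list of (timestamp_seconds, value), oldest first, all inside
    (range_start, range_end]. Counter resets within the window are ignored
    to keep the sketch short.
    """
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    result = last_v - first_v
    sampled_interval = last_t - first_t
    avg_between = sampled_interval / (len(samples) - 1)

    duration_to_start = first_t - range_start
    duration_to_end = range_end - last_t
    if is_counter and result > 0 and first_v >= 0:
        # Never extrapolate left past the point where the counter
        # would have been zero.
        duration_to_zero = sampled_interval * (first_v / result)
        duration_to_start = min(duration_to_start, duration_to_zero)

    # Extrapolate toward each range boundary, but only fully if the gap to
    # that boundary is at most ~one scrape interval.
    threshold = avg_between * 1.1
    extrapolate_to = sampled_interval
    extrapolate_to += duration_to_start if duration_to_start < threshold else avg_between / 2
    extrapolate_to += duration_to_end if duration_to_end < threshold else avg_between / 2
    return result * (extrapolate_to / sampled_interval)


# Alin's ideal series: scrapes every 10s, values 0..5, window [0, 60].
# The true increase is 5, but the last sample sits 10s before the range
# end, so the delta is scaled out to the right.
ideal = [(0, 0), (10, 1), (20, 2), (30, 3), (40, 4), (50, 5)]
print(extrapolated_increase(ideal, 0, 60))  # 6.0 rather than 5

# The increase(foo[70s]) * 60 / 70 trick: 7 points spanning 60s inside a
# 70s window get extrapolated out to ~70s, and the factor undoes that.
seven = [(5 + 10 * i, 10 + i) for i in range(7)]
print(extrapolated_increase(seven, 0, 70) * 60 / 70)  # ~6.0, the raw delta
```

Under these assumptions, the ideal [0..5] series over a 60s window yields 6.0 instead of 5, matching the guesstimate Alin objects to, and the * 60 / 70 correction recovers the unextrapolated delta when the samples land neatly inside the window.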
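[Editor's note: the foo - foo offset workaround also deserves a sketch, including the counter-reset caveat raised in the first message. The helper below is illustrative only, not a Prometheus API: it assumes aligned evaluation timestamps and at least one sample per step, and its reset handling is the crude "a negative delta means the new value is the increase since the reset" rule.]

```python
STEP = 300  # 5 minutes, in seconds


def summed_deltas(series, handle_resets=False):
    """Sum (foo - foo offset STEP) over aligned evaluation times.

    series: toy stand-in for a counter, mapping timestamp -> value,
    with a sample present at every STEP boundary.
    """
    total = 0.0
    for t in sorted(series):
        if t - STEP not in series:
            continue
        delta = series[t] - series[t - STEP]
        if delta < 0 and handle_resets:
            # Counter reset: count the post-reset value as the increase.
            delta = series[t]
        total += delta
    return total


# Without a reset, the aligned deltas telescope to an exact increase:
# 21 - 0 = 21, no extrapolation anywhere.
steady = {0: 0, 300: 7, 600: 7, 900: 20, 1200: 21}
print(summed_deltas(steady))  # 21.0

# With a restart between t=600 and t=900 the naive sum undercounts,
# which is the caveat noted above; crude reset handling recovers it.
restarted = {0: 0, 300: 7, 600: 7, 900: 2, 1200: 3}
print(summed_deltas(restarted))                      # 3.0
print(summed_deltas(restarted, handle_resets=True))  # 10.0
```

The telescoping in the steady case is why the approach gives an exact increase over the covered interval; the restarted case shows why increase()'s built-in counter-reset handling is the one thing the subtraction workaround gives up.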
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Prometheus Users" group.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-users/de54fe92-dde0-4253-ae86-92d0cfdcb6e3%40googlegroups.com.
>>
>> --
>> Brian Brazil
>> www.robustperception.io

--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/62842035-eee7-4eec-8082-f58dd0110df2n%40googlegroups.com.

