> they fail 100% of the time on that prometheus where it's down

Then you're lucky: in principle it's straightforward to debug.

- Get a shell on the affected Prometheus server.
- Use "curl" to do a manual scrape of the target which is down (using the same URL that the Targets list shows).
- If it fails, then you've taken Prometheus out of the equation.
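A minimal sketch of that manual check, assuming a hypothetical target URL — substitute the exact URL your Targets page shows. The `-w` timing report helps you see whether the time goes into connecting (network problem) or into waiting for the response body (slow exporter):

```shell
#!/bin/sh
# Manual scrape of a target that Prometheus reports as down.
# Pass the URL exactly as shown on the Prometheus Targets page.
manual_scrape() {
    url="$1"
    # -sS: quiet but still show errors; --max-time mimics scrape_timeout;
    # -w reports where the time was spent (connect vs total).
    if curl -sS --max-time 10 -o /dev/null \
            -w 'connect=%{time_connect}s total=%{time_total}s\n' \
            "$url"; then
        echo "scrape OK: the target answers, so look at Prometheus itself"
    else
        echo "scrape FAILED: the problem is the network or the target, not Prometheus"
    fi
}

# Example (hypothetical pod address):
# manual_scrape "http://10.42.0.17:9102/metrics"
```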
My best guesses would be (1) network connectivity between the Prometheus server and the affected pods, or (2) service discovery giving wrong information (i.e. you're scraping the wrong URL in the first place).

In case (2), I note that you're getting the targets to scrape from pod annotations. Look carefully at the values of those annotations, and at how they are mapped into scrape address/port/path for the affected pods.

On Tuesday, 5 September 2023 at 11:45:04 UTC+1 Анастасия Зель wrote:

> Actually the targets are on different k8s nodes, but they fail 100% of the
> time on the Prometheus where they're down.
> I got a list of all the down pod targets and noticed that the number of down
> pod targets is the same on both prometheus nodes - 306 down pod targets. But
> they are different targets :D
> Yes, they scrape the same URLs of the pods.
> On Tuesday, 5 September 2023 at 10:32:15 UTC+4, Brian Candler wrote:
>
>> Note that setting the scrape timeout longer than the scrape interval
>> won't achieve anything.
>>
>> I'd suggest you investigate by looking at the history of the "up" metric:
>> this will go to zero on scrape failures. Can you discern a pattern? Is it
>> only on a certain type of target, or targets running on a particular k8s
>> node? Is it intermittent across all targets, or are there some targets which
>> fail 100% of the time?
>>
>> If you compare the Targets page on both servers, are they scraping
>> exactly the same URLs? (That is, check whether service discovery is giving
>> different results.)
>>
>> On Tuesday, 5 September 2023 at 06:09:55 UTC+1 Анастасия Зель wrote:
>>
>>> Yes, I see errors on the targets page in the web interface.
>>> I tried increasing the timeout to 5 minutes and it changed nothing.
>>> It's strange because prometheus 2 always gets this error on similar pods,
>>> and prometheus 1 never gets these errors on those pods.
>>> On Monday, 4 September 2023
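For reference, the annotation-to-target mapping usually follows the pattern from the well-known example Prometheus Kubernetes configuration — your actual relabel_configs may differ, so compare carefully. This is exactly the place where a wrong annotation value turns into a wrong scrape address/port/path:

```yaml
# Sketch of the usual annotation-driven relabelling (not your exact config).
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only scrape pods annotated prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # prometheus.io/path overrides the default /metrics path
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # prometheus.io/port overrides the discovered port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
```

If a pod carries a stale or wrong `prometheus.io/port` or `prometheus.io/path` annotation, both servers would discover the same wrong URL — so a mismatch between the two servers' Targets pages points at discovery, while identical-but-failing URLs point at the network path from each server.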
at 19:00:32 UTC+4, Brian Candler wrote:
>>>
>>>> On Monday, 4 September 2023 at 15:49:25 UTC+1 Анастасия Зель wrote:
>>>>
>>>> Hello, we use HA prometheus with two servers.
>>>>
>>>> You mean, two Prometheus servers with the same config, both scraping
>>>> the same targets?
>>>>
>>>> The problem is we get different metrics in dashboards from these two
>>>> servers.
>>>>
>>>> Small differences are to be expected. That's because the two servers
>>>> won't be scraping the targets at the same points in time. If you see more
>>>> significant differences, then please provide some examples.
>>>>
>>>> And we also scrape metrics from k8s, and some pods are not being scraped
>>>> because of the error "context deadline exceeded".
>>>>
>>>> That basically means "scrape timed out". The scrape hadn't completed
>>>> within the "scrape_timeout:" value that you've set. You'll need to look at
>>>> your individual exporters and the failing scrape URLs: either the target is
>>>> not reachable at all (e.g. a firewalling or network configuration issue), or
>>>> the target is taking too long to respond.
>>>>
>>>> It's different pods on each server. In the prometheus logs we don't see
>>>> any errors.
>>>>
>>>> Where *do* you see the "context deadline exceeded" errors then?
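The timeout relationship discussed above can be illustrated with a minimal config fragment (hypothetical values). Recent Prometheus versions reject a scrape_timeout larger than the scrape_interval at config-load time, which is why raising the timeout to 5 minutes against a default 1-minute interval cannot help — the effective deadline is bounded by the interval:

```yaml
# "context deadline exceeded" == the scrape exceeded scrape_timeout.
global:
  scrape_interval: 30s   # how often each target is scraped
  scrape_timeout: 25s    # must be <= scrape_interval
```

If a target genuinely needs longer than the interval to produce its metrics, raise both values together for that job, or fix the slow exporter.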

