[prometheus-users] Re: Prometheus HA different metrics

Анастасия Зель Tue, 05 Sep 2023 06:26:14 -0700

Yes, I think a manual scrape would be useful, but remember that these are k8s 
pods :)
I only have the pod IP, and I can't reach it from the Prometheus node because 
they are in different subnets. The pod subnet has no access to the outside 
network, so I don't know how to manually scrape a particular pod target from 
the Prometheus server.


But thank you for your suggestions, I will check them out.
On Tuesday, 5 September 2023 at 15:06:30 UTC+4, Brian Candler wrote:

> > they fail 100% of the time on the Prometheus server where they are down
>
> Then you're lucky: in principle it's straightforward to debug.
> - get a shell on the affected prometheus server
> - use "curl" to do a manual scrape of the target which is down (using the 
> same URL that the Targets list shows)
> - if it fails, then you've taken Prometheus out of the equation.
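
If curl is not installed on the Prometheus host, the same manual check can be sketched in Python. This is only a rough stand-in for the curl step above, not anything Prometheus itself provides: the function name `manual_scrape` and the 10 second default timeout are my own assumptions, and the URL should be whatever the Targets page shows for the failing target.

```python
import time
import urllib.error
import urllib.request


def manual_scrape(url, timeout_s=10.0):
    """Fetch a metrics URL once and report what happened.

    Returns (ok, detail, elapsed_seconds). The name and defaults are
    hypothetical; pass the exact URL shown on the Targets page.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read()
            ok = resp.status == 200
            detail = f"HTTP {resp.status}, {len(body)} bytes"
    except urllib.error.HTTPError as exc:
        # The target answered, but with an error status (e.g. 404 on a wrong path).
        ok, detail = False, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError and socket timeouts are OSError subclasses: this covers
        # connection refused, no route to host, DNS failure and timeouts.
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, detail, time.monotonic() - start
```

Calling `manual_scrape("http://<pod-ip>:<port>/metrics")` from the affected server and from the healthy one should show quickly whether the problem is reachability (an exception or a timeout) or a slow or wrong endpoint (an HTTP error or a long elapsed time).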
>
> My best guesses would be (1) Network connectivity between the Prometheus 
> server and the affected pods, or (2) service discovery is giving wrong 
> information (i.e. you're scraping the wrong URL in the first place)
>
> In case (2), I note that you're getting the targets to scrape from pod 
> annotations. Look carefully at the values of those annotations, and how 
> they are mapped into scrape address/port/path for the affected pods.
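
To make the annotation point concrete, here is a small Python sketch of how annotations usually end up as a scrape URL. It assumes the widely copied `prometheus.io/scrape`, `prometheus.io/port` and `prometheus.io/path` convention and its usual relabel regex; the actual config in this thread was never posted, so treat every name here as an assumption and compare against your own relabel rules.

```python
import re

# Regex from the commonly used example relabel rule that joins the
# discovered address with the port annotation. This is an assumption:
# the thread does not show the real scrape config.
_ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def scrape_url_from_annotations(pod_address, annotations, scheme="http"):
    """Return the URL that would be scraped, or None if the pod is not opted in.

    pod_address is the discovered address, e.g. "10.0.0.5:8080" or "10.0.0.5".
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # the "keep" rule drops pods without scrape=true
    address = pod_address
    port = annotations.get("prometheus.io/port", "")
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = _ADDRESS_RE.fullmatch(f"{pod_address};{port}")
    if match:
        address = f"{match.group(1)}:{match.group(2)}"
    # An empty or missing path annotation leaves the default metrics path.
    path = annotations.get("prometheus.io/path") or "/metrics"
    return f"{scheme}://{address}{path}"
```

Feeding in the annotations of one of the down pods (from `kubectl get pod -o yaml`) and comparing the result with the URL on the Targets page is a quick way to spot a wrong port or path annotation.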
>
> On Tuesday, 5 September 2023 at 11:45:04 UTC+1 Анастасия Зель wrote:
>
>> Actually, the targets are on different k8s nodes, but they fail 100% of the 
>> time on the Prometheus server where they are down. 
>> I got a list of all down pod targets and noticed that the number of down 
>> targets is the same on both Prometheus nodes: 306. But they are different 
>> targets :D
>> Yes, both servers scrape the same pod URLs.
>> On Tuesday, 5 September 2023 at 10:32:15 UTC+4, Brian Candler wrote:
>>
>>> Note that setting the scrape timeout longer than the scrape interval 
>>> won't achieve anything.
>>>
>>> I'd suggest you investigate by looking at the history of the "up" 
>>> metric: this will go to zero on scrape failures.  Can you discern a 
>>> pattern?  Is it only on a certain type of target, or targets running on a 
>>> particular k8s node?  Is it intermittent across all targets, or some 
>>> targets which fail 100% of the time?
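
One way to look for that pattern, sketched in Python: take the JSON returned by a range query for `up` (the `/api/v1/query_range` endpoint of the Prometheus HTTP API) and sort each target into "always down", "always up" or "intermittent". The function name and the choice of the `instance` label as the key are assumptions for illustration; a node label could be used instead if your relabeling attaches one.

```python
def classify_up_history(query_range_response, label="instance"):
    """Classify each `up` series from a query_range response.

    Expects the parsed JSON of a matrix result, i.e.
    response["data"]["result"] is a list of
    {"metric": {...}, "values": [[timestamp, "0" or "1"], ...]}.
    Returns {label_value: "always_down" | "always_up" | "intermittent"}.
    """
    classified = {}
    for series in query_range_response["data"]["result"]:
        key = series["metric"].get(label, "<unknown>")
        # Sample values arrive as strings in the JSON API.
        samples = [float(value) for _ts, value in series["values"]]
        if not samples:
            continue
        if all(s == 0 for s in samples):
            classified[key] = "always_down"
        elif all(s == 1 for s in samples):
            classified[key] = "always_up"
        else:
            classified[key] = "intermittent"
    return classified
```

Targets that come out as "always_down" on only one of the two servers point towards reachability or discovery, while "intermittent" ones point towards slow responses or load.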
>>>
>>> If you compare the Targets page on both servers, are they scraping 
>>> exactly the same URLs?  (That is, check whether service discovery is giving 
>>> different results)
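
With hundreds of targets, comparing the two Targets pages by eye is tedious. A rough Python sketch of the same comparison, working on the JSON that each server returns from `/api/v1/targets` (fetched separately and parsed); the function names and the shape of the result are my own, only the `activeTargets`, `scrapeUrl` and `health` fields come from the Prometheus HTTP API.

```python
def scrape_urls(targets_response):
    """All scrape URLs a server has discovered."""
    return {t["scrapeUrl"] for t in targets_response["data"]["activeTargets"]}


def down_urls(targets_response):
    """Scrape URLs the server currently reports as down."""
    return {
        t["scrapeUrl"]
        for t in targets_response["data"]["activeTargets"]
        if t["health"] == "down"
    }


def compare_servers(resp_a, resp_b):
    """Diff discovery results and down targets between two servers."""
    urls_a, urls_b = scrape_urls(resp_a), scrape_urls(resp_b)
    down_a, down_b = down_urls(resp_a), down_urls(resp_b)
    return {
        # Non-empty lists here mean service discovery differs.
        "only_discovered_by_a": sorted(urls_a - urls_b),
        "only_discovered_by_b": sorted(urls_b - urls_a),
        # Same URL, different health: look at the network path per server.
        "down_only_on_a": sorted(down_a - down_b),
        "down_only_on_b": sorted(down_b - down_a),
        "down_on_both": sorted(down_a & down_b),
    }
```

If both "only_discovered" lists are empty, discovery agrees and the differing down sets are more likely a connectivity question between each server and particular pods.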
>>>
>>> On Tuesday, 5 September 2023 at 06:09:55 UTC+1 Анастасия Зель wrote:
>>>
>>>> Yes, I see the errors on the Targets page in the web interface.
>>>> I tried increasing the timeout to 5 minutes and it changed nothing. 
>>>> It's strange, because Prometheus 2 always gets this error on the same 
>>>> pods, and Prometheus 1 never gets these errors on those pods. 
>>>> On Monday, 4 September 2023 at 19:00:32 UTC+4, Brian Candler wrote:
>>>>
>>>>> On Monday, 4 September 2023 at 15:49:25 UTC+1 Анастасия Зель wrote:
>>>>>
>>>>> Hello, we use HA prometheus with two servers.
>>>>>
>>>>> You mean, two Prometheus servers with the same config, both scraping 
>>>>> the same targets?
>>>>>
>>>>>  
>>>>>
>>>>> The problem is we get different metrics in dashboards from these two 
>>>>> servers.
>>>>>
>>>>> Small differences are to be expected.  That's because the two servers 
>>>>> won't be scraping the targets at the same points in time.  If you see 
>>>>> more significant differences, then please provide some examples.
>>>>>
>>>>>  
>>>>>
>>>>> And we also scrape metrics from k8s, and some pods are not being scraped 
>>>>> because of the error "context deadline exceeded".
>>>>>
>>>>> That basically means "scrape timed out".  The scrape hadn't completed 
>>>>> within the "scrape_timeout:" value that you've set.  You'll need to look 
>>>>> at your individual exporters and the failing scrape URLs: either the 
>>>>> target is not reachable at all (e.g. firewalling or network configuration 
>>>>> issue), or the target is taking too long to respond.
>>>>>  
>>>>>
>>>>> It's different pods on each server. In the Prometheus logs we don't see 
>>>>> any errors.
>>>>>
>>>>> Where *do* you see the "context deadline exceeded" errors then?
>>>>>
>>>>
