> On 06 Dec 2015, at 18:21, Peter Eckel <li...@eckel-edv.de> wrote:
>
> Hello,
>
> I'm currently evaluating a setup for a customer who has very specific needs
> in terms of reliability and security. While setting up a minimal test
> environment for a pilot installation I ran into an interesting problem -
> maybe someone has seen it before and can give me a hint.
>
> The basic setup consists of two zones, one called 'master' and one called
> 'satellite'. Both zones contain two endpoints:
>
> object Zone "global" {
>   global = true
> }
>
> object Endpoint "icinga2-master1.vm.hindenburgring.com" {
>   host = "icinga2-master1.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-master2.vm.hindenburgring.com" {
>   host = "icinga2-master2.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-satellite1.vm.hindenburgring.com" {
>   host = "icinga2-satellite1.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-satellite2.vm.hindenburgring.com" {
>   host = "icinga2-satellite2.vm.hindenburgring.com"
> }
>
> object Zone "master" {
>   endpoints = [ "icinga2-master1.vm.hindenburgring.com",
>                 "icinga2-master2.vm.hindenburgring.com" ]
> }
>
> object Zone "satellite" {
>   parent = "master"
>   endpoints = [ "icinga2-satellite1.vm.hindenburgring.com",
>                 "icinga2-satellite2.vm.hindenburgring.com" ]
> }
>
> In the final setup there will be seven satellite zones, and the zones will
> also get some command execution nodes, but the problem is reproducible with
> this minimal setup.
>
> Perfdata storage for the environment uses Graphite on a separate cluster,
> behind HAProxy and two carbon-relay instances for data replication. This
> setup has already been tested and works reliably. Writing perfdata from all
> Icinga 2 nodes involved also works very reliably; all the data are written
> without any issues. So far it's perfect (and exactly what the customer
> requires).
>
> The problem starts when I bring up the second satellite endpoint.
> Load on the Graphite cluster increases massively, and the number of updates
> per second jumps from about 6-7 to something in the range of 3000(!), with
> fewer than 350 Whisper files in Graphite. When I enable update logging on
> Graphite, I see that the vast majority of the updates come from one of the
> satellite endpoints:
>
>> [...]
>> 06/12/2015 15:35:08 :: wrote 441 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_ok.value
>> in 0.00660 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.min_execution_time.value
>> in 0.00105 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_flapping.value
>> in 0.00077 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_up.value
>> in 0.00419 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_unknown.value
>> in 0.00075 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.avg_latency.value
>> in 0.00070 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.max_execution_time.value
>> in 0.00063 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_pending.value
>> in 0.00075 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_warning.value
>> in 0.00066 seconds
>> [...]
>
> The extreme load drops to normal values immediately once I shut down either
> of the satellites - it doesn't have to be the one that's doing the extreme
> amount of perfdata writing.
Does that pattern repeat? It seems your "icinga" check is running at a
frequent interval, producing that amount of performance data. Please show the
service object for your "icinga" check configuration.

Kind regards,
Michael

> The behaviour is absolutely reproducible. Since I set up the Icinga 2
> clusters with Ansible/Vagrant, it isn't a problem to delete the whole
> environment and set it up from scratch, and I did that several times without
> any change - as soon as the second satellite is started, the metric rate
> skyrockets, without any other issues (even the measurements look absolutely
> OK). No errors or warnings whatsoever can be seen in the 'normal' Icinga 2
> log at any time.
>
> Shutting down a master node does not have any effect on the phenomenon.
>
> Since I don't have much experience with debugging Icinga 2 HA issues, I
> can't tell much from the debug logs. If it helps, I can provide debug logs
> from the four nodes that I sampled during a test with several
> startup/shutdown cycles of the satellites. The debug log of the primary
> master shows that the metric write rate increases dramatically when both
> satellites are up and connected. In the satellite's debug log there is an
> extreme frequency of log entries of the form
>
>> [2015-12-06 17:00:44 +0000] notice/ApiListener: Relaying
>> 'event::CheckResult' message
>> [2015-12-06 17:00:44 +0000] notice/ApiListener: Sending message to
>> 'icinga2-satellite1.vm.hindenburgring.com'
>> [2015-12-06 17:00:44 +0000] notice/JsonRpcConnection: Received
>> 'event::CheckResult' message from 'icinga2-satellite1.vm.hindenburgring.com'
>> [2015-12-06 17:00:44 +0000] debug/Checkable: No origin or local origin for
>> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
>> icinga2-satellite2.vm.hindenburgring.com as check_source.
>> [2015-12-06 17:00:44 +0000] debug/Checkable: command_endpoint found for
>> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
>> icinga2-satellite1.vm.hindenburgring.com as check_source.
>
> These messages appear in quantities of around 350 per second as long as both
> satellites are active, and they stop immediately once one of them is shut
> down. Not sure if this points to the root cause, but it's definitely
> connected.
>
> I would be very grateful for any idea that helps me sort out what could be
> wrong.
>
> Best regards,
>
> Peter.
>
> _______________________________________________
> icinga-users mailing list
> icinga-users@lists.icinga.org
> https://lists.icinga.org/mailman/listinfo/icinga-users

-- 
Michael Friedrich, DI (FH)
Senior Developer

NETWAYS GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB18461
http://www.netways.de | michael.friedr...@netways.de

** OSDC 2016 - April - netways.de/osdc **
** OSBConf 2016 - September - osbconf.org **
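For reference, the service object Michael asks about is typically defined
along these lines: a Service using the built-in "icinga" CheckCommand, whose
performance data produces exactly the perfdata metrics seen in the Graphite
log above. This is only a sketch with assumed host name and interval values,
not the actual configuration from Peter's setup:

  object Service "icinga" {
    host_name     = "icinga2-satellite1.vm.hindenburgring.com"
    check_command = "icinga"

    // Assumed values for illustration; a much shorter check_interval
    // (e.g. 1s) would on its own explain a high perfdata update rate
    // for these metrics.
    check_interval = 1m
    retry_interval = 30s
  }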