Hello, I'm currently evaluating a setup for a customer who has very specific requirements in terms of reliability and security. While setting up a minimal test environment for a pilot installation I ran into an interesting problem - maybe someone has seen it before and can give me a hint.
The basic setup consists of two zones, one called 'master' and one called 'satellite'. Both zones contain two endpoints:

object Zone "global" {
  global = true
}

object Endpoint "icinga2-master1.vm.hindenburgring.com" {
  host = "icinga2-master1.vm.hindenburgring.com"
}

object Endpoint "icinga2-master2.vm.hindenburgring.com" {
  host = "icinga2-master2.vm.hindenburgring.com"
}

object Endpoint "icinga2-satellite1.vm.hindenburgring.com" {
  host = "icinga2-satellite1.vm.hindenburgring.com"
}

object Endpoint "icinga2-satellite2.vm.hindenburgring.com" {
  host = "icinga2-satellite2.vm.hindenburgring.com"
}

object Zone "master" {
  endpoints = [ "icinga2-master1.vm.hindenburgring.com", "icinga2-master2.vm.hindenburgring.com" ]
}

object Zone "satellite" {
  parent = "master"
  endpoints = [ "icinga2-satellite1.vm.hindenburgring.com", "icinga2-satellite2.vm.hindenburgring.com" ]
}

In the final setup there will be seven satellite zones, and the zones will also get some command execution nodes, but the problem is reproducible with this minimal setup.

Perfdata storage for the environment uses Graphite on a separate cluster, behind HAProxy and two carbon-relay instances for data replication. This setup has already been tested and works reliably. Writing perfdata from all Icinga 2 nodes involved also works very reliably; all the data are written without any issues. So far it's perfect (and exactly what the customer requires).

The problem starts when I bring up the second satellite endpoint. Load on the Graphite cluster increases massively, and the number of updates per second jumps from about 6-7 to something in the range of 3000(!), with fewer than 350 Whisper files in Graphite. When I enable update logging on Graphite I see that the vast majority of the updates come from one of the satellite endpoints:

> [...]
> 06/12/2015 15:35:08 :: wrote 441 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_ok.value
> in 0.00660 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.min_execution_time.value
> in 0.00105 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_flapping.value
> in 0.00077 seconds
> 06/12/2015 15:35:08 :: wrote 442 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_up.value
> in 0.00419 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_unknown.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.avg_latency.value
> in 0.00070 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.max_execution_time.value
> in 0.00063 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_pending.value
> in 0.00075 seconds
> 06/12/2015 15:35:08 :: wrote 443 datapoints for
> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_warning.value
> in 0.00066 seconds
> [...]
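For completeness, perfdata is shipped from each node via the GraphiteWriter feature; the object on my nodes looks roughly like the sketch below (the host name and port are placeholders for the HAProxy frontend in front of the carbon-relays, not the customer's real values):

object GraphiteWriter "graphite" {
  // placeholder: address/port of the HAProxy frontend that balances to the two carbon-relays
  host = "graphite-lb.vm.hindenburgring.com"
  port = 2003
}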
The extreme load drops back to normal values immediately once I shut down either of the satellites - it doesn't have to be the one that's doing the extreme amount of perfdata writing. The behaviour is absolutely reproducible. Since I set up the Icinga 2 clusters with Ansible/Vagrant it isn't a problem to delete the whole environment and set it up from scratch, and I did that several times without any change - as soon as the second satellite is started, the metric rate skyrockets, without any other issues (even the measurements look absolutely OK). No errors or warnings whatsoever can be seen in the 'normal' Icinga 2 log at any time. Shutting down a master node does not have any effect on the phenomenon.

Since I don't have much experience with debugging Icinga 2 HA issues, I can't tell much from the debug logs. If it helps, I can provide debug logs from the four nodes that I sampled during a test with several startup/shutdown cycles of the satellites. The debug log of the primary master shows that the metric write rate increases dramatically when both satellites are up and connected. In the satellite's debug log there is an extreme frequency of log entries of the form

> [2015-12-06 17:00:44 +0000] notice/ApiListener: Relaying 'event::CheckResult'
> message
> [2015-12-06 17:00:44 +0000] notice/ApiListener: Sending message to
> 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] notice/JsonRpcConnection: Received
> 'event::CheckResult' message from 'icinga2-satellite1.vm.hindenburgring.com'
> [2015-12-06 17:00:44 +0000] debug/Checkable: No origin or local origin for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite2.vm.hindenburgring.com as check_source.
> [2015-12-06 17:00:44 +0000] debug/Checkable: command_endpoint found for
> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
> icinga2-satellite1.vm.hindenburgring.com as check_source.

These messages appear at a rate of around 350 per second as long as both satellites are active, and they stop immediately once one of them is shut down. I'm not sure whether this points to the root cause, but it's definitely connected.

I would be very grateful for any idea that helps me sort out what could be wrong.

Best regards,
Peter
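P.S.: In case it matters, the 'swap' service mentioned in the debug output is applied with a command_endpoint pointing at the monitored node itself. A rough sketch of the apply rule, paraphrased from memory (the assign condition is an assumption here, not the literal config):

apply Service "swap" {
  check_command = "swap"
  // each check is pinned to the node it monitors, matching the
  // "command_endpoint found ... setting icinga2-satellite1... as check_source" debug line
  command_endpoint = host.name
  // assumption: in the test environment the rule simply matches the satellite hosts
  assign where host.zone == "satellite"
}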