> On 06 Dec 2015, at 18:21, Peter Eckel <li...@eckel-edv.de> wrote:
>
> Hello,
>
> I'm currently evaluating a setup for a customer who has very specific needs
> in terms of reliability and security. While setting up a minimal test
> environment for a pilot installation I ran into an interesting problem -
> maybe someone has seen it before and can give me a hint.
>
> The basic setup consists of two zones, one called 'master' and one called
> 'satellite'. Both zones contain two endpoints:
>
> object Zone "global" {
>   global = true
> }
>
> object Endpoint "icinga2-master1.vm.hindenburgring.com" {
>   host = "icinga2-master1.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-master2.vm.hindenburgring.com" {
>   host = "icinga2-master2.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-satellite1.vm.hindenburgring.com" {
>   host = "icinga2-satellite1.vm.hindenburgring.com"
> }
>
> object Endpoint "icinga2-satellite2.vm.hindenburgring.com" {
>   host = "icinga2-satellite2.vm.hindenburgring.com"
> }
>
> object Zone "master" {
>   endpoints = [ "icinga2-master1.vm.hindenburgring.com",
>                 "icinga2-master2.vm.hindenburgring.com" ]
> }
>
> object Zone "satellite" {
>   parent = "master"
>   endpoints = [ "icinga2-satellite1.vm.hindenburgring.com",
>                 "icinga2-satellite2.vm.hindenburgring.com" ]
> }
>
> In the final setup there will be seven satellite zones, and the zones will
> also get some command execution nodes, but the problem is reproducible with
> this minimal setup.
>
> Perfdata storage for the environment uses Graphite on a separate cluster,
> behind HAProxy and two carbon-relay instances for data replication. This
> setup has already been tested and works reliably. Writing perfdata from all
> Icinga 2 nodes involved also works very reliably; all the data are written
> without any issues. So far it's perfect (and exactly what the customer
> requires).
>
> The problem starts when I bring up the second satellite endpoint.
> Load on the Graphite cluster increases massively, and the number of updates
> per second jumps from about 6-7 to something in the range of 3000(!), with
> fewer than 350 Whisper files in Graphite. When I enable update logging on
> Graphite, I see that the vast majority of the updates come from one of the
> satellite endpoints:
>
>> [...]
>> 06/12/2015 15:35:08 :: wrote 441 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_ok.value
>> in 0.00660 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.min_execution_time.value
>> in 0.00105 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_flapping.value
>> in 0.00077 seconds
>> 06/12/2015 15:35:08 :: wrote 442 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_hosts_up.value
>> in 0.00419 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_unknown.value
>> in 0.00075 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.avg_latency.value
>> in 0.00070 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.max_execution_time.value
>> in 0.00063 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_pending.value
>> in 0.00075 seconds
>> 06/12/2015 15:35:08 :: wrote 443 datapoints for
>> icinga2.icinga2-satellite1_vm_hindenburgring_com.services.icinga.icinga.perfdata.num_services_warning.value
>> in 0.00066 seconds
>> [...]
>
> The extreme load drops to normal values immediately once I shut down either
> of the satellites - it doesn't have to be the one that's doing the extreme
> amount of perfdata writing.
Does that pattern repeat? It seems your "icinga" check is running at a
frequent interval, producing that amount of performance data. Please show the
service object for your "icinga" check configuration.

Kind regards,
Michael

> The behaviour is absolutely reproducible. Since I set up the Icinga 2
> clusters with Ansible/Vagrant, it isn't a problem to delete the whole
> environment and set it up from scratch, and I did that several times without
> any change - as soon as the second satellite is started, the metric rate
> skyrockets, without any other issues (even the measurements look absolutely
> OK). No errors or warnings whatsoever can be seen in the 'normal' Icinga 2
> log at any time.
>
> Shutting down a master node does not have any effect on the phenomenon.
>
> Since I don't have much experience with debugging Icinga 2 HA issues, I
> can't tell much from the debug logs. If it helps, I can provide debug logs
> from the four nodes that I sampled during a test with several
> startup/shutdown cycles of the satellites. The debug log of the primary
> master shows that the metric write rate increases dramatically when both
> satellites are up and connected. In the satellite's debug log there is an
> extreme frequency of log entries of the form
>
>> [2015-12-06 17:00:44 +0000] notice/ApiListener: Relaying
>> 'event::CheckResult' message
>> [2015-12-06 17:00:44 +0000] notice/ApiListener: Sending message to
>> 'icinga2-satellite1.vm.hindenburgring.com'
>> [2015-12-06 17:00:44 +0000] notice/JsonRpcConnection: Received
>> 'event::CheckResult' message from 'icinga2-satellite1.vm.hindenburgring.com'
>> [2015-12-06 17:00:44 +0000] debug/Checkable: No origin or local origin for
>> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
>> icinga2-satellite2.vm.hindenburgring.com as check_source.
>> [2015-12-06 17:00:44 +0000] debug/Checkable: command_endpoint found for
>> object 'icinga2-satellite1.vm.hindenburgring.com!swap', setting
>> icinga2-satellite1.vm.hindenburgring.com as check_source.
>
> These messages appear in quantities of around 350 per second as long as both
> satellites are active, and they stop immediately once one of them is shut
> down. Not sure if this points to the root cause, but it's definitely
> connected.
>
> I would be very grateful for any idea that helps me sort out what could be
> wrong.
>
> Best regards,
>
> Peter.
>
> _______________________________________________
> icinga-users mailing list
> icinga-users@lists.icinga.org
> https://lists.icinga.org/mailman/listinfo/icinga-users

-- 
Michael Friedrich, DI (FH)
Senior Developer

NETWAYS GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB18461
http://www.netways.de | michael.friedr...@netways.de

** OSDC 2016 - April - netways.de/osdc **
** OSBConf 2016 - September - osbconf.org **
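For reference, the service object Michael asks about is typically defined
along these lines: a Service using the built-in "icinga" CheckCommand, whose
performance data produces exactly the perfdata metrics seen in the Graphite
log above. This is only a sketch with assumed host name and interval values,
not the actual configuration from Peter's setup:

  object Service "icinga" {
    host_name     = "icinga2-satellite1.vm.hindenburgring.com"
    check_command = "icinga"

    // Assumed values for illustration; a much shorter check_interval
    // (e.g. 1s) would on its own explain a high perfdata update rate
    // for these metrics.
    check_interval = 1m
    retry_interval = 30s
  }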