Hi,

I've been having issues with Icinga2 for a while now, including with the latest release (r2.2.3-1).
I have an HA setup with 2 nodes in the master zone. They run remote checks on a number of other nodes using the API. It works fine, except that I always get a handful of services on remote nodes with status "UNKNOWN", seemingly at random. Running "service icinga reload" on the checker nodes multiple times (a minute or so apart) seems to fix it and reduces the number of those "UNKNOWN" services, but over time, new ones creep in.

Specifically, the status information for those services is (on checker node A) "Remote Icinga instance 'XYZ' is not connected.". However, on its buddy, checker node B, everything is fine for that host/service.

Here's an example of the status page of the same service, taken from both checker nodes at the same point in time:

------------------------------------------------------------------------
On checker A (host mon1-1):

Current Status:                 UNKNOWN (for 0d 10h 52m 3s)
Status Information:             Remote Icinga instance 've38-b1' is not connected.
Performance Data:
Current Attempt:                1/3 (HARD state)
Last Check Time:                01-16-2015 08:05:42
Check Type:                     ACTIVE
Check Source / Reachability:    mon1-2 / true
Check Latency / Duration:       0.000 / 0.000 seconds
Next Scheduled Active Check:    01-16-2015 09:57:48
Last State Change:              01-15-2015 23:05:42
Last Notification:              N/A (notification 0)
Is This Service Flapping?       NO (0.00% state change)
In Scheduled Downtime?          NO
Last Update:                    01-16-2015 09:57:34 ( 0d 0h 0m 11s ago)
------------------------------------------------------------------------
On checker B (host mon1-2):

Current Status:                 OK (for 0d 1h 4m 10s)
Status Information:             DISK OK - free space: /var/lib/vz 77070 MB (99% inode=99%):
Performance Data:               /var/lib/vz=179MB;;69525;0;77250
Current Attempt:                1/3 (HARD state)
Last Check Time:                01-16-2015 09:55:48
Check Type:                     ACTIVE
Check Source / Reachability:    ve38-b1 / true
Check Latency / Duration:       0.000 / 0.001 seconds
Next Scheduled Active Check:    01-16-2015 09:57:48
Last State Change:              01-16-2015 08:53:32
Last Notification:              01-16-2015 08:55:32 (notification 0)
Is This Service Flapping?       NO (0.00% state change)
In Scheduled Downtime?          NO
Last Update:                    01-16-2015 09:57:37 ( 0d 0h 0m 5s ago)
------------------------------------------------------------------------

Performance data is collected correctly, but only on mon1-2 in this case. The corresponding pnp4nagios graphs on mon1-1 show a gap for the duration of the "UNKNOWN" status.

So am I correct in assuming that the checking itself works fine, and that the issue lies in how mon1-1 and mon1-2 share check results with each other? The (debug) logs don't seem to show any anomalies on any of the nodes involved (mon1-1, mon1-2 and the remote node).

Any pointers?

Thanks so much,
Florian

_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users