Hi,

I've been having issues with Icinga2 for a while now, including with the
latest release (r2.2.3-1).

I have an HA setup with 2 nodes in the master zone. They run remote
checks on a number of other nodes using the API. It works fine, except
that a handful of services on remote nodes always end up with status
"UNKNOWN", seemingly at random.

Running "service icinga reload" on the checker nodes multiple times (a
minute or so apart) seems to fix it and reduce the number of those
"UNKNOWN" services, but over time, new ones creep in.

Specifically, the status information for those services is (on checker
node A) "Remote Icinga instance 'XYZ' is not connected.". However, on
its buddy, checker node B, everything is fine for that host/service.
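
For context, the relationship between the two checker nodes and a remote
node like the one in the example below would typically be declared in
zones.conf along these lines. This is a simplified sketch, not a copy of
the actual configuration: the endpoint names are the real host names,
but the host attributes, the child zone name and its parent setting are
assumptions.

    // Both checker nodes are endpoints of the same (HA) master zone.
    object Endpoint "mon1-1" {
      host = "mon1-1"   // assumed: address equals the endpoint name
    }

    object Endpoint "mon1-2" {
      host = "mon1-2"   // assumed
    }

    object Zone "master" {
      endpoints = [ "mon1-1", "mon1-2" ]
    }

    // One remote node, checked via the API.
    object Endpoint "ve38-b1" {
      host = "ve38-b1"  // assumed
    }

    // Assumed: each remote node sits in its own child zone of "master".
    object Zone "ve38-b1" {
      endpoints = [ "ve38-b1" ]
      parent = "master"
    }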

Here's an example of the status page of the same service, taken from
both checker nodes, at the same point in time:

------------------------------------------------------------------------

On checker A (host mon1-1):

Current Status:   UNKNOWN   (for 0d 10h 52m 3s)
Status Information:     Remote Icinga instance 've38-b1' is not connected.
Performance Data:       
Current Attempt:        1/3  (HARD state)
Last Check Time:        01-16-2015 08:05:42
Check Type:     ACTIVE
Check Source / Reachability:    mon1-2 / true
Check Latency / Duration:       0.000 / 0.000 seconds
Next Scheduled Active Check:    01-16-2015 09:57:48
Last State Change:      01-15-2015 23:05:42
Last Notification:      N/A (notification 0)
Is This Service Flapping?         NO   (0.00% state change)
In Scheduled Downtime?    NO
Last Update:    01-16-2015 09:57:34  ( 0d 0h 0m 11s ago)

------------------------------------------------------------------------

On checker B (host mon1-2):

Current Status:   OK   (for 0d 1h 4m 10s)
Status Information:     DISK OK - free space: /var/lib/vz 77070 MB (99% inode=99%):
Performance Data:       /var/lib/vz=179MB;;69525;0;77250
Current Attempt:        1/3  (HARD state)
Last Check Time:        01-16-2015 09:55:48
Check Type:     ACTIVE
Check Source / Reachability:    ve38-b1 / true
Check Latency / Duration:       0.000 / 0.001 seconds
Next Scheduled Active Check:    01-16-2015 09:57:48
Last State Change:      01-16-2015 08:53:32
Last Notification:      01-16-2015 08:55:32 (notification 0)
Is This Service Flapping?         NO   (0.00% state change)
In Scheduled Downtime?    NO
Last Update:    01-16-2015 09:57:37  ( 0d 0h 0m 5s ago)

------------------------------------------------------------------------

Performance data is correctly collected, etc, but only on mon1-2 in
this case. The corresponding pnp4nagios graphs on mon1-1 show a gap
for the duration of the "UNKNOWN" status.

So am I correct in assuming that while the checking itself works just
fine, there is an issue in how mon1-1 and mon1-2 share the check
results with each other?

The (debug) logs don't seem to show any anomalies on any of the nodes
involved (mon1-1, mon1-2 and the remote node).

Any pointers?

Thanks so much,

Florian





_______________________________________________
icinga-users mailing list
icinga-users@lists.icinga.org
https://lists.icinga.org/mailman/listinfo/icinga-users
