Hello,

I set up a distributed Icinga2 system with a master and multiple clients
(actually multiple Debian KVM VMs on the same host). The master has
endpoints with the addresses of all clients, while each client only knows
the master endpoint by name, like this:

--- zones.conf on master ---------------------------------------------------
object Endpoint "monitor.gnuviech.internal" {
  host = "10.0.0.25"
}

object Endpoint "ldap.gnuviech.internal" {
  host = "10.0.0.11"
}

object Endpoint "mq.gnuviech.internal" {
  host = "10.0.0.17"
}

object Zone "master" {
  endpoints = [ "monitor.gnuviech.internal" ]
}

object Zone "ldap.gnuviech.internal" {
  endpoints = [ "ldap.gnuviech.internal" ]
  parent = "master"
}

object Zone "mq.gnuviech.internal" {
  endpoints = [ "mq.gnuviech.internal" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}
----------------------------------------------------------------------------

--- zones.conf on client ldap ----------------------------------------------
object Endpoint "monitor.gnuviech.internal" {
}

object Endpoint "ldap.gnuviech.internal" {
  host = "10.0.0.11"
}

object Zone "master" {
  endpoints = [ "monitor.gnuviech.internal" ]
}

object Zone "ldap.gnuviech.internal" {
  endpoints = [ "ldap.gnuviech.internal" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}
----------------------------------------------------------------------------
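
For completeness: the api feature is enabled on the master and on every
client. A quick way to confirm that on each node looks roughly like this
(illustrative commands, output omitted):

--- verifying the api feature (sketch) ---------------------------------------
# list the features; "api" has to show up as enabled
icinga2 feature list

# inspect the ApiListener configuration shipped with the api feature
cat /etc/icinga2/features-enabled/api.conf
------------------------------------------------------------------------------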

This setup works fine initially: the master connects to all clients as
expected and service checks are executed successfully.

I maintain the /etc/icinga2/zones.d directory in a git repository, and after
fetching new configuration I reload the Icinga2 master. Unfortunately, this
seems to break the cluster communication. I have a service check

--- cluster service check --------------------------------------------------
object Service "cluster" {
  check_command = "cluster"
  check_interval = 5s
  retry_interval = 1s

  host_name = "monitor.gnuviech.internal"
}
----------------------------------------------------------------------------

that becomes critical after the reload. The number of disconnected clients
varies from reload to reload. The only way to recover is to stop the
Icinga2 master, wait a few seconds and start it again; systemctl
restart icinga2 is not sufficient.
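
For reference, my deploy and reload procedure is roughly the following
sketch (the repository path and the validation step are illustrative, the
reload/stop/start commands are the ones I actually use):

--- config deploy and reload (sketch) ----------------------------------------
# pull the new zone configuration (repository path is illustrative)
git -C /etc/icinga2/zones.d pull

# validate the configuration before reloading
icinga2 daemon -C

# reload the running master (this is the step after which the cluster breaks)
systemctl reload icinga2

# current workaround: full stop, a short wait, then start again
systemctl stop icinga2
sleep 10    # wait a few seconds
systemctl start icinga2
------------------------------------------------------------------------------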

The master log has entries like:

[2017-07-13 12:13:17 +0200] information/JsonRpcConnection: Reconnecting to API endpoint 'mq.gnuviech.internal' via host '10.0.0.17' and port '5665'
[2017-07-13 12:13:17 +0200] information/JsonRpcConnection: Reconnecting to API endpoint 'ldap.gnuviech.internal' via host '10.0.0.11' and port '5665'
[2017-07-13 12:13:17 +0200] critical/TcpSocket: Invalid socket: Connection refused
[2017-07-13 12:13:17 +0200] critical/TcpSocket: Invalid socket: Connection refused

which seems strange to me because the icinga2 processes on those endpoints
are not restarted or changed during the reload. I would expect the master to
simply reconnect to them, continue with the config synchronisation (for
global-templates) and then resume sending check execution commands.
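
In case it helps with narrowing this down, these are the kinds of checks I
can run on an affected client (and from the master) while the master logs
"Connection refused"; the commands are illustrative:

--- diagnostics while the problem occurs (sketch) ----------------------------
# on the client: is the daemon still running and listening on port 5665?
systemctl status icinga2
ss -tlnp | grep 5665

# on the client: anything suspicious in the log around the reload?
journalctl -u icinga2 --since "10 minutes ago"

# on the master: can a TLS connection to the client be established at all?
openssl s_client -connect 10.0.0.11:5665
------------------------------------------------------------------------------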

Do you have any idea what might be wrong with my setup? Did I encounter a
bug, or is this a common misconfiguration? Why would the master get a
"Connection refused" response?


Best regards
Jan Dittberner

-- 
Jan Dittberner - Debian Developer
GPG-key: 4096R/0xA73E0055558FB8DD 2009-05-10
         B2FF 1D95 CE8F 7A22 DF4C  F09B A73E 0055 558F B8DD
https://jan.dittberner.info/
