(k)rafaeldtinoco@kcluster01:~$ dpkg -l | grep "ii systemd "
ii systemd 243-3ubuntu1 amd64 system and service manager
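A quick consistency check like the one below is useful before trusting the cluster-wide results; this is a minimal sketch with the three version strings hardcoded (on a live cluster they would come from the `ssh` loop shown in this comment):

```shell
#!/bin/sh
# Hypothetical sample: the systemd version reported by each of the three
# nodes (hardcoded here so the sketch is self-contained).
versions="243-3ubuntu1
243-3ubuntu1
243-3ubuntu1"

# If every node runs the same systemd build, `sort -u` leaves one line.
n=$(printf '%s\n' "$versions" | sort -u | wc -l)
if [ "$n" -eq 1 ]; then
    echo "all nodes run the same systemd build"
else
    echo "version mismatch across nodes"
fi
```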
(k)rafaeldtinoco@kcluster01:~$ for name in kcluster01 kcluster02 kcluster03; do ssh $name "dpkg -l | grep systemd "; done | grep "ii systemd "
ii systemd 243-3ubuntu1 amd64 system and service manager
ii systemd 243-3ubuntu1 amd64 system and service manager
ii systemd 243-3ubuntu1 amd64 system and service manager

----

(k)rafaeldtinoco@kcluster01:~$ for name in kcluster01 kcluster02 kcluster03; do ssh $name "cat /etc/systemd/network/10-netplan-eth3.network"; done
[Match]
Name=eth3

[Network]
LinkLocalAddressing=ipv6
Address=10.0.3.2/24
KeepConfiguration=static

[Match]
Name=eth3

[Network]
LinkLocalAddressing=ipv6
Address=10.0.3.3/24
KeepConfiguration=static

[Match]
Name=eth3

[Network]
LinkLocalAddressing=ipv6
Address=10.0.3.4/24
KeepConfiguration=static

----

(k)rafaeldtinoco@kcluster01:~$ crm status
Stack: corosync
Current DC: kcluster01 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Tue Nov 19 16:38:15 2019
Last change: Mon Nov 18 12:41:14 2019 by root via crm_resource on kcluster01

3 nodes configured
5 resources configured

Online: [ kcluster01 kcluster02 kcluster03 ]

Full list of resources:

 fence_kcluster01 (stonith:fence_virsh): Started kcluster02
 fence_kcluster02 (stonith:fence_virsh): Started kcluster01
 fence_kcluster03 (stonith:fence_virsh): Started kcluster01
 Resource Group: webserver_virtual_ip
     webserver  (systemd:lighttpd):       Started kcluster01
     virtual_ip (ocf::heartbeat:IPaddr2): Started kcluster01

----

(k)rafaeldtinoco@kcluster01:~$ for name in kcluster01 kcluster02 kcluster03; do ssh $name "hostname ; ip addr show eth3"; done
kcluster01
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:11:a0:03 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.2/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet 10.0.3.1/24 brd 10.0.3.255 scope global secondary eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe11:a003/64 scope link
       valid_lft forever preferred_lft forever
kcluster02
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1d:1a:cc brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.3/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe1d:1acc/64 scope link
       valid_lft forever preferred_lft forever
kcluster03
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:b0:13:16 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.4/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb0:1316/64 scope link
       valid_lft forever preferred_lft forever

----

In parallel:

(k)rafaeldtinoco@kcluster01:~$ journalctl -f -u pacemaker

and check whether events are generated (the VIP monitor detects address changes).

----

(k)rafaeldtinoco@kcluster01:~$ systemctl restart systemd-networkd

----

No VIP changes:

(k)rafaeldtinoco@kcluster01:~$ for name in kcluster01 kcluster02 kcluster03; do ssh $name "hostname ; ip addr show eth3"; done
kcluster01
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:11:a0:03 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.2/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet 10.0.3.1/24 brd 10.0.3.255 scope global secondary eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe11:a003/64 scope link
       valid_lft forever preferred_lft forever
kcluster02
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1d:1a:cc brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.3/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe1d:1acc/64 scope link
       valid_lft forever preferred_lft forever
kcluster03
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:b0:13:16 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.4/24 brd 10.0.3.255 scope global eth3
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb0:1316/64 scope link
       valid_lft forever preferred_lft forever

and no events were generated!

verification-done

** Tags removed: verification-needed verification-needed-eoan
** Tags added: verification-done verification-done-eoan

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to keepalived in Ubuntu.
https://bugs.launchpad.net/bugs/1815101

Title:
  [master] Restarting systemd-networkd breaks keepalived, heartbeat,
  corosync, pacemaker (interface aliases are restarted)

Status in Keepalived Charm: New
Status in netplan: Confirmed
Status in heartbeat package in Ubuntu: Triaged
Status in keepalived package in Ubuntu: In Progress
Status in systemd package in Ubuntu: In Progress
Status in heartbeat source package in Bionic: Triaged
Status in keepalived source package in Bionic: Confirmed
Status in systemd source package in Bionic: Confirmed
Status in heartbeat source package in Disco: Triaged
Status in keepalived source package in Disco: Confirmed
Status in systemd source package in Disco: Confirmed
Status in heartbeat source package in Eoan: Triaged
Status in keepalived source package in Eoan: In Progress
Status in systemd source package in Eoan: Fix Committed

Bug description:

  [impact]

  - ALL related HA software has a problem when interfaces are managed by
    systemd-networkd: NIC restarts/reconfigurations will wipe all
    interface aliases when the HA software is not expecting it (there is
    no coordination between them).

  - keepalived, smb ctdb and pacemaker all suffer from this. Pacemaker
    is smarter in this case because it has a service monitor that will
    restart the virtual IP resource, on the affected node & NIC, before
    considering it a real failure, but other HA services might consider
    it a real failure when it is not.
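For reference, the fix surfaces as exactly the stanza shown in the 10-netplan-eth3.network files above; a minimal hand-written sketch of such a networkd unit follows (the interface name and address are illustrative placeholders, not taken from a live system):

```ini
# /etc/systemd/network/10-netplan-eth3.network (sketch)
[Match]
Name=eth3

[Network]
Address=10.0.3.2/24
# Without this, restarting systemd-networkd reconfigures the link and
# drops addresses networkd did not add itself (e.g. a cluster VIP).
# "static" asks networkd to keep static addresses and routes in place
# across restarts/reconfigurations.
KeepConfiguration=static
```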
  [test case]

  - comment #14 is a full test case: set up a 3-node pacemaker cluster,
    as in that example, and cause a networkd service restart: it will
    trigger a failure for the virtual IP resource monitor.

  - another example is given in the original description for keepalived.
    Both suffer from the same issue (and other HA software as well).

  [regression potential]

  - this backports the KeepConfiguration parameter, which adds some
    significant complexity to networkd's configuration and behavior, and
    could lead to regressions in correctly configuring the network at
    networkd start, incorrectly maintaining configuration at networkd
    restart, or losing network state at networkd stop.

  - any regressions are most likely to occur during networkd start,
    restart, or stop, and most likely to involve missing or incorrect
    ip address(es).

  - the change is based on upstream patches adding the exact feature we
    needed to fix this issue, and it will be integrated with a netplan
    change adding the needed stanza (KeepConfiguration=) to the systemd
    NIC configuration file.

  [other info]

  original description:

  ---

  Configure netplan for interfaces, for example (a working config with
  IP addresses obfuscated):

  network:
    ethernets:
      eth0:
        addresses: [192.168.0.5/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth2:
        addresses:
          - 12.13.14.18/29
          - 12.13.14.19/29
        gateway4: 12.13.14.17
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth3:
        addresses: [10.22.11.6/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth4:
        addresses: [10.22.14.6/24]
        dhcp4: false
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
      eth7:
        addresses: [9.5.17.34/29]
        dhcp4: false
        optional: true
        nameservers:
          search: [blah.com, other.blah.com, hq.blah.com, cust.blah.com, phone.blah.com]
          addresses: [10.22.11.1]
    version: 2

  Configure keepalived (again, a working config with IP addresses
  obfuscated):

  global_defs {                      # Block id
      notification_email {
          sysadm...@blah.com
      }
      notification_email_from keepali...@system3.hq.blah.com
      smtp_server 10.22.11.7         # IP
      smtp_connect_timeout 30        # integer, seconds
      router_id system3              # string identifying the machine,
                                     # (doesn't have to be hostname).
      vrrp_mcast_group4 224.0.0.18   # optional, default 224.0.0.18
      vrrp_mcast_group6 ff02::12     # optional, default ff02::12
      enable_traps                   # enable SNMP traps
  }

  vrrp_sync_group collection {
      group {
          wan
          lan
          phone
      }
  }

  vrrp_instance wan {
      state MASTER
      interface eth2
      virtual_router_id 77
      priority 150
      advert_int 1
      smtp_alert
      authentication {
          auth_type PASS
          auth_pass BlahBlah
      }
      virtual_ipaddress {
          12.13.14.20
      }
  }

  vrrp_instance lan {
      state MASTER
      interface eth3
      virtual_router_id 78
      priority 150
      advert_int 1
      smtp_alert
      authentication {
          auth_type PASS
          auth_pass MoreBlah
      }
      virtual_ipaddress {
          10.22.11.13/24
      }
  }

  vrrp_instance phone {
      state MASTER
      interface eth4
      virtual_router_id 79
      priority 150
      advert_int 1
      smtp_alert
      authentication {
          auth_type PASS
          auth_pass MostBlah
      }
      virtual_ipaddress {
          10.22.14.3/24
      }
  }

  At boot the affected interfaces have:

  5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
      inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
         valid_lft forever preferred_lft forever
      inet 10.22.14.3/24 scope global secondary eth4
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
         valid_lft forever preferred_lft forever
  7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
      inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
         valid_lft forever preferred_lft forever
      inet 10.22.11.13/24 scope global secondary eth3
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
         valid_lft forever preferred_lft forever
  9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
      inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.13.14.20/32 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
         valid_lft forever preferred_lft forever

  Run 'netplan try' (didn't even make any changes to the configuration)
  and the keepalived addresses disappear, never to return; the affected
  interfaces have:

  5: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:90:c0:e3 brd ff:ff:ff:ff:ff:ff
      inet 10.22.14.6/24 brd 10.22.14.255 scope global eth4
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:fe90:c0e3/64 scope link
         valid_lft forever preferred_lft forever
  7: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:29 brd ff:ff:ff:ff:ff:ff
      inet 10.22.11.6/24 brd 10.22.11.255 scope global eth3
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:2629/64 scope link
         valid_lft forever preferred_lft forever
  9: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
      link/ether ab:cd:ef:b0:26:2b brd ff:ff:ff:ff:ff:ff
      inet 12.13.14.18/29 brd 12.13.14.23 scope global eth2
         valid_lft forever preferred_lft forever
      inet 12.33.89.19/29 brd 12.13.14.23 scope global secondary eth2
         valid_lft forever preferred_lft forever
      inet6 fe80::ae1f:6bff:feb0:262b/64 scope link
         valid_lft forever preferred_lft forever

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-keepalived/+bug/1815101/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : ubuntu-ha@lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp
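As a footnote to the reproduction above: the addresses that vanish are the "secondary" ones. They can be filtered out of one-line `ip -o addr show` output; a minimal sketch against a hardcoded sample (the sample mirrors the kcluster01 output above; on a live system you would pipe `ip -o addr show eth3` instead):

```shell
#!/bin/sh
# Sample of `ip -o addr show eth3` output (one line per address),
# hardcoded here so the sketch is self-contained.
sample='5: eth3    inet 10.0.3.2/24 brd 10.0.3.255 scope global eth3
5: eth3    inet 10.0.3.1/24 brd 10.0.3.255 scope global secondary eth3
5: eth3    inet6 fe80::5054:ff:fe11:a003/64 scope link'

# Print only the secondary (alias) addresses -- the ones a networkd
# restart wipes when KeepConfiguration= is not set.
printf '%s\n' "$sample" | awk '/ secondary /{ print $4 }'
# prints: 10.0.3.1/24
```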