[Expired for neutron because there has been no activity for 60 days.]

** Changed in: neutron
       Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2077533

Title:
  An error in processing one DVR router can lead to connectivity issues
  for other routers

Status in neutron:
  Expired

Bug description:
  I investigated the customer's issue and concluded that this code:
  https://opendev.org/openstack/neutron/src/commit/0807c94dc9843fff318c21d1f6f7b8838f948f5f/neutron/agent/l3/dvr_fip_ns.py#L155-L160
  which deletes the fip-namespace during router processing, leads to
  connectivity problems for other routers.

  This deletion of the fip-namespace also removes the veth pairs rfp/fpr
  for other routers. However, the reprocessing of those other routers
  does not occur. As a result, all other routers, except the one that
  triggered the deletion of the fip-namespace, are left without the
  rfp/fpr veth pair.

  The issue might be difficult to trigger, so I'll demonstrate it with a
  small hack:

  --- a/neutron/agent/l3/dvr_fip_ns.py
  +++ b/neutron/agent/l3/dvr_fip_ns.py
  @@ -151,6 +151,11 @@ class FipNamespace(namespaces.Namespace):
               try:
                   self._update_gateway_port(
                       agent_gateway_port, interface_name)
  +                if getattr(self, 'test_fail', False):
  +                    self.test_fail = False
  +                    raise Exception('Test Fail')
  +                else:
  +                    self.test_fail = True
               except Exception:
                   # If an exception occurs at this point, then it is
                   # good to clean up the namespace that has been created

  1) I create two routers with the same external network:

  [root@devstack0 ~]# openstack router create r1 --external-gateway public -c id
  +-------+--------------------------------------+
  | Field | Value                                |
  +-------+--------------------------------------+
  | id    | 25085e63-45a6-4795-93dc-77cb245664d7 |
  +-------+--------------------------------------+
  [root@devstack0 ~]# openstack router create r2 --external-gateway public -c id
  +-------+--------------------------------------+
  | Field | Value                                |
  +-------+--------------------------------------+
  | id    | 3805cd53-5fed-4fa3-9147-f396761fc9cd |
  +-------+--------------------------------------+

  [root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
  <cut>
  2: fpr-25085e63-4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
      link/ether 8e:e9:5e:65:9c:ad brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
      inet 169.254.120.3/31 scope global fpr-25085e63-4
         valid_lft forever preferred_lft forever
      inet6 fe80::8ce9:5eff:fe65:9cad/64 scope link
         valid_lft forever preferred_lft forever
  3: fpr-3805cd53-5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
      link/ether 12:e1:bf:02:98:e0 brd ff:ff:ff:ff:ff:ff link-netns qrouter-3805cd53-5fed-4fa3-9147-f396761fc9cd
      inet 169.254.77.247/31 scope global fpr-3805cd53-5
         valid_lft forever preferred_lft forever
      inet6 fe80::10e1:bfff:fe02:98e0/64 scope link
         valid_lft forever preferred_lft forever
  68: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
      inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
         valid_lft forever preferred_lft forever
      inet6 fe80::f816:3eff:fee4:eae1/64 scope link
         valid_lft forever preferred_lft forever
  [root@devstack0 ~]#

  2) I trigger an update of router r1 with a failure (see hack), which
  leads to the deletion of the fip-namespace and reprocessing of this
  router. Updating r1 causes the loss of the veth rfp/fpr pair for router
  r2, thus breaking router r2.

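  The failure mode described above can be illustrated with a small toy
  model. This is only a sketch and not Neutron code: ToyFipNamespace,
  process_router and update_with_resync are hypothetical names, and the
  namespace is reduced to a set of router ids that currently have an
  rfp/fpr pair. It assumes, as the report states, that only the router
  which hit the error is reprocessed after the namespace is deleted.

```python
class ToyFipNamespace:
    """Toy stand-in for the shared fip namespace (hypothetical, not Neutron)."""

    def __init__(self):
        # Router ids that currently have an rfp/fpr veth pair in the namespace.
        self.veth_pairs = set()

    def delete(self):
        # Deleting the namespace destroys every device inside it,
        # including the veth pairs that belong to other routers.
        self.veth_pairs.clear()


def process_router(fip_ns, router_id, fail=False):
    """Hypothetical processing step: plug the router into the namespace."""
    try:
        if fail:
            raise RuntimeError("gateway port update failed")
        fip_ns.veth_pairs.add(router_id)
    except RuntimeError:
        # Mirrors the cleanup in the report: drop the whole namespace,
        # then re-raise so the caller resyncs this router.
        fip_ns.delete()
        raise


def update_with_resync(fip_ns, router_id, fail_first):
    """Assumed resync behaviour: only the failing router is reprocessed."""
    try:
        process_router(fip_ns, router_id, fail=fail_first)
    except RuntimeError:
        process_router(fip_ns, router_id)


fip_ns = ToyFipNamespace()
process_router(fip_ns, "r1")
process_router(fip_ns, "r2")

# Updating r1 fails once, the namespace is deleted, and only r1 is rebuilt.
update_with_resync(fip_ns, "r1", fail_first=True)

missing = {"r1", "r2"} - fip_ns.veth_pairs
print(sorted(missing))  # ['r2']
```

  In this model r2 stays broken until something else causes it to be
  processed again, which matches the output shown in step 2.
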
  [root@devstack0 ~]# openstack router set r1 --name r1-updated
  [root@devstack0 ~]# ip netns exec fip-dad747c6-c234-41e3-ae27-c9602b81fbd2 ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  2: fpr-25085e63-4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
      link/ether fa:68:ef:86:96:a5 brd ff:ff:ff:ff:ff:ff link-netns qrouter-25085e63-45a6-4795-93dc-77cb245664d7
      inet 169.254.120.3/31 scope global fpr-25085e63-4
         valid_lft forever preferred_lft forever
      inet6 fe80::f868:efff:fe86:96a5/64 scope link
         valid_lft forever preferred_lft forever
  71: fg-21441edb-3f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether fa:16:3e:e4:ea:e1 brd ff:ff:ff:ff:ff:ff
      inet 10.20.30.95/24 brd 10.20.30.255 scope global fg-21441edb-3f
         valid_lft forever preferred_lft forever
      inet6 fe80::f816:3eff:fee4:eae1/64 scope link
         valid_lft forever preferred_lft forever
  [root@devstack0 ~]#

  P.S. I investigated a customer issue where they reported internet
  connectivity loss through their routers. In short, the trigger was a
  bug I recently filed: https://bugs.launchpad.net/neutron/+bug/2077532,
  where the existence of two floatingip_agent_gateways ports led to an
  error in _update_gateway_port, which subsequently caused the deletion
  of veth pairs from all routers depending on the order in which the
  ports were returned.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2077533/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp