[Yahoo-eng-team] [Bug 1957794] [NEW] qrouter ns leak while last service port delete because of router gw port
Public bug reported:

When the last port from a subnet is removed on a compute host with DVR,
the L3 agent cleans up the now-unneeded qrouter-* namespaces. But when a
different VM (even one belonging to another user) on the same host has a
port on the subnet that the router uses as its gateway, the deletion of
the qrouter namespace is not triggered.

Scenario to reproduce:

A two-node multinode devstack (master); no DHCP agent (for simplicity);
devstack default DVR router preconfiguration (public net as the default
GW, private net as a subnet); two nodes:
- devstack1 - dvr_snat node
- devstack2 - dvr node

1) create a VM on the private network on the devstack2 node as the demo user:

(demo)$ openstack server create --net private --flavor cirros256 --image cirros-0.5.2-x86_64-disk test_private
(demo)$ openstack server show test_private -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 7e5bebfd-636d-4416-b2ce-7f16a7b720ca |
+-------+--------------------------------------+
(demo)$ openstack port list --device-id 7e5bebfd-636d-4416-b2ce-7f16a7b720ca -c id
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| d359efe3-8075-483a-90ee-807595d8786a |
+--------------------------------------+

There is a proper tap interface and the L3 agent creates the qrouter-* namespace:

stack@devstack2:~/$ sudo ip netns | grep qr
qrouter-0a5fc7cf-0ed9-4fb9-921b-4ed95ef3924b (id: 0)
stack@devstack2:~/$ ip a | grep d359
28: tapd359efe3-80: mtu 1450 qdisc fq_codel master ovs-system state UNKNOWN group default qlen 1000
stack@devstack2:~$ sudo ovs-vsctl get port tapd359efe3-80 tag
4
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
name             tag
---------------- ---
qr-c3ae7e60-aa   4
qr-7f7c0893-f7   4
tapd359efe3-80   4

2) create a VM on the public network on the devstack2 node as the admin user:

(admin)$ openstack server create --net public --flavor cirros256 --image cirros-0.5.2-x86_64-disk test_public
(admin)$ openstack server show test_public -c OS-EXT-SRV-ATTR:host -c id -c OS-EXT-STS:power_state -c OS-EXT-STS:vm_state
+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| OS-EXT-SRV-ATTR:host   | devstack2                            |
| OS-EXT-STS:power_state | Running                              |
| OS-EXT-STS:vm_state    | active                               |
| id                     | 0622fd62-bb3e-4d36-bbcd-d0c8f8b14cc9 |
+------------------------+--------------------------------------+
(admin)$ openstack port list --device-id 0622fd62-bb3e-4d36-bbcd-d0c8f8b14cc9 -c id
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| dc822c75-715e-4788-9589-3fff05ccc307 |
+--------------------------------------+

stack@devstack2:~$ ip a | grep dc8
14: tapdc822c75-71: mtu 1500 qdisc fq_codel master ovs-system state UNKNOWN group default qlen 1000

3) delete the demo user's test_private VM:

(demo)$ openstack server delete test_private

The VM is deleted but the qrouter-* namespace stays. Only one VM (the admin's) still exists:

stack@devstack2:~$ sudo virsh list --all
 Id   Name             State
---------------------------------
 2    instance-0007    running

stack@devstack2:~$ sudo ip netns | grep qr
qrouter-0a5fc7cf-0ed9-4fb9-921b-4ed95ef3924b (id: 0)
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
name             tag
---------------- ---
qr-c3ae7e60-aa   4
qr-7f7c0893-f7   4

To clear this namespace you need a full L3 agent resync, either by restarting the agent or by disabling and re-enabling it:

(admin)$ openstack network agent list --host devstack2 --agent-type l3 -c ID -c Host
+--------------------------------------+-----------+
| ID                                   | Host      |
+--------------------------------------+-----------+
| 77b01aa0-de3b-4b6b-a40a-08031460a97f | devstack2 |
+--------------------------------------+-----------+
(admin)$ openstack network agent set --disable 77b01aa0-de3b-4b6b-a40a-08031460a97f
(admin)$ openstack network agent set --enable 77b01aa0-de3b-4b6b-a40a-08031460a97f

and the qrouter-* namespace disappears:

stack@devstack2:~$ sudo ip netns | grep qr
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
stack@devstack2:~$
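The title points at the router gateway port as the culprit: the admin
VM's port lives on the public network, which is also the router's
gateway subnet, so it still counts as a reason to keep the router
hosted on devstack2. A minimal sketch of how such a check can go
wrong, with hypothetical names (an illustration, not the actual
Neutron scheduler code):

    # Hypothetical illustration of the faulty decision made when the
    # last port on a host is deleted. The set of "subnets that pin the
    # router to this host" wrongly includes the external gateway
    # subnet, so any unrelated VM port on the public network keeps the
    # qrouter namespace alive.
    def can_remove_qrouter_ns(router, remaining_host_ports):
        pinning_subnets = set(router.internal_subnet_ids)
        pinning_subnets.add(router.gateway_subnet_id)  # the suspected bug
        return not any(port.subnet_id in pinning_subnets
                       for port in remaining_host_ports)

Excluding the gateway subnet from that set would let the namespace be
removed as soon as the last real service port on the host is gone.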
** Affects: neutron
   Importance: Undecided
   Assignee: Krzysztof Tomaszewski (labedz)
   Status: New

** Tags: l3-dvr-backlog

** Changed in: neutron
   Assignee: (unassigned) => Krzysztof Tomaszewski (labedz)

https://bugs.launchpad.net/bugs/1957794
[Yahoo-eng-team] [Bug 1959151] [NEW] Don't set HA ports down while L3 agent restart.
Public bug reported:

Because of the fix for bug #1597461 [1], the L3 agent puts all of its HA
ports DOWN during its initialization phase. Unfortunately, such an
operation can break already-working L3 communication when you restart
the agent service (rewiring a port from the DOWN state back to UP can
take a few seconds, and some VRRP packets could be lost, so an HA router
state change may be triggered).

This is an effect of calling:

    self.plugin_rpc.update_all_ha_network_port_statuses

in neutron/agent/l3/agent.py#L393 during the L3 agent initialization
phase, in _check_ha_router_process_status.

Restarting the agent process should not affect an already-working
configuration (customer traffic). A possible workaround would be to put
the HA ports into the DOWN state only on host restart, and not on every
L3 agent restart; a sketch of this idea follows below.

[1] https://bugs.launchpad.net/neutron/+bug/1597461

** Affects: neutron
   Importance: Undecided
   Assignee: Krzysztof Tomaszewski (labedz)
   Status: In Progress

** Tags: l3-dvr-backlog

https://bugs.launchpad.net/bugs/1959151
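A minimal sketch of that workaround idea, assuming a hypothetical boot
marker kept on tmpfs (the boot-detection mechanism is an assumption on
my side; update_all_ha_network_port_statuses is the RPC named above):

    import os

    # Hypothetical marker on tmpfs: it vanishes on host reboot but
    # survives a plain agent-service restart.
    BOOT_MARKER = '/run/neutron/l3-agent-ha-ports-reset'

    def maybe_reset_ha_port_statuses(plugin_rpc, context):
        if os.path.exists(BOOT_MARKER):
            # Agent restart only: leave the HA ports UP so VRRP keeps
            # flowing and no failover is triggered.
            return
        # First start since host boot: the safety measure from bug
        # #1597461 still applies, so force the ports DOWN once.
        plugin_rpc.update_all_ha_network_port_statuses(context)
        os.makedirs(os.path.dirname(BOOT_MARKER), exist_ok=True)
        open(BOOT_MARKER, 'w').close()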
[Yahoo-eng-team] [Bug 1970216] [NEW] not sending Nova notification when using neutron API on mod_wsgi
Public bug reported:

When running the Neutron API server under Apache mod_wsgi, notifications
to Nova are not always sent and get lost.

How to reproduce: Ubuntu 20.04; upstream devstack master configured with
Apache mod_wsgi [1]

stack@devstack11:~$ openstack server create --availability-zone nova:devstack11 --net public --key-name test --flavor ds512M --image ubuntu t1
stack@devstack11:~$ openstack port create --net private test_port; openstack server add port t1 test_port

The VM gets its port properly:

stack@devstack11:~/devstack$ openstack port show test_port -c id -c device_id
+-----------+--------------------------------------+
| Field     | Value                                |
+-----------+--------------------------------------+
| device_id | 59d7e21b-7d43-4ee6-a82b-e43d12c6e9ae |
| id        | 4ed67e46-535e-4a51-af95-f46497ecfdb4 |
+-----------+--------------------------------------+

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a

but after deleting the port through the Neutron API:

stack@devstack11:~/neutron$ openstack port delete test_port

libvirt still shows the interface attached:

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a

while on the Neutron side the port is gone:

stack@devstack11:~/devstack$ openstack port show test_port -c id -c device_id
No Port found for test_port

After a few tries you can end up with:

stack@devstack11:~/neutron$ openstack port show test_port -c id
No Port found for test_port

and

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a
 tap2b20446c-3d   ethernet   -        virtio   fa:16:3e:22:6c:19
 tapea396111-d3   ethernet   -        virtio   fa:16:3e:c3:38:4a

[1] https://docs.openstack.org/neutron/yoga/admin/config-wsgi.html#neutron-api-behind-mod-wsgi

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1970216
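A plausible explanation (an assumption on my side, not confirmed in
this report): Neutron batches Nova notifications in memory and flushes
them from a background thread of a long-running process, a pattern
that breaks when mod_wsgi recycles or suspends its worker processes.
A simplified sketch of that pattern and its failure mode, with
hypothetical names:

    import threading

    class BatchNotifier:
        """Queue events in memory and flush them after a short delay
        (simplified version of the batching pattern; not Neutron's
        actual implementation)."""

        def __init__(self, batch_interval, send_callback):
            self.pending = []
            self.batch_interval = batch_interval
            self.send_callback = send_callback
            self._lock = threading.Lock()

        def queue_event(self, event):
            with self._lock:
                self.pending.append(event)
                arm_timer = len(self.pending) == 1
            if arm_timer:
                # The first queued event arms a delayed flush. If
                # mod_wsgi recycles or suspends this worker process
                # before the timer fires, the queued events are
                # silently lost and Nova never hears about the port
                # change.
                timer = threading.Timer(self.batch_interval, self._flush)
                timer.daemon = True
                timer.start()

        def _flush(self):
            with self._lock:
                events, self.pending = self.pending, []
            if events:
                self.send_callback(events)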
[Yahoo-eng-team] [Bug 1957794] Re: qrouter ns leak while last service port delete because of router gw port
** Changed in: neutron
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1957794
Title: qrouter ns leak while last service port delete because of router gw port
Status in neutron: Fix Released
[Yahoo-eng-team] [Bug 1991817] [NEW] OVN metadata agent liveness system generate OVN SBDB usage peak
Public bug reported:

On larger-scale deployments (150+ compute hosts), the
neutron-ovn-metadata-agent liveness system generates a CPU usage peak on
the OVN Southbound DB every fixed period of time (agent_down_time / 2).
This CPU saturation can last dozens of seconds and introduces
significant latency in OVN service responses.

The problem is that every neutron-ovn-metadata-agent responds instantly
to an event on the SB_Global table and updates the external_ids property
of its corresponding Chassis/Chassis_Private row. That generates a flood
of OVN SBDB updates.

A similar issue can be observed with other Neutron agents that use
oslo.messaging to deliver their heartbeats (like the Neutron OVS agent),
but in those cases the load generated by the liveness system gets
distributed in time simply by the agents' differing execution times. The
neutron-ovn-metadata-agent heartbeat does not depend on the agent's
execution time but is triggered by a global OVN event.

A solution could be to spread the neutron-ovn-metadata-agent heartbeat
updates in time by postponing each agent's answer by a randomized delay
(where the delay range does not exceed the agent_down_time / 2
parameter); a sketch of this idea follows below.

** Affects: neutron
   Importance: Undecided
   Status: New

** Tags: ovn

https://bugs.launchpad.net/bugs/1991817
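A minimal sketch of that jitter idea, with hypothetical handler names
(the real agent reacts through an OVSDB event class and writes
external_ids in an OVSDB transaction):

    import random
    import threading

    def on_sb_global_updated(chassis_private, agent_down_time):
        """React to the periodic SB_Global update by writing the
        heartbeat after a random delay instead of immediately."""
        # Spread the writes across at most agent_down_time / 2 so every
        # heartbeat still lands before the agent could be declared
        # dead, while the SBDB no longer receives all updates at once.
        delay = random.uniform(0, agent_down_time / 2.0)
        threading.Timer(delay, write_heartbeat,
                        args=(chassis_private,)).start()

    def write_heartbeat(chassis_private):
        # Placeholder for the external_ids update on the
        # Chassis_Private row (an OVSDB transaction in the real agent).
        pass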