[Yahoo-eng-team] [Bug 1957794] [NEW] qrouter ns leak while last service port delete because of router gw port

2022-01-13 Thread Krzysztof Tomaszewski
Public bug reported:

When the last port from a given subnet is removed on a compute host
running DVR, the L3 agent cleans up the no-longer-needed qrouter-*
namespaces.

However, if another VM on the same host (even one belonging to a
different user) has a port on the subnet that the router uses for its
gateway, the deletion of the qrouter namespace is not triggered.
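
A minimal sketch of the kind of per-host check involved, to illustrate
the failure mode (this is not the actual neutron code; all names and
helpers here are hypothetical):

def qrouter_still_needed(router, host_ports):
    """Return True if any port on this host still needs the namespace."""
    # Subnets the router is attached to via its internal (qr-) ports.
    router_subnets = {fixed_ip['subnet_id']
                      for port in router.internal_ports
                      for fixed_ip in port['fixed_ips']}
    # Suspected bug: the external gateway port's subnet is also counted,
    # so a VM plugged directly into the external network keeps the
    # namespace alive after the last internal-subnet port is deleted.
    if router.gw_port:
        router_subnets.update(fixed_ip['subnet_id']
                              for fixed_ip in router.gw_port['fixed_ips'])
    return any(fixed_ip['subnet_id'] in router_subnets
               for port in host_ports
               for fixed_ip in port['fixed_ips'])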

Scenario to reproduce:

A two-node multinode devstack (master); no DHCP agent (for simplicity);
the default devstack DVR router preconfiguration (public net as the
default GW, private net as the internal subnet);
two nodes:
 - devstack1 - dvr_snat node,
 - devstack2 - dvr node

1) create a VM on the private network on node devstack2 as the demo user:

(demo)$ openstack server create --net private --flavor cirros256 --image cirros-0.5.2-x86_64-disk test_private
(demo)$ openstack server show test_private -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 7e5bebfd-636d-4416-b2ce-7f16a7b720ca |
+-------+--------------------------------------+
(demo)$ openstack port list --device-id 7e5bebfd-636d-4416-b2ce-7f16a7b720ca -c id
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| d359efe3-8075-483a-90ee-807595d8786a |
+--------------------------------------+

The tap interface exists and the L3 agent has created the qrouter-* namespace:

stack@devstack2:~/$ sudo ip netns | grep qr
qrouter-0a5fc7cf-0ed9-4fb9-921b-4ed95ef3924b (id: 0)
stack@devstack2:~/$ ip a | grep d359
28: tapd359efe3-80:  mtu 1450 qdisc fq_codel master ovs-system state UNKNOWN group default qlen 1000
stack@devstack2:~$ sudo ovs-vsctl get port tapd359efe3-80 tag
4
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
name           tag
-------------- ---
qr-c3ae7e60-aa 4
qr-7f7c0893-f7 4
tapd359efe3-80 4

2) create a VM on the public network on node devstack2 as the admin user:

(admin)$ openstack server create --net public --flavor cirros256 --image cirros-0.5.2-x86_64-disk test_public
(admin)$ openstack server show test_public -c OS-EXT-SRV-ATTR:host -c id -c OS-EXT-STS:power_state -c OS-EXT-STS:vm_state
+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| OS-EXT-SRV-ATTR:host   | devstack2                            |
| OS-EXT-STS:power_state | Running                              |
| OS-EXT-STS:vm_state    | active                               |
| id                     | 0622fd62-bb3e-4d36-bbcd-d0c8f8b14cc9 |
+------------------------+--------------------------------------+
(admin)$ openstack port list --device-id 0622fd62-bb3e-4d36-bbcd-d0c8f8b14cc9 -c id
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| dc822c75-715e-4788-9589-3fff05ccc307 |
+--------------------------------------+

stack@devstack2:~$ ip a | grep dc8
14: tapdc822c75-71:  mtu 1500 qdisc fq_codel master ovs-system state UNKNOWN group default qlen 1000

3) delete the demo user's test_private VM

(demo)$ openstack server delete test_private

The VM is deleted but the qrouter-* namespace stays.

Only one VM (the admin's) still exists:
stack@devstack2:~$ sudo virsh list --all
 Id   Name            State
--------------------------------
 2    instance-0007   running

stack@devstack2:~$ sudo ip netns | grep qr
qrouter-0a5fc7cf-0ed9-4fb9-921b-4ed95ef3924b (id: 0)
stack@devstack2:~$
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
name           tag
-------------- ---
qr-c3ae7e60-aa 4
qr-7f7c0893-f7 4

To clear this namespace you need a full resync of the L3 agent, either
by restarting the agent or by disabling and re-enabling it:

(admin)$ openstack network agent list --host devstack2 --agent-type l3 -c ID -c Host
+--------------------------------------+-----------+
| ID                                   | Host      |
+--------------------------------------+-----------+
| 77b01aa0-de3b-4b6b-a40a-08031460a97f | devstack2 |
+--------------------------------------+-----------+

(admin)$ openstack network agent set --disable 77b01aa0-de3b-4b6b-a40a-08031460a97f
(admin)$ openstack network agent set --enable 77b01aa0-de3b-4b6b-a40a-08031460a97f

and the qrouter-* namespace disappears:

stack@devstack2:~$ sudo ip netns | grep qr
stack@devstack2:~$ sudo ovs-vsctl --format=table --columns=name,tag find port tag=4
name tag
---- ---
stack@devstack2:~$
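
The same disable/enable resync can be scripted; here is a sketch using
openstacksdk (assuming a configured clouds.yaml; the cloud name below
is hypothetical):

import time

import openstack

conn = openstack.connect(cloud='devstack-admin')

for agent in conn.network.agents(host='devstack2', agent_type='L3 agent'):
    # is_admin_state_up maps to the admin_state_up field, i.e. the same
    # thing 'openstack network agent set --disable/--enable' toggles.
    conn.network.update_agent(agent, is_admin_state_up=False)
    time.sleep(5)  # give the server a moment to record the state change
    conn.network.update_agent(agent, is_admin_state_up=True)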

** Affects: neutron
 Importance: Undecided
 Assignee: Krzysztof Tomaszewski (labedz)
 Status: New


** Tags: l3-dvr-backlog

** Changed in: neutron
 Assignee: (unassigned) => Krzysztof Tomaszewski (labedz)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1957794

Title:
  qrouter ns leak while last service port delete because of router gw
  port

Status in neutron:
  New

[Yahoo-eng-team] [Bug 1959151] [NEW] Don't set HA ports down while L3 agent restart.

2022-01-26 Thread Krzysztof Tomaszewski
Public bug reported:

Because of the fix for bug #1597461 [1], the L3 agent puts all of its
HA ports down during its initialization phase. Unfortunately, such an
operation can break already-working L3 communication when you restart
the agent service (rewiring a port from the DOWN state back to UP can
take a few seconds, and some VRRP packets could be lost in the
meantime, so a router HA state change may be triggered).

This is an effect of calling
self.plugin_rpc.update_all_ha_network_port_statuses
in _check_ha_router_process_status (neutron/agent/l3/agent.py#L393)
during the L3 agent initialization phase.

Restarting the agent process should not affect an already-working
configuration (customer traffic).

A possible workaround would be to put HA ports into the DOWN state only
on host restart, not on every L3 agent restart; see the sketch after
the reference below.

[1] https://bugs.launchpad.net/neutron/+bug/1597461
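
A sketch of that workaround idea, making no assumption about the
eventual patch: detect a host reboot by persisting the kernel boot_id
between agent runs, and only force the HA ports down in that case (the
state file path is illustrative):

import os

STATE_FILE = '/var/lib/neutron/l3_agent_boot_id'  # hypothetical path

def host_rebooted_since_last_run():
    # /proc/sys/kernel/random/boot_id is a UUID regenerated on every boot.
    with open('/proc/sys/kernel/random/boot_id') as f:
        boot_id = f.read().strip()
    previous = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            previous = f.read().strip()
    with open(STATE_FILE, 'w') as f:
        f.write(boot_id)
    return boot_id != previous

# During agent initialization, only reset HA port statuses after a
# genuine host reboot:
#     if host_rebooted_since_last_run():
#         self.plugin_rpc.update_all_ha_network_port_statuses(context)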

** Affects: neutron
 Importance: Undecided
 Assignee: Krzysztof Tomaszewski (labedz)
 Status: In Progress


** Tags: l3-dvr-backlog

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1959151

Title:
  Don't set HA ports down while L3 agent restart.

Status in neutron:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1959151/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1970216] [NEW] not sending Nova notification when using neutron API on mod_wsgi

2022-04-25 Thread Krzysztof Tomaszewski
Public bug reported:

When running the Neutron API server under Apache mod_wsgi,
notifications to Nova are not always sent and can be lost.

How to reproduce: Ubuntu 20.04; upstream devstack master configured
with Apache mod_wsgi [1]

stack@devstack11:~$ openstack server create --availability-zone nova:devstack11 --net public --key-name test --flavor ds512M --image ubuntu t1
stack@devstack11:~$ openstack port create --net private test_port; openstack server add port t1 test_port

The VM gets its port properly:

stack@devstack11:~/devstack$ openstack port show test_port -c id -c device_id
+-----------+--------------------------------------+
| Field     | Value                                |
+-----------+--------------------------------------+
| device_id | 59d7e21b-7d43-4ee6-a82b-e43d12c6e9ae |
| id        | 4ed67e46-535e-4a51-af95-f46497ecfdb4 |
+-----------+--------------------------------------+

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
----------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a

but when deleting the port via the neutron API:

stack@devstack11:~/neutron$ openstack port delete test_port

on the libvirt side the interface is still attached:

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
----------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a

and on the neutron side:
stack@devstack11:~/devstack$ openstack port show test_port -c id -c device_id
No Port found for test_port

after a few tries you can end up with:

stack@devstack11:~/neutron$ openstack port show test_port -c id
No Port found for test_port

and

stack@devstack11:/usr/lib/cgi-bin/neutron$ sudo virsh domiflist instance-0015
 Interface        Type       Source   Model    MAC
----------------------------------------------------------------
 tapbef7c488-e4   ethernet   -        virtio   fa:16:3e:d1:23:72
 tap4ed67e46-53   ethernet   -        virtio   fa:16:3e:60:3f:1a
 tap2b20446c-3d   ethernet   -        virtio   fa:16:3e:22:6c:19
 tapea396111-d3   ethernet   -        virtio   fa:16:3e:c3:38:4a

[1] https://docs.openstack.org/neutron/yoga/admin/config-wsgi.html#neutron-api-behind-mod-wsgi
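
One plausible mechanism for such a loss, purely as an illustration
(this is not neutron's notifier code): if Nova notifications are
batched in-process and flushed later from a background thread, a
mod_wsgi worker recycled before the flush takes its queued events with
it.

import threading

class BatchNotifier:
    """Queue events and flush them later from a timer thread."""

    def __init__(self, flush_interval, send_fn):
        self._pending = []
        self._lock = threading.Lock()
        self._flush_interval = flush_interval
        self._send_fn = send_fn
        self._timer = None

    def queue_event(self, event):
        with self._lock:
            self._pending.append(event)
            if self._timer is None:
                # If the WSGI process exits before this timer fires, the
                # pending events die with it and Nova never hears about
                # the port change.
                self._timer = threading.Timer(self._flush_interval,
                                              self._flush)
                self._timer.daemon = True
                self._timer.start()

    def _flush(self):
        with self._lock:
            events, self._pending = self._pending, []
            self._timer = None
        if events:
            self._send_fn(events)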

** Affects: neutron
 Importance: Undecided
 Status: New


[Yahoo-eng-team] [Bug 1957794] Re: qrouter ns leak while last service port delete because of router gw port

2022-08-05 Thread Krzysztof Tomaszewski
** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1957794

Title:
  qrouter ns leak while last service port delete because of router gw
  port

Status in neutron:
  Fix Released


[Yahoo-eng-team] [Bug 1991817] [NEW] OVN metadata agent liveness system generate OVN SBDB usage peak

2022-10-05 Thread Krzysztof Tomaszewski
Public bug reported:

On larger-scale deployments (150+ compute hosts) the
neutron-ovn-metadata-agent liveness system generates a CPU usage peak
on the OVN Southbound DB every agent_down_time / 2 interval. This CPU
saturation can last for dozens of seconds and introduces significant
latency in OVN service responses.

The problem is that every neutron-ovn-metadata-agent responds instantly
to the event on the SB_Global table and updates the external_ids
property of its corresponding Chassis/Chassis_Private record.
That generates a flood of OVN SBDB updates.

A similar issue can be observed with other neutron agents that use the
oslo.messaging system to deliver their heartbeats (like the neutron OVS
agent), but in those cases the load generated by the liveness system is
spread out in time simply by the agents' differing execution times.

The neutron-ovn-metadata-agent heartbeat does not depend on the agent's
own execution timing; it is triggered by a global OVN event.

A solution could be to distribute the neutron-ovn-metadata-agent
heartbeat updates in time by postponing each agent's answer for a
randomized period (where the delay range does not exceed the
agent_down_time / 2 parameter), as sketched below.
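
A sketch of that idea (illustrative only; the real agent reacts to
OVSDB events rather than calling a helper like this directly):

import random
import threading

def schedule_heartbeat(update_chassis_external_ids, agent_down_time):
    """Delay this agent's heartbeat update by a random offset."""
    # Bounded by agent_down_time / 2, so the update still lands inside
    # the liveness window and the agent is never marked dead.
    delay = random.uniform(0, agent_down_time / 2.0)
    timer = threading.Timer(delay, update_chassis_external_ids)
    timer.daemon = True
    timer.start()
    return timer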

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1991817

Title:
  OVN metadata agent liveness system generate OVN SBDB usage peak

Status in neutron:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1991817/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp