Public bug reported:

In the following scenario (especially at large scale, when many
ovs-agents are restarted at the same time), OpenFlow flows on br-tun go
missing and are never recovered automatically.

As a simple example, restart two ovs-agents at the same time:
```
network.local_ip=30.0.1.6,output="vxlan-1e000106"
compute1.local_ip=30.0.1.7,output="vxlan-1e000107"
compute2.local_ip=30.0.1.8,output="vxlan-1e000108"

network.port=('192.168.1.2')
compute1.port=('192.168.1.11')
compute2.port=('192.168.1.141')


// iter_num=0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: True update_port_up

// rpc-1
Notify l2population agent compute1 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_host

// rpc-2
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_fanout

// iter_num>0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
2022-06-09 17:45:39.546 833566 DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: False update_port_up

...


// iter_num=0 of compute2
DEBUG neutron.plugins.ml2.db [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] For port ccca9701-19c0-4590-92d0-5fbd909d4eeb, host compute2, got binding levels [PortBindingLevel(driver='openvswitch',host='compute2',level=0,port_id=ccca9701-19c0-4590-92d0-5fbd909d4eeb,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] host: compute2, agent_active_ports: 3, refresh_tunnels: True update_port_up

// rpc-3
Notify l2population agent compute2 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_host

// rpc-4
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_fanout

```
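
For reference, the tunnel port names in the listing look like the
network type followed by the remote local_ip written as 8 hex digits
(30.0.1.6 -> vxlan-1e000106). The sketch below is only an illustration
built on that observation, not Neutron code: tunnel_port_name and
expected_outputs are hypothetical helpers, and the PortInfo objects in
the payload are replaced by plain tuples. It derives the outputs that
compute1 should end up with in table=22 after processing rpc-1.

```python
import ipaddress


def tunnel_port_name(network_type, remote_ip):
    """Build a name such as 'vxlan-1e000106' from '30.0.1.6' (hypothetical helper)."""
    ip_hex = format(int(ipaddress.IPv4Address(remote_ip)), "08x")
    return f"{network_type}-{ip_hex}"


def expected_outputs(fdb_entries, local_ip):
    """Map each network ID to the tunnel ports the agent should flood to."""
    result = {}
    for net_id, net in fdb_entries.items():
        result[net_id] = sorted(
            tunnel_port_name(net["network_type"], ip)
            for ip in net["ports"]
            if ip != local_ip  # an agent never tunnels to itself
        )
    return result


# Payload shaped like rpc-1; PortInfo objects are written as plain tuples.
rpc_1 = {
    "8883e077-aadb-4b79-9315-3c029e94a857": {
        "segment_id": 22,
        "network_type": "vxlan",
        "ports": {
            "30.0.1.6": [("00:00:00:00:00:00", "0.0.0.0"),
                         ("fa:16:3e:db:75:11", "192.168.1.2")],
            "30.0.1.8": [("00:00:00:00:00:00", "0.0.0.0"),
                         ("fa:16:3e:45:eb:6a", "192.168.1.141")],
        },
    }
}

print(expected_outputs(rpc_1, local_ip="30.0.1.7"))
```

If rpc-1 never reaches compute1, nothing else in the listing carries
the 30.0.1.6 entry to that host.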

1. After iter_num=0, cleanup_stale_flows removes the stale flows from
table=21 and table=22.
2. If compute1 receives rpc-4 first, tunnels_missing=False.
3. rpc-1 times out and is never received by compute1.
4. As a result, the table=22,priority=1 flow is missing
output="vxlan-1e000106", and table=21,priority=1 is missing the ARP
responder flow for 192.168.1.2.
5. The missing flows are never restored, so VMs on this network cannot
communicate at layer 2 with VMs on the network node.
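
The steps above can be replayed with a small model. This is a sketch
under stated assumptions, not Neutron code: the Agent class and
simulate function are hypothetical, and the rule "refresh_tunnels is
True when iter_num == 0 or when no tunnel is known" is inferred from
the refresh_tunnels and tunnels_missing values quoted in this report.

```python
NETWORK, COMPUTE1, COMPUTE2 = "30.0.1.6", "30.0.1.7", "30.0.1.8"


class Agent:
    """Toy model of one ovs-agent; all names here are hypothetical."""

    def __init__(self, local_ip, stale_tunnels):
        self.local_ip = local_ip
        self.iter_num = 0
        # Remote local_ips that currently have flows in table=21/22.
        self.tunnels = set(stale_tunnels)

    def cleanup_stale_flows(self):
        # Flows left over from before the restart are removed.
        self.tunnels.clear()

    def apply_fdb(self, remote_ips):
        # add_fdb_entries: install flows for every remote except ourselves.
        self.tunnels |= {ip for ip in remote_ips if ip != self.local_ip}

    def refresh_tunnels(self):
        # Assumed rule: full refresh on the first iteration, or when the
        # agent knows no tunnel at all (tunnels_missing=True).
        tunnels_missing = not self.tunnels
        return self.iter_num == 0 or tunnels_missing


def simulate(rpc1_delivered, rpc4_first):
    """Return the remote local_ips still missing on compute1."""
    rpc_1 = [NETWORK, COMPUTE2]  # full FDB sent directly to compute1
    rpc_4 = [COMPUTE2]           # fanout caused by compute2's restart
    agent = Agent(COMPUTE1, stale_tunnels=rpc_1)

    # iter_num=0: refresh requested, then stale flows are cleaned up.
    assert agent.refresh_tunnels()
    agent.cleanup_stale_flows()
    if rpc4_first:
        agent.apply_fdb(rpc_4)
    if rpc1_delivered:
        agent.apply_fdb(rpc_1)

    # iter_num>0: a full refresh happens only if the assumed rule fires.
    agent.iter_num += 1
    if agent.refresh_tunnels():
        agent.apply_fdb(rpc_1)
    return set(rpc_1) - agent.tunnels


print(simulate(rpc1_delivered=False, rpc4_first=True))   # the reported case
print(simulate(rpc1_delivered=False, rpc4_first=False))  # recovers on its own
```

In this model the flows towards 30.0.1.6 stay missing only when rpc-1 is
lost and rpc-4 has already created a tunnel, which matches the sequence
described in the list.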

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1978088

Title:
  After ovs-agent restart, table=21 and table=22 on br-tun openflow
  table is missing

Status in neutron:
  New

