Public bug reported:

This happened in gate:
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1df/927631/1/check/neutron-ovn-grenade-multinode/1dfe841/testr_results.html

From tempest's perspective, the failure looks like the neutron API timed out.

When I inspect one of the failed requests (both are create_router calls), I can
see that the request is stuck for more than a minute, then returns 201,
but by then it's too late for tempest.

In the neutron-server log, we see that a router port is created:

```
Aug 30 18:47:37.411410 np0038347053 neutron-server[151218]: DEBUG 
neutron.scheduler.l3_ovn_scheduler [None 
req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae None None] Gateway 
lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab scheduled on chassis 
['9cda8b5d-48b4-470c-9200-595fc41cc4b0'] {{(pid=151218) _schedule_gateway 
/opt/stack/new/neutron/neutron/scheduler/l3_ovn_scheduler.py:90}}
Aug 30 18:47:37.411841 np0038347053 neutron-server[151218]: INFO 
neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.commands [None 
req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae None None] Schedule LRP 
lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab on gateway 
9cda8b5d-48b4-470c-9200-595fc41cc4b0 with priority 1
```

ovn-controller claims the port, as expected:


```
2024-08-30T18:47:37.424Z|00124|binding|INFO|Claiming lport 
cr-lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab for this chassis.
2024-08-30T18:47:37.424Z|00125|binding|INFO|cr-lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab:
 Claiming fa:16:3e:6c:61:ec 172.24.5.195/24 2001:db8::378/64
```

neutron-server then receives the OVN event about the claim:

```
Aug 30 18:47:37.446442 np0038347053 neutron-server[151218]: DEBUG 
ovsdbapp.backend.ovs_idl.event [None req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae 
None None] Matched UPDATE: LogicalSwitchPortUpdateUpEvent(events=('update',), 
table='Logical_Switch_Port', conditions=None, old_conditions=None), priority=20 
to row=Logical_Switch_Port(up=[True], 
name=d368456e-fb18-44b6-8b1b-bfe5b17efcab, external_ids={'neutron:cidrs': 
'172.24.5.195/24 2001:db8::378/64', 'neutron:device_id': 
'7094f59f-6a10-4481-aaa8-55d8fb688b35', 'neutron:device_owner': 
'network:router_gateway', 'neutron:mtu': '', 'neutron:network_name': 
'neutron-90a590ed-ede0-48d0-92d2-786c7f66b733', 'neutron:port_capabilities': 
'', 'neutron:port_name': '', 'neutron:project_id': '', 
'neutron:revision_number': '1', 'neutron:security_group_ids': '', 
'neutron:subnet_pool_addr_scope4': '', 'neutron:subnet_pool_addr_scope6': '', 
'neutron:vnic_type': 'normal'}, ha_chassis_group=[], 
options={'exclude-lb-vips-from-garp': 'true', 'nat-addresses': 'router', 
'router-port': 'lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab'}, 
parent_name=[], enabled=[True], type=router, dhcpv6_options=[], tag_request=[], 
port_security=[], addresses=['router'], dynamic_addresses=[], 
dhcpv4_options=[], tag=[]) old=Logical_Switch_Port(up=[False]) {{(pid=151218) 
matches 
/usr/local/lib/python3.10/dist-packages/ovsdbapp/backend/ovs_idl/event.py:55}}
Aug 30 18:47:37.447796 np0038347053 neutron-server[151218]: INFO 
neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [None 
req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] OVN reports status up for 
port: d368456e-fb18-44b6-8b1b-bfe5b17efcab
```
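
As a side note, here is a minimal, hedged sketch of what such an ovsdbapp row event looks like (the real class lives in neutron's OVN mech driver ovsdb code and is more involved; `set_port_status_up` is the handler implied by the "OVN reports status up" message above):

```
# Hedged sketch, not neutron's verbatim code. conditions=None matches the
# log above; the False -> True flip of 'up' is checked in match_fn instead.
from ovsdbapp.backend.ovs_idl import event as row_event


class LogicalSwitchPortUpdateUpEvent(row_event.RowEvent):
    def __init__(self, driver):
        super().__init__((self.ROW_UPDATE,), 'Logical_Switch_Port', None)
        self.driver = driver

    def match_fn(self, event, row, old=None):
        # 'up' is an optional OVSDB bool, surfaced by the Python IDL as a
        # list (up=[True] / up=[False]); 'old' only carries changed columns.
        return row.up == [True] and getattr(old, 'up', None) == [False]

    def run(self, event, row, old):
        # Hands off to the mech driver, which logs "OVN reports status up
        # for port: <name>" and starts provisioning completion for entity L2.
        self.driver.set_port_status_up(row.name)
```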

It proceeds with provisioning.

```
Aug 30 18:47:37.457253 np0038347053 neutron-server[151218]: DEBUG 
neutron.db.provisioning_blocks [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 
None None] Provisioning complete for port d368456e-fb18-44b6-8b1b-bfe5b17efcab 
triggered by entity L2. {{(pid=151218) provisioning_complete 
/opt/stack/new/neutron/neutron/db/provisioning_blocks.py:144}}
...
Aug 30 18:47:37.457763 np0038347053 neutron-server[151218]: INFO 
neutron.plugins.ml2.plugin [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None 
None] Attempt 1 to provision port d368456e-fb18-44b6-8b1b-bfe5b17efcab
```

Looking at the code, the first message is logged when provisioning_complete
is executed. But if a block were actually removed from the database, we
would also see:


```
        if remove_provisioning_component(context, object_id, object_type,
                                         entity, standard_attr_id):
            LOG.debug("Provisioning for %(otype)s %(oid)s completed by entity "
                      "%(entity)s.", log_dict)

```

And we don't. Was the provisioning object not in the database?
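
To make that concrete, here is a hedged, simplified paraphrase of the flow in neutron/db/provisioning_blocks.py (not the verbatim source): the "completed by entity" debug requires an actual row deletion, while the "Provisioning complete ... triggered by entity" message we do see only requires that zero blocks remain afterwards.

```
# Hedged paraphrase of provisioning_complete(); simplified, not verbatim.
def provisioning_complete(context, object_id, object_type, entity):
    log_dict = {'oid': object_id, 'otype': object_type, 'entity': entity}
    standard_attr_id = _get_standard_attr_id(context, object_id, object_type)
    if not standard_attr_id:
        return
    if remove_provisioning_component(context, object_id, object_type,
                                     entity, standard_attr_id):
        # Only logged when a ProvisioningBlock row was actually deleted;
        # this line is absent from our logs.
        LOG.debug("Provisioning for %(otype)s %(oid)s completed by entity "
                  "%(entity)s.", log_dict)
    records = context.session.query(pb_model.ProvisioningBlock).filter_by(
        standard_attr_id=standard_attr_id).count()
    if not records:
        # The message we DO see: fires whenever no blocks remain, even if
        # there was never a row to delete in the first place.
        LOG.debug("Provisioning complete for %(otype)s %(oid)s triggered "
                  "by entity %(entity)s.", log_dict)
        registry.publish(...)  # kicks the PROVISIONING_COMPLETE callbacks
```

If that reading is right, the logs are consistent with remove_provisioning_component() finding no row to delete: the L2 block was apparently not in the database at that point.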


Later, it fails to switch the port to ACTIVE in the neutron database because:

```
Aug 30 18:47:37.499272 np0038347053 neutron-server[151218]: DEBUG 
neutron.plugins.ml2.plugin [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None 
None] Port d368456e-fb18-44b6-8b1b-bfe5b17efcab cannot update to ACTIVE because 
it is not bound. {{(pid=151218) _port_provisioned 
/opt/stack/new/neutron/neutron/plugins/ml2/plugin.py:375}}
```

I think this is fine, since it's a router port. The provisioning hook,
_port_provisioned, then exits (since it's not a compute-owned port).
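
For reference, a hedged sketch of the guard that emits this message (paraphrased from Ml2Plugin._port_provisioned in neutron/plugins/ml2/plugin.py; surrounding checks omitted):

```
# Hedged paraphrase, not verbatim neutron code.
def _port_provisioned(self, rtype, event, trigger, payload=None):
    port_id = payload.resource_id
    port = db.get_port(payload.context, port_id)
    if not port:
        return
    binding = ...  # the port's binding record, if any
    if binding.vif_type in (portbindings.VIF_TYPE_BINDING_FAILED,
                            portbindings.VIF_TYPE_UNBOUND):
        # Our case: the gateway port is not bound yet, so the hook logs
        # the message above and returns without transitioning the port.
        LOG.debug("Port %s cannot update to ACTIVE because it "
                  "is not bound.", port_id)
        return
    ...  # bound ports continue towards status ACTIVE
```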


Later, a new L2 provisioning block is added for the port:

```
Aug 30 18:47:37.771402 np0038347053 neutron-server[151219]: DEBUG 
neutron.db.provisioning_blocks [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 
None None] Transition to ACTIVE for port object 
d368456e-fb18-44b6-8b1b-bfe5b17efcab will not be triggered until provisioned by 
entity L2. {{(pid=151219) add_provisioning_component 
/opt/stack/new/neutron/neutron/db/provisioning_blocks.py:74}}
```

This message comes from add_provisioning_component. Why is a block being
added right when we were trying to remove it because of the L2 claim?
Note that the add runs in a different worker (pid 151219,
req-0a3bd173...) than the provisioning_complete call (pid 151218,
req-a3babd5e...), which suggests a race between the two requests.
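
For reference, a hedged paraphrase of add_provisioning_component (again simplified from neutron/db/provisioning_blocks.py): the quoted message is only logged after a fresh ProvisioningBlock row has been inserted, so at this point the L2 block really is back in the database.

```
# Hedged paraphrase, not verbatim neutron code.
def add_provisioning_component(context, object_id, object_type, entity):
    log_dict = {'oid': object_id, 'otype': object_type, 'entity': entity}
    standard_attr_id = _get_standard_attr_id(context, object_id, object_type)
    if not standard_attr_id:
        return
    try:
        with db_api.CONTEXT_WRITER.using(context):
            context.session.add(pb_model.ProvisioningBlock(
                standard_attr_id=standard_attr_id, entity=entity))
    except db_exc.DBDuplicateEntry:
        # Re-adding an existing block is ignored and logs differently.
        LOG.debug("Ignored duplicate provisioning block for %(otype)s "
                  "%(oid)s by entity %(entity)s.", log_dict)
        return
    # The message quoted above: a brand-new block row was just inserted.
    LOG.debug("Transition to ACTIVE for %(otype)s object %(oid)s will not "
              "be triggered until provisioned by entity %(entity)s.",
              log_dict)
```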

The log message just above it suggests the block was added from an update call on the port:

```
Aug 30 18:47:37.763123 np0038347053 neutron-server[151219]: DEBUG 
neutron_lib.callbacks.manager [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 
None None] Publish callbacks 
['neutron.services.qos.qos_plugin.QoSPlugin._validate_update_port_callback-263536017']
 for port (d368456e-fb18-44b6-8b1b-bfe5b17efcab), precommit_update 
{{(pid=151219) _notify_loop 
/usr/local/lib/python3.10/dist-packages/neutron_lib/callbacks/manager.py:184}}
```

But I couldn't immediately figure out which update it was.


A bit later:

```
Aug 30 18:47:52.291827 np0038347053 neutron-server[151218]: WARNING 
neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_client [None 
req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] No hosting information 
found for port d368456e-fb18-44b6-8b1b-bfe5b17efcab: RuntimeError: No hosting 
information found for port d368456e-fb18-44b6-8b1b-bfe5b17efcab
```

Eventually, almost a minute after the requests started:

```
Aug 30 18:48:35.235725 np0038347053 neutron-server[151218]: INFO 
neutron.db.ovn_revision_numbers_db [None 
req-95b12168-c978-4195-b1cf-658e50efc7bd admin admin] Successfully bumped 
revision number for resource d368456e-fb18-44b6-8b1b-bfe5b17efcab (type: 
router_ports) to 1
...
Aug 30 18:48:36.048995 np0038347053 neutron-server[151219]: DEBUG 
neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 
None None] Attempting to bind port d368456e-fb18-44b6-8b1b-bfe5b17efcab on host 
np0038347053 for vnic_type normal with profile  {{(pid=151219) bind_port 
/opt/stack/new/neutron/neutron/plugins/ml2/managers.py:810}}
Aug 30 18:48:36.049143 np0038347053 neutron-server[151219]: DEBUG 
neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 
None None] Attempting to bind port d368456e-fb18-44b6-8b1b-bfe5b17efcab by 
drivers ovn,logger on host np0038347053 at level 0 using segments [{'id': 
'0f624452-4286-4ce4-9bc0-aa5c3726df31', 'network_type': 'flat', 
'physical_network': 'public', 'segmentation_id': None, 'network_id': 
'90a590ed-ede0-48d0-92d2-786c7f66b733'}] {{(pid=151219) _bind_port_level 
/opt/stack/new/neutron/neutron/plugins/ml2/managers.py:835}}
...
Aug 30 18:48:36.052615 np0038347053 neutron-server[151219]: DEBUG 
neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 
None None] Bound port: d368456e-fb18-44b6-8b1b-bfe5b17efcab, host: 
np0038347053, vif_type: ovs, vif_details: {"port_filter": true, "connectivity": 
"l2", "bridge_name": "br-int", "datapath_type": "system"}, binding_levels: 
[{'bound_driver': 'ovn', 'bound_segment': {'id': 
'0f624452-4286-4ce4-9bc0-aa5c3726df31', 'network_type': 'flat', 
'physical_network': 'public', 'segmentation_id': None, 'network_id': 
'90a590ed-ede0-48d0-92d2-786c7f66b733'}}] {{(pid=151219) _bind_port_level 
/opt/stack/new/neutron/neutron/plugins/ml2/managers.py:947}}
```


---


This happened in a grenade job, but I don't think it's grenade-specific.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: gate-failure ovn

https://bugs.launchpad.net/bugs/2078846

Title:
  OVN fails to transition router port to ACTIVE, claims L2 provisioning
  block is still present

Status in neutron:
  New
