Public bug reported:

We have seen random failures of
test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]
in the nova-live-migration job with the following error:

  Details: {'code': 400, 'message': 'Migration pre-check error: Binding failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check neutron logs for more information.'}

Looking at the neutron log we see:

  May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Refusing to bind port e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent: <neutron.plugins.ml2.drivers.ovn.agent.neutron_agent.ControllerAgent object at 0x7f6a7a6d2950>

  May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for vnic_type normal using segments [{'id': '1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 525, 'network_id': '745f0724-2779-4d60-845c-8f673d567d0d'}]

and the following in the neutron-ovn-metadata-agent log on the host that the VM is migrating to:

  May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]: DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis table for 10 seconds {{(pid=38857) run /opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

This looks like it might be related to
https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

That commit modified the code to add some randomness due to
https://bugs.launchpad.net/neutron/+bug/1991817
but that seems to negatively impact the stability of the agent.

To fix this I will propose a patch to change the interval from

  interval = randint(0, cfg.CONF.agent_down_time // 2)

to

  interval = randint(0, cfg.CONF.agent_down_time // 3)

to increase the likelihood that we send the heartbeat in time.

When we are making calls to privsep and OVS, the logs stop for multiple seconds while those operations are happening, and if that happens at the wrong time I believe it leads to us missing the heartbeat interval.

** Affects: neutron
     Importance: Undecided
     Assignee: sean mooney (sean-k-mooney)
         Status: New

** Changed in: neutron
     Assignee: (unassigned) => sean mooney (sean-k-mooney)

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2020215

Title:
  ml2/ovn refuses to bind port due to dead agent randomly in the
  nova-live-migrate ci job

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2020215/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
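
As a rough illustration of the interval change proposed in this report, the sketch below compares the largest delay that randint(0, agent_down_time // 2) and randint(0, agent_down_time // 3) can produce, and how much of agent_down_time is left over for slow privsep or OVS calls in the worst case. It is not the neutron code: the helper names are hypothetical, the value 75 is only an example, and it assumes, as a simplification, that the agent is treated as dead once agent_down_time seconds pass without a heartbeat.

```python
import random


def max_delay(agent_down_time: int, divisor: int) -> int:
    """Largest value randint(0, agent_down_time // divisor) can return."""
    return agent_down_time // divisor


def pick_delay(agent_down_time: int, divisor: int, rng: random.Random) -> int:
    """Draw one randomized delay; randint includes both endpoints."""
    return rng.randint(0, agent_down_time // divisor)


def worst_case_slack(agent_down_time: int, divisor: int) -> int:
    """Seconds left before agent_down_time expires after the longest delay.

    Simplifying assumption: the agent counts as dead once agent_down_time
    seconds pass without a heartbeat.
    """
    return agent_down_time - max_delay(agent_down_time, divisor)


if __name__ == "__main__":
    agent_down_time = 75  # example value only, not taken from the report
    for divisor in (2, 3):
        print(
            f"// {divisor}: max delay {max_delay(agent_down_time, divisor)}s, "
            f"worst-case slack {worst_case_slack(agent_down_time, divisor)}s"
        )
```

With a larger divisor the upper bound on the delay shrinks, so more of the agent_down_time window remains if other operations stall the agent for a few seconds.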