Public bug reported: Queens, ovsdb native interface.
On a loaded gateway node hosting > 1000 ports, restarting neutron-openvswitch-agent causes the agent at some point to stop sending state reports and to stop logging for a significant time, depending on the number of ports. In our case the gateway node hosts > 1400 ports and the agent hangs for ~100 seconds. If the configured agent_down_time is less than 100 seconds, the neutron server therefore sees the agent as down and starts rescheduling resources. Once the agent stops hanging, it sees itself as "revived" and starts a new full sync. This loop is almost endless.

Debugging showed the culprit is process_trusted_ports: https://github.com/openstack/neutron/blob/13.0.4/neutron/agent/linux/openvswitch_firewall/firewall.py#L655 - this function does not yield control to other greenthreads and blocks until all trusted ports are processed. Since almost all ports on gateway nodes are "trusted" (router and dhcp ports), process_trusted_ports may take significant time.

The proposal is to add eventlet.sleep(0) inside the loop in process_trusted_ports - that fixed the issue in our environment.

** Affects: neutron
     Importance: High
     Assignee: Oleg Bondarev (obondarev)
         Status: In Progress

** Tags: ovs-fw

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1836023

Title:
  OVS agent "hangs" while processing trusted ports

Status in neutron:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1836023/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
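
The starvation described in the bug report, and the effect of yielding inside the loop, can be illustrated with a small stand-alone sketch. It is not the neutron code: it uses the standard library's asyncio instead of eventlet so that it runs without extra dependencies, with `await asyncio.sleep(0)` playing the role that `eventlet.sleep(0)` plays for greenthreads. The function names `report_state`, `run` and `first_heartbeat_index`, the `cooperative` flag, and the per-port work are hypothetical placeholders.

```python
import asyncio


async def process_trusted_ports(port_ids, log, cooperative):
    """Hypothetical stand-in for the firewall loop over trusted ports."""
    for port_id in port_ids:
        log.append(("port", port_id))  # placeholder for per-port flow setup
        if cooperative:
            # Yield to the scheduler so other tasks (e.g. state reports) can run.
            await asyncio.sleep(0)


async def report_state(log, beats):
    """Hypothetical heartbeat task standing in for the agent's state report."""
    for i in range(beats):
        log.append(("heartbeat", i))
        await asyncio.sleep(0)


async def _run(cooperative, n_ports, beats):
    log = []
    # Both tasks share one thread, as greenthreads do; the port loop starts first.
    await asyncio.gather(
        process_trusted_ports(range(n_ports), log, cooperative),
        report_state(log, beats),
    )
    return log


def run(cooperative, n_ports=5, beats=3):
    """Run both tasks to completion and return the interleaved event log."""
    return asyncio.run(_run(cooperative, n_ports, beats))


def first_heartbeat_index(log):
    """Position in the log at which the first heartbeat was recorded."""
    return next(i for i, (kind, _) in enumerate(log) if kind == "heartbeat")


if __name__ == "__main__":
    print("blocking   :", first_heartbeat_index(run(cooperative=False)))
    print("cooperative:", first_heartbeat_index(run(cooperative=True)))
```

Without the yield, no heartbeat is recorded until every port has been processed, which mirrors the missed state reports in the report. With the yield, the heartbeat is interleaved with the port processing. The total work is unchanged; only the scheduling differs.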