Hi Ilya, let me jump into this thread.
Right now I’m debugging the behaviour of OVN (22.09.x) and ovsdb-server 3.1.0, where a single OVSDB update3 notification makes ovsdb-server pin a CPU at 100% for 5-6 minutes. This ovsdb-server acts as a relay for the OVN Southbound DB with only 5-6 clients connected to it (ovn-controllers on chassis acting as central chassis for external access, with ha_group enabled for edge LRs). During this time the relay fails to answer OVSDB inactivity probes, so clients and even the upstream ovsdb-servers disconnect the relay on the 60-second probe timeout. All probe intervals are configured to 60 seconds (ovsdb-server SB cluster <-> ovsdb SB relay <-> ovn-controller).

Earlier I posted a long-read with some of these problems listed [1].

IIUC, this update is generated by ovn-northd after one LS with only one LSP of type router and an attached LB is removed. You can see the request JSON here: [2]. Such updates appear not only when an LS/LB is removed but also on some other operations; this is just an example. So it seems that ovn-northd re-creates a big datapath group, and such an update is for some reason expensive for the ovsdb relay to handle (the ovn-controllers also go to 100% CPU).

Have you seen such behaviour? Do you have any suggestions about the cause and a possible fix for such a huge load from a single update3? Thanks.

1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403699.html
2: https://gist.github.com/odivlad/bba4443e589a268a0f389c2972511df3

> On 2 May 2023, at 14:49, Ilya Maximets via discuss
> <ovs-discuss@openvswitch.org> wrote:
>
> From my side, the first option would be increasing the inactivity
> probe on the ovn-controller side and seeing if that resolves the issue.
> Deployments typically have 60+ seconds set, just in case.
>
> Also, if you're not already using the latest versions of OVS/OVN, an
> upgrade may resolve the issue as well. For example, OVS 2.17 provides
> a big performance improvement over previous versions, and 3.0 and 3.1
> give even more on top. And with new OVN releases, the southbound
> database size usually goes down significantly, reducing the load on
> the OVSDB server. I'd suggest using releases after OVN 22.09 for
> large-scale deployments.
>
> However, if your setup has only one switch with 250 ports and you
> have an issue, that should not really be related to scale, and you
> need to investigate further what exactly is happening.
>
> Best regards, Ilya Maximets.
>
> On 5/2/23 08:58, Felix Hüttner via discuss wrote:
>> Hi Gavin,
>>
>> we saw similar issues after reaching a certain number of hypervisors.
>> This happened because our ovsdb processes ran at 100% CPU utilization
>> (and they are not multithreaded).
>>
>> Our solutions were:
>>
>> 1. If you use SSL on your north-/southbound DB, disable it and add a
>> TLS-terminating reverse proxy (like traefik) in front.
>> 2. Increase the inactivity probe significantly (you might need to
>> change it on both the ovn-controller and ovsdb side, not sure anymore).
>> 3. Introduce ovsdb relays and connect the ovn-controllers there.
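(For reference, since probe tuning keeps coming up in this thread: the 60-second probes I mentioned above are set on both ends of each connection. A minimal sketch, assuming a single row in the SB Connection table; values are in milliseconds, and record selectors may differ in your deployment:

  # On every hypervisor: probe interval ovn-controller uses towards
  # the SB DB or relay it is connected to.
  ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000

  # On the SB DB side: probe interval ovsdb-server uses towards its
  # clients, stored in the Connection table.
  ovn-sbctl set connection . inactivity_probe=60000

If the relay also takes its listener from the database, e.g. --remote=db:OVN_Southbound,SB_Global,connections, the same Connection row covers the relay's client-facing probes as well.)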
>>
>> --
>>
>> Felix Huettner
>>
>> From: discuss <ovs-discuss-boun...@openvswitch.org> On Behalf Of Gavin McKee via discuss
>> Sent: Monday, May 1, 2023 9:20 PM
>> To: ovs-discuss <ovs-discuss@openvswitch.org>
>> Subject: [ovs-discuss] CPU pinned at 100%, ovn-controller to ovnsb_db unstable
>>
>> Hi,
>>
>> I'm having a pretty bad issue with ovn-controller on the hypervisors being unable to connect to the OVN SB DB:
>>
>> 2023-05-01T19:13:33.969Z|00541|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:13:33.969Z|00542|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:13:43.043Z|00543|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:13:56.115Z|00544|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:13:56.115Z|00545|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:14:36.177Z|00546|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:14:44.996Z|00547|jsonrpc|WARN|tcp:10.193.1.2:6642: receive error: Connection reset by peer
>> 2023-05-01T19:14:44.996Z|00548|reconnect|WARN|tcp:10.193.1.2:6642: connection dropped (Connection reset by peer)
>> 2023-05-01T19:15:44.131Z|00549|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:15:54.137Z|00550|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:15:54.137Z|00551|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:16:02.184Z|00552|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:16:14.488Z|00553|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:16:14.488Z|00554|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>
>> This happened after pushing a configuration to the northbound DB for around 250 logical switch ports.
>>
>> Once I turn on the VMs, everything goes bad very quickly:
>>
>> 2023-05-01T04:27:09.294Z|01947|poll_loop|INFO|wakeup due to [POLLOUT] on fd 66 (10.193.200.6:6642<->10.193.0.102:48794) at ../lib/stream-fd.c:153 (100% CPU usage)
>>
>> Can anyone provide any guidance on how to run down an issue like this?
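To the original question on how to run an issue like this down: the first thing I check is whether the SB ovsdb-server process itself is the one pinned at 100%, and how much monitor state it is carrying. A rough sketch (the control-socket path below is the usual packaging default for the OVN SB DB and may differ on your system):

  # Is ovsdb-server the process eating the CPU? pidof may return
  # several PIDs if the NB and SB servers run on the same host.
  top -b -n1 -p "$(pidof ovsdb-server | tr ' ' ',')"

  # Memory report of the SB ovsdb-server; the monitors/sessions
  # counters show how much monitor state it maintains.
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

  # Internal event counters; rerun to see which ones are growing.
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show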
Regards,
Vladislav Odintsov

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss