I ran "perf record -F99 -p $(pidof ovsdb-server) -- sleep 30" on the ovsdb-server process during the CPU spike. perf report output:
Samples: 2K of event 'cpu-clock', Event count (approx.): 29989898690
  Overhead  Command       Shared Object                Symbol
    58.71%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way
     7.61%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011058
     6.60%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_compare_3way
     5.93%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_match_any_clause
     2.26%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011070
     2.19%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way@plt
     2.16%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ffd
     1.68%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe8
     1.65%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110ec
     1.38%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011084
     1.31%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001100e
     1.25%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011063
     1.08%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way
     1.04%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe0
     0.67%  ovsdb-server  libc-2.17.so                 [.] _int_malloc
     0.54%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105a
     0.51%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_monitor_get_update
     0.47%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_hash
     0.30%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_destroy__
     0.24%  ovsdb-server  libc-2.17.so                 [.] malloc_consolidate
     0.24%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_hash
     0.20%  ovsdb-server  [kernel.kallsyms]            [k] __do_softirq
     0.17%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_string
     0.13%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_destroy
     0.13%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe5
     0.10%  ovsdb-server  libc-2.17.so                 [.] __memset_sse2
     0.10%  ovsdb-server  libc-2.17.so                 [.] __strlen_sse2_pminub
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fea
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011020
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011067
     0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way@plt
     0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_equals
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ff1
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011025
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001108b
     0.03%  ovsdb-server  [kernel.kallsyms]            [k] __netif_schedule
     0.03%  ovsdb-server  libc-2.17.so                 [.] __strncmp_sse42
     0.03%  ovsdb-server  libc-2.17.so                 [.] _int_free
     0.03%  ovsdb-server  libc-2.17.so                 [.] free
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_clone
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_from_json
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] shash_find
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e2
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e8
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_from_json
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe1
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fff
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105e
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110c3
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110f2
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c007
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c030
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c05b
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c45f
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c535
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c644
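Most of the samples land in uuid_compare_3way and in unresolved 0x... addresses inside libovsdb, so a caller-level view would probably tell more. If it helps, I can re-record with call graphs and debug symbols; roughly something like this (a sketch only — the debuginfo package name is a guess for this build, and the host looks like CentOS/RHEL 7 judging by libc-2.17):

  # Install debug symbols so the 0x... addresses in libovsdb resolve
  # (package name is an assumption; it may differ for your OVS build).
  yum install -y yum-utils
  debuginfo-install -y openvswitch

  # Re-record with call graphs to see who is calling uuid_compare_3way,
  # then show only entries above 1% self overhead.
  perf record -F 99 -g -p $(pidof ovsdb-server) -- sleep 30
  perf report --stdio --no-children --percent-limit 1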
> On 2 May 2023, at 19:45, Ilya Maximets via discuss
> <ovs-discuss@openvswitch.org> wrote:
>
> On 5/2/23 16:45, Vladislav Odintsov wrote:
>> Hi Ilya,
>>
>> let me jump into this thread.
>>
>> Right now I’m debugging the behaviour of OVN (22.09.x) and ovsdb-server
>> 3.1.0, where one ovsdb update3 request makes ovsdb-server, which acts as a
>> relay for the OVN Southbound DB with only 5-6 clients connected to it
>> (ovn-controllers acting as a central chassis for external access, with
>> ha_group enabled for edge LRs), utilize 100% CPU for 5-6 minutes.
>> During this time the ovsdb relay fails to answer ovsdb inactivity probes,
>> and then clients and even the upstream ovsdb-servers disconnect this relay
>> because of the 60 s ping probe timeout. All the probe intervals are
>> configured to 60 seconds (ovsdb-server SB cluster <-> ovsdb SB relay <->
>> ovn-controller). Earlier I’ve posted a long read with some problems listed
>> in [1].
>>
>> IIUC, this update is generated by ovn-northd after one LS with only one LSP
>> of type router and an attached LB is removed.
>> You can see the request json here: [2]
>> Such updates appear not only when an LS/LB is removed but also in some other
>> operations; this is just an example.
>> So it seems like ovn-northd re-creates a big dp group, and such an update is
>> for some reason difficult for the ovsdb relay to handle (actually the
>> ovn-controllers also utilize 100% CPU).
>>
>> Have you seen such behaviour? Maybe you’ve got a suggestion about the
>> reason and a possible fix for such a huge load from one update3?
>
> Such big updates are typically triggered by the fact that
> northd doesn't modify datapath groups. It re-creates them
> instead. I'm actually working on a patch for this since
> last week and hope to post it before the soft freeze for 23.06.
>
> In your case, however, it seems like you have a new datapath
> group created for a load balancer. And it looks like you have
> about 1000 switches in the setup, which is a lot, but...
>
> Looking at the dump [2], it's big for a human, but it's only
> about 100 KB in size. The database shouldn't have a lot of trouble
> processing that, especially in 3.1. So, I'm not sure what
> exactly is going on in your case. If you can profile the relay
> server and see what it is doing all that time, that would be
> helpful.
>
> Best regards, Ilya Maximets.
>
>>
>> Thanks.
>>
>> 1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403699.html
>> 2: https://gist.github.com/odivlad/bba4443e589a268a0f389c2972511df3
>>
>>
>>> On 2 May 2023, at 14:49, Ilya Maximets via discuss
>>> <ovs-discuss@openvswitch.org> wrote:
>>>
>>> From my side, the first option would be increasing the inactivity
>>> probe on the ovn-controller side and seeing if that resolves the issue.
>>> Deployments typically have 60+ seconds set, just in case.
>>>
>>> Also, if you're not already using the latest versions of OVS/OVN, an upgrade
>>> may resolve the issue as well. For example, OVS 2.17 provides a big
>>> performance improvement over previous versions, and 3.0 and 3.1 give
>>> even more on top. And with new OVN releases, the southbound database
>>> size usually goes down significantly, reducing the load on the OVSDB
>>> server. I'd suggest using releases after OVN 22.09 for large-scale
>>> deployments.
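For reference, the 60-second probe intervals discussed above are set roughly as follows (a minimal sketch; it assumes a single Connection record in the Southbound DB and that ovn-controller takes its settings from the local Open_vSwitch table; values are in milliseconds):

  # Southbound DB side: inactivity probe for client connections
  # (assumes exactly one Connection record, hence the "." selector).
  ovn-sbctl set connection . inactivity_probe=60000

  # Hypervisor side: probe interval ovn-controller uses towards the SB DB.
  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=60000

The relay's own probe towards its upstream server is a separate setting and is not shown here.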
>>>
>>> However, if your setup has only one switch with 250 ports and you
>>> have an issue, that should not really be related to scale, and you
>>> need to investigate further what exactly is happening.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>> On 5/2/23 08:58, Felix Hüttner via discuss wrote:
>>>> Hi Gavin,
>>>>
>>>> we saw similar issues after reaching a certain number of hypervisors. This
>>>> happened because our ovsdb processes ran at 100% CPU utilization (and they
>>>> are not multithreaded).
>>>>
>>>> Our solutions were:
>>>>
>>>> 1. If you use SSL on your north-/southbound DB, disable it and add a
>>>>    TLS-terminating reverse proxy (like traefik) in front.
>>>> 2. Increase the inactivity probe significantly (you might need to change
>>>>    it on both the ovn-controller and ovsdb side, not sure anymore).
>>>> 3. Introduce ovsdb relays and connect the ovn-controllers there.
>>>>
>>>> --
>>>>
>>>> Felix Huettner
>>>>
>>>> *From:* discuss <ovs-discuss-boun...@openvswitch.org> *On Behalf Of* Gavin
>>>> McKee via discuss
>>>> *Sent:* Monday, May 1, 2023 9:20 PM
>>>> *To:* ovs-discuss <ovs-discuss@openvswitch.org>
>>>> *Subject:* [ovs-discuss] CPU pinned at 100%, ovn-controller to ovnsb_db
>>>> unstable
>>>>
>>>> Hi,
>>>>
>>>> I'm having a pretty bad issue with OVN controller on the hypervisors being
>>>> unable to connect to the OVN SB DB:
>>>>
>>>> 2023-05-01T19:13:33.969Z|00541|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:13:33.969Z|00542|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:13:43.043Z|00543|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:13:56.115Z|00544|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:13:56.115Z|00545|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:14:36.177Z|00546|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:14:44.996Z|00547|jsonrpc|WARN|tcp:10.193.1.2:6642: receive error: Connection reset by peer
>>>> 2023-05-01T19:14:44.996Z|00548|reconnect|WARN|tcp:10.193.1.2:6642: connection dropped (Connection reset by peer)
>>>> 2023-05-01T19:15:44.131Z|00549|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:15:54.137Z|00550|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:15:54.137Z|00551|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:16:02.184Z|00552|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:16:14.488Z|00553|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:16:14.488Z|00554|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>
>>>> This happened after pushing a configuration to the north DB for around 250
>>>> logical switch ports.
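As an illustration of option 3 in Felix's list above: an OVSDB relay for the Southbound DB is just another ovsdb-server that follows the existing SB server, with the ovn-controllers re-pointed at it. A minimal sketch, reusing 10.193.1.2:6642 from the logs above; the 16642 listen port and RELAY_IP are placeholders:

  # Run a Southbound relay that maintains a copy of the existing SB database.
  ovsdb-server -vconsole:info --remote=ptcp:16642 \
      relay:OVN_Southbound:tcp:10.193.1.2:6642

  # On each hypervisor, point ovn-controller at the relay instead of the cluster.
  ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:RELAY_IP:16642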
>>>>
>>>> Once I turn on the VMs, everything goes bad very quickly:
>>>>
>>>> 2023-05-01T04:27:09.294Z|01947|poll_loop|INFO|wakeup due to [POLLOUT] on fd 66 (10.193.200.6:6642<->10.193.0.102:48794) at ../lib/stream-fd.c:153 (100% CPU usage)
>>>>
>>>> Can anyone provide any guidance on how to run down an issue like this?
>>
>> Regards,
>> Vladislav Odintsov

Regards,
Vladislav Odintsov