I ran "perf record -F99 -p $(pidof ovsdb-server) -- sleep 30" on the ovsdb-server process during the CPU spike. perf report output:
Samples: 2K of event 'cpu-clock', Event count (approx.): 29989898690
  Overhead  Command       Shared Object                Symbol
    58.71%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way
     7.61%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011058
     6.60%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_compare_3way
     5.93%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_match_any_clause
     2.26%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011070
     2.19%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] uuid_compare_3way@plt
     2.16%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ffd
     1.68%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe8
     1.65%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110ec
     1.38%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011084
     1.31%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001100e
     1.25%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011063
     1.08%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way
     1.04%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe0
     0.67%  ovsdb-server  libc-2.17.so                 [.] _int_malloc
     0.54%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105a
     0.51%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_monitor_get_update
     0.47%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_hash
     0.30%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_destroy__
     0.24%  ovsdb-server  libc-2.17.so                 [.] malloc_consolidate
     0.24%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_atom_hash
     0.20%  ovsdb-server  [kernel.kallsyms]            [k] __do_softirq
     0.17%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] json_string
     0.13%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_destroy
     0.13%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe5
     0.10%  ovsdb-server  libc-2.17.so                 [.] __memset_sse2
     0.10%  ovsdb-server  libc-2.17.so                 [.] __strlen_sse2_pminub
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fea
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011020
     0.10%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011067
     0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_compare_3way@plt
     0.07%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_equals
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010ff1
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000011025
     0.07%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001108b
     0.03%  ovsdb-server  [kernel.kallsyms]            [k] __netif_schedule
     0.03%  ovsdb-server  libc-2.17.so                 [.] __strncmp_sse42
     0.03%  ovsdb-server  libc-2.17.so                 [.] _int_free
     0.03%  ovsdb-server  libc-2.17.so                 [.] free
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_clone
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] ovsdb_datum_from_json
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] shash_find
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e2
     0.03%  ovsdb-server  libopenvswitch-3.1.so.0.0.0  [.] 0x000000000018f7e8
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] ovsdb_condition_from_json
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fe1
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x0000000000010fff
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001105e
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110c3
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x00000000000110f2
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c007
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c030
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c05b
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c45f
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c535
     0.03%  ovsdb-server  libovsdb-3.1.so.0.0.0        [.] 0x000000000001c644
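Most of the samples land in uuid_compare_3way and in unresolved 0x... addresses inside libovsdb, so a caller-level view would probably tell more. If it helps, I can re-record with call graphs and debug symbols; roughly something like this (a sketch only — the debuginfo package name is a guess for this build, and the host looks like CentOS/RHEL 7 judging by libc-2.17):

  # Install debug symbols so the 0x... addresses in libovsdb resolve
  # (package name is an assumption; it may differ for your OVS build).
  yum install -y yum-utils
  debuginfo-install -y openvswitch

  # Re-record with call graphs to see who is calling uuid_compare_3way,
  # then show only entries above 1% self overhead.
  perf record -F 99 -g -p $(pidof ovsdb-server) -- sleep 30
  perf report --stdio --no-children --percent-limit 1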
> On 2 May 2023, at 19:45, Ilya Maximets via discuss
> <ovs-discuss@openvswitch.org> wrote:
>
> On 5/2/23 16:45, Vladislav Odintsov wrote:
>> Hi Ilya,
>>
>> let me jump into this thread.
>>
>> Right now I’m debugging the behaviour of OVN (22.09.x) and ovsdb-server
>> 3.1.0, where one ovsdb update3 request makes ovsdb-server, which acts as a
>> relay for the OVN Southbound DB with only 5-6 clients connected to it
>> (ovn-controllers acting as a central chassis for external access, with
>> ha_group enabled for edge LRs), utilize 100% CPU for 5-6 minutes.
>> During this time the ovsdb relay fails to answer ovsdb inactivity probes,
>> and then clients and even the upstream ovsdb-servers disconnect this relay
>> because of the 60 s ping probe timeout. All the probe intervals are
>> configured to 60 seconds (ovsdb-server SB cluster <-> ovsdb SB relay <->
>> ovn-controller). Earlier I’ve posted a long read with some problems listed
>> in [1].
>>
>> IIUC, this update is generated by ovn-northd after one LS with only one LSP
>> of type router and an attached LB is removed.
>> You can see the request json here: [2]
>> Such updates appear not only when an LS/LB is removed but also in some other
>> operations; this is just an example.
>> So it seems like ovn-northd re-creates a big dp group, and such an update is
>> for some reason difficult for the ovsdb relay to handle (actually the
>> ovn-controllers also utilize 100% CPU).
>>
>> Have you seen such behaviour? Maybe you’ve got a suggestion about the
>> reason and a possible fix for such a huge load from one update3?
>
> Such big updates are typically triggered by the fact that
> northd doesn't modify datapath groups. It re-creates them
> instead. I'm actually working on a patch for this since
> last week and hope to post it before the soft freeze for 23.06.
>
> In your case, however, it seems like you have a new datapath
> group created for a load balancer. And it looks like you have
> about 1000 switches in the setup, which is a lot, but...
>
> Looking at the dump [2], it's big for a human, but it's only
> about 100 KB in size. The database shouldn't have a lot of trouble
> processing that, especially in 3.1. So, I'm not sure what
> exactly is going on in your case. If you can profile the relay
> server and see what it is doing all that time, that would be
> helpful.
>
> Best regards, Ilya Maximets.
>
>>
>> Thanks.
>>
>> 1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403699.html
>> 2: https://gist.github.com/odivlad/bba4443e589a268a0f389c2972511df3
>>
>>
>>> On 2 May 2023, at 14:49, Ilya Maximets via discuss
>>> <ovs-discuss@openvswitch.org> wrote:
>>>
>>> From my side, the first option would be increasing the inactivity
>>> probe on the ovn-controller side and seeing if that resolves the issue.
>>> Deployments typically have 60+ seconds set, just in case.
>>>
>>> Also, if you're not already using the latest versions of OVS/OVN, an upgrade
>>> may resolve the issue as well. For example, OVS 2.17 provides a big
>>> performance improvement over previous versions, and 3.0 and 3.1 give
>>> even more on top. And with new OVN releases, the southbound database
>>> size usually goes down significantly, reducing the load on the OVSDB
>>> server. I'd suggest using releases after OVN 22.09 for large-scale
>>> deployments.
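For reference, the 60-second probe intervals discussed above are set roughly as follows (a minimal sketch; it assumes a single Connection record in the Southbound DB and that ovn-controller takes its settings from the local Open_vSwitch table; values are in milliseconds):

  # Southbound DB side: inactivity probe for client connections
  # (assumes exactly one Connection record, hence the "." selector).
  ovn-sbctl set connection . inactivity_probe=60000

  # Hypervisor side: probe interval ovn-controller uses towards the SB DB.
  ovs-vsctl set open_vswitch . external_ids:ovn-remote-probe-interval=60000

The relay's own probe towards its upstream server is a separate setting and is not shown here.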
>>>
>>> However, if your setup has only one switch with 250 ports and you
>>> have an issue, that should not really be related to scale, and you
>>> need to investigate further what exactly is happening.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>> On 5/2/23 08:58, Felix Hüttner via discuss wrote:
>>>> Hi Gavin,
>>>>
>>>> we saw similar issues after reaching a certain number of hypervisors. This
>>>> happened because our ovsdb processes ran at 100% CPU utilization (and they
>>>> are not multithreaded).
>>>>
>>>> Our solutions were:
>>>>
>>>> 1. If you use SSL on your north-/southbound DB, disable it and add a
>>>>    TLS-terminating reverse proxy (like traefik) in front.
>>>> 2. Increase the inactivity probe significantly (you might need to change
>>>>    it on both the ovn-controller and ovsdb side, not sure anymore).
>>>> 3. Introduce ovsdb relays and connect the ovn-controllers there.
>>>>
>>>> --
>>>>
>>>> Felix Huettner
>>>>
>>>> *From:* discuss <ovs-discuss-boun...@openvswitch.org> *On Behalf Of* Gavin
>>>> McKee via discuss
>>>> *Sent:* Monday, May 1, 2023 9:20 PM
>>>> *To:* ovs-discuss <ovs-discuss@openvswitch.org>
>>>> *Subject:* [ovs-discuss] CPU pinned at 100%, ovn-controller to ovnsb_db
>>>> unstable
>>>>
>>>> Hi,
>>>>
>>>> I'm having a pretty bad issue with OVN controller on the hypervisors being
>>>> unable to connect to the OVN SB DB:
>>>>
>>>> 2023-05-01T19:13:33.969Z|00541|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:13:33.969Z|00542|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:13:43.043Z|00543|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:13:56.115Z|00544|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:13:56.115Z|00545|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:14:36.177Z|00546|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:14:44.996Z|00547|jsonrpc|WARN|tcp:10.193.1.2:6642: receive error: Connection reset by peer
>>>> 2023-05-01T19:14:44.996Z|00548|reconnect|WARN|tcp:10.193.1.2:6642: connection dropped (Connection reset by peer)
>>>> 2023-05-01T19:15:44.131Z|00549|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:15:54.137Z|00550|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:15:54.137Z|00551|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>> 2023-05-01T19:16:02.184Z|00552|reconnect|INFO|tcp:10.193.1.2:6642: connected
>>>> 2023-05-01T19:16:14.488Z|00553|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>>>> 2023-05-01T19:16:14.488Z|00554|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>>>
>>>> This happened after pushing a configuration to the north DB for around 250
>>>> logical switch ports.
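As an illustration of option 3 in Felix's list above: an OVSDB relay for the Southbound DB is just another ovsdb-server that follows the existing SB server, with the ovn-controllers re-pointed at it. A minimal sketch, reusing 10.193.1.2:6642 from the logs above; the 16642 listen port and RELAY_IP are placeholders:

  # Run a Southbound relay that maintains a copy of the existing SB database.
  ovsdb-server -vconsole:info --remote=ptcp:16642 \
      relay:OVN_Southbound:tcp:10.193.1.2:6642

  # On each hypervisor, point ovn-controller at the relay instead of the cluster.
  ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:RELAY_IP:16642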
>>>>
>>>> Once I turn on the VMs, everything goes bad very quickly:
>>>>
>>>> 2023-05-01T04:27:09.294Z|01947|poll_loop|INFO|wakeup due to [POLLOUT] on fd 66 (10.193.200.6:6642<->10.193.0.102:48794) at ../lib/stream-fd.c:153 (100% CPU usage)
>>>>
>>>> Can anyone provide any guidance on how to run down an issue like this?
>>
>> Regards,
>> Vladislav Odintsov

Regards,
Vladislav Odintsov