Hi Ilya, let me jump into this thread.
Right now I’m debugging the behaviour of OVN (22.09.x) and ovsdb-server 3.1.0, where a single OVSDB update3 notification makes ovsdb-server pin a CPU at 100% for 5-6 minutes. This ovsdb-server acts as a relay for the OVN Southbound DB with only 5-6 clients connected to it (ovn-controllers on chassis acting as central chassis for external access, with ha_group enabled for edge LRs). During this time the relay fails to answer OVSDB inactivity probes, so clients and even the upstream ovsdb-servers disconnect the relay on the 60-second probe timeout. All probe intervals are configured to 60 seconds (ovsdb-server SB cluster <-> ovsdb SB relay <-> ovn-controller).

Earlier I posted a long-read with some of these problems listed [1].

IIUC, this update is generated by ovn-northd after one LS with only one LSP of type router and an attached LB is removed. You can see the request JSON here: [2]. Such updates appear not only when an LS/LB is removed but also on some other operations; this is just an example. So it seems that ovn-northd re-creates a big datapath group, and such an update is for some reason expensive for the ovsdb relay to handle (the ovn-controllers also go to 100% CPU).

Have you seen such behaviour? Do you have any suggestions about the cause and a possible fix for such a huge load from a single update3? Thanks.

1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403699.html
2: https://gist.github.com/odivlad/bba4443e589a268a0f389c2972511df3

> On 2 May 2023, at 14:49, Ilya Maximets via discuss
> <ovs-discuss@openvswitch.org> wrote:
>
> From my side, the first option would be increasing the inactivity
> probe on the ovn-controller side and seeing if that resolves the issue.
> Deployments typically have 60+ seconds set, just in case.
>
> Also, if you're not already using the latest versions of OVS/OVN, an
> upgrade may resolve the issue as well. For example, OVS 2.17 provides
> a big performance improvement over previous versions, and 3.0 and 3.1
> give even more on top. And with new OVN releases, the southbound
> database size usually goes down significantly, reducing the load on
> the OVSDB server. I'd suggest using releases after OVN 22.09 for
> large-scale deployments.
>
> However, if your setup has only one switch with 250 ports and you
> have an issue, that should not really be related to scale, and you
> need to investigate further what exactly is happening.
>
> Best regards, Ilya Maximets.
>
> On 5/2/23 08:58, Felix Hüttner via discuss wrote:
>> Hi Gavin,
>>
>> we saw similar issues after reaching a certain number of hypervisors.
>> This happened because our ovsdb processes ran at 100% CPU utilization
>> (and they are not multithreaded).
>>
>> Our solutions were:
>>
>> 1. If you use SSL on your north-/southbound DB, disable it and add a
>> TLS-terminating reverse proxy (like traefik) in front.
>> 2. Increase the inactivity probe significantly (you might need to
>> change it on both the ovn-controller and ovsdb side, not sure anymore).
>> 3. Introduce ovsdb relays and connect the ovn-controllers there.
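(For reference, since probe tuning keeps coming up in this thread: the 60-second probes I mentioned above are set on both ends of each connection. A minimal sketch, assuming a single row in the SB Connection table; values are in milliseconds, and record selectors may differ in your deployment:

  # On every hypervisor: probe interval ovn-controller uses towards
  # the SB DB or relay it is connected to.
  ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000

  # On the SB DB side: probe interval ovsdb-server uses towards its
  # clients, stored in the Connection table.
  ovn-sbctl set connection . inactivity_probe=60000

If the relay also takes its listener from the database, e.g. --remote=db:OVN_Southbound,SB_Global,connections, the same Connection row covers the relay's client-facing probes as well.)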
>>
>> --
>>
>> Felix Huettner
>>
>> From: discuss <ovs-discuss-boun...@openvswitch.org> On Behalf Of Gavin McKee via discuss
>> Sent: Monday, May 1, 2023 9:20 PM
>> To: ovs-discuss <ovs-discuss@openvswitch.org>
>> Subject: [ovs-discuss] CPU pinned at 100%, ovn-controller to ovnsb_db unstable
>>
>> Hi,
>>
>> I'm having a pretty bad issue with ovn-controller on the hypervisors being unable to connect to the OVN SB DB:
>>
>> 2023-05-01T19:13:33.969Z|00541|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:13:33.969Z|00542|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:13:43.043Z|00543|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:13:56.115Z|00544|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:13:56.115Z|00545|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:14:36.177Z|00546|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:14:44.996Z|00547|jsonrpc|WARN|tcp:10.193.1.2:6642: receive error: Connection reset by peer
>> 2023-05-01T19:14:44.996Z|00548|reconnect|WARN|tcp:10.193.1.2:6642: connection dropped (Connection reset by peer)
>> 2023-05-01T19:15:44.131Z|00549|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:15:54.137Z|00550|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:15:54.137Z|00551|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>> 2023-05-01T19:16:02.184Z|00552|reconnect|INFO|tcp:10.193.1.2:6642: connected
>> 2023-05-01T19:16:14.488Z|00553|reconnect|ERR|tcp:10.193.1.2:6642: no response to inactivity probe after 5 seconds, disconnecting
>> 2023-05-01T19:16:14.488Z|00554|reconnect|INFO|tcp:10.193.1.2:6642: connection dropped
>>
>> This happened after pushing a configuration to the northbound DB for around 250 logical switch ports.
>>
>> Once I turn on the VMs, everything goes bad very quickly:
>>
>> 2023-05-01T04:27:09.294Z|01947|poll_loop|INFO|wakeup due to [POLLOUT] on fd 66 (10.193.200.6:6642<->10.193.0.102:48794) at ../lib/stream-fd.c:153 (100% CPU usage)
>>
>> Can anyone provide any guidance on how to run down an issue like this?
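To the original question on how to run an issue like this down: the first thing I check is whether the SB ovsdb-server process itself is the one pinned at 100%, and how much monitor state it is carrying. A rough sketch (the control-socket path below is the usual packaging default for the OVN SB DB and may differ on your system):

  # Is ovsdb-server the process eating the CPU? pidof may return
  # several PIDs if the NB and SB servers run on the same host.
  top -b -n1 -p "$(pidof ovsdb-server | tr ' ' ',')"

  # Memory report of the SB ovsdb-server; the monitors/sessions
  # counters show how much monitor state it maintains.
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

  # Internal event counters; rerun to see which ones are growing.
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show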
Regards,
Vladislav Odintsov

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss