On 10/17/24 13:29, Ammad Syed via discuss wrote:
> Thanks Numan, in my production setup with openstack, I am running ovn northd, 
> sb and nb db on a single VM (22.03.3). The reason for evaluating this 
> cluster-based ovn setup is to reduce the 100% CPU utilization we sometimes 
> see in production, which causes issues with neutron and ovn-controller 
> connectivity.
> 
> 2024-10-17T11:00:53.934Z|114596|poll_loop|INFO|wakeup due to 1-ms timeout at 
> northd/inc-proc-northd.c:279 (59% CPU usage)
> 2024-10-17T11:00:53.942Z|114597|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
> 2024-10-17T11:00:54.254Z|114598|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 
> (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
> 2024-10-17T11:00:54.572Z|114599|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
> 2024-10-17T11:05:20.624Z|114600|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (66% CPU usage)
> 2024-10-17T11:05:20.954Z|114601|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (66% CPU usage)
> 2024-10-17T11:06:00.607Z|114602|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (86% CPU usage)
> 2024-10-17T11:06:02.607Z|114603|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (86% CPU usage)
> 2024-10-17T11:16:00.629Z|114604|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
> 2024-10-17T11:16:01.629Z|114605|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
> 2024-10-17T11:16:01.954Z|114606|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 
> (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
> 2024-10-17T11:16:02.282Z|114607|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
> 2024-10-17T11:16:02.282Z|114608|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 
> (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
> 2024-10-17T11:16:02.282Z|114609|poll_loop|INFO|wakeup due to 0-ms timeout at 
> northd/inc-proc-northd.c:279 (77% CPU usage)
> 
> Does moving to a clustered setup help with such issues?


The logs above are from the ovn-northd process.  As Numan said, only one
ovn-northd process will be active at any given time regardless of the
setup, so the load on this process will not change.  The main reasons
to move to a clustered setup are fault tolerance and reducing the load
on the southbound database server.  It will not help with the load on
northd.
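As a quick sanity check, each northd instance reports its role over the
unixctl interface; a sketch (socket paths may differ depending on your
packaging):

```shell
# Ask the local ovn-northd for its role; exactly one node in the
# cluster should report "Status: active", the rest "Status: standby".
# If ovn-appctl cannot locate the control socket automatically, point
# it at the file explicitly, e.g. -t /var/run/ovn/ovn-northd.<pid>.ctl.
ovn-appctl -t ovn-northd status
```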

I'd suggest upgrading from 22.03 to 24.03.  This may have a huge
impact on performance, as northd received a lot of optimizations
between these versions.  Moving to 24.09 to get more incremental
processing might also be worth considering.

In general, in 22.03, most northbound database changes trigger a full
recompute of the contents of the southbound database.  So, if the
databases are large, you'll frequently see 100% CPU utilization on
northd.  That's normal.  Later versions like 24.09 have much more
incremental processing capability and so should consume much less CPU,
but even just upgrading to the 24.03 LTS should noticeably reduce the
load.
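If you want to see how much incremental processing is actually kicking
in after an upgrade, recent releases expose the I-P engine's counters
over unixctl (assumption: your build includes the inc-engine commands):

```shell
# Dump incremental engine statistics from ovn-northd (recent OVN
# releases).  High recompute counts relative to compute counts show
# which inputs still force a full recomputation.
ovn-appctl -t ovn-northd inc-engine/show-stats
```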

Best regards, Ilya Maximets.

> 
> On Wed, Oct 16, 2024 at 7:53 PM Numan Siddique <num...@ovn.org> wrote:
> 
>     On Wed, Oct 16, 2024 at 9:12 AM Ammad Syed via discuss
>     <ovs-discuss@openvswitch.org> wrote:
>     >
>     > Hi,
>     >
>     > I am testing ovn 3 nodes clustering with ssl setup on ovn 24.04.2.
>     >
>     > These are ovn options that I have set on node 1.
>     >
>     > OVN_CTL_OPTS=" \
>     >   --db-nb-create-insecure-remote=no \
>     >   --db-sb-create-insecure-remote=no \
>     >   --db-nb-addr=172.16.60.40 \
>     >   --db-sb-addr=172.16.60.40 \
>     >   --db-nb-cluster-local-addr=172.16.60.40 \
>     >   --db-nb-cluster-local-proto=ssl \
>     >   --db-sb-cluster-local-addr=172.16.60.40 \
>     >   --db-sb-cluster-local-proto=ssl \
>     >   --ovn-northd-nb-db=ssl:172.16.60.40:6641 \
>     >   --ovn-northd-sb-db=ssl:172.16.60.40:6642 \
>     >   --ovn-northd-nb-db=ssl:172.16.60.40:6641,ssl:172.16.60.41:6641,ssl:172.16.60.42:6641 \
>     >   --ovn-northd-sb-db=ssl:172.16.60.40:6642,ssl:172.16.60.41:6642,ssl:172.16.60.42:6642 \
>     >   --ovn-nb-db-ssl-key=/etc/ovn/ovn-cert/ovnnb-privkey.pem \
>     >   --ovn-nb-db-ssl-cert=/etc/ovn/ovn-cert/ovnnb-cert.pem \
>     >   --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     >   --ovn-sb-db-ssl-key=/etc/ovn/ovn-cert/ovnsb-privkey.pem \
>     >   --ovn-sb-db-ssl-cert=/etc/ovn/ovn-cert/ovnsb-cert.pem \
>     >   --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     >   --ovn-northd-ssl-key=/etc/ovn/ovn-cert/ovnnorthd-privkey.pem \
>     >   --ovn-northd-ssl-cert=/etc/ovn/ovn-cert/ovnnorthd-cert.pem \
>     >   --ovn-northd-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     > "
>     >
>     > On the second and third node I have used below options.
>     >
>     > OVN_CTL_OPTS=" \
>     >   --db-nb-create-insecure-remote=no \
>     >   --db-sb-create-insecure-remote=no \
>     >   --db-nb-addr=172.16.60.41 \
>     >   --db-sb-addr=172.16.60.41 \
>     >   --db-nb-cluster-local-addr=172.16.60.41 \
>     >   --db-nb-cluster-local-proto=ssl \
>     >   --db-sb-cluster-local-addr=172.16.60.41 \
>     >   --db-sb-cluster-local-proto=ssl \
>     >   --db-nb-cluster-remote-addr=172.16.60.40 \
>     >   --db-nb-cluster-remote-proto=ssl \
>     >   --db-sb-cluster-remote-addr=172.16.60.40 \
>     >   --db-sb-cluster-remote-proto=ssl \
>     >   --ovn-northd-nb-db=ssl:172.16.60.40:6641,ssl:172.16.60.41:6641,ssl:172.16.60.42:6641 \
>     >   --ovn-northd-sb-db=ssl:172.16.60.40:6642,ssl:172.16.60.41:6642,ssl:172.16.60.42:6642 \
>     >   --ovn-nb-db-ssl-key=/etc/ovn/ovn-cert/ovnnb-privkey.pem \
>     >   --ovn-nb-db-ssl-cert=/etc/ovn/ovn-cert/ovnnb-cert.pem \
>     >   --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     >   --ovn-sb-db-ssl-key=/etc/ovn/ovn-cert/ovnsb-privkey.pem \
>     >   --ovn-sb-db-ssl-cert=/etc/ovn/ovn-cert/ovnsb-cert.pem \
>     >   --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     >   --ovn-northd-ssl-key=/etc/ovn/ovn-cert/ovnnorthd-privkey.pem \
>     >   --ovn-northd-ssl-cert=/etc/ovn/ovn-cert/ovnnorthd-cert.pem \
>     >   --ovn-northd-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
>     >   --ovn-northd-nb-db=ssl:172.16.60.41:6641 \
>     >   --ovn-northd-sb-db=ssl:172.16.60.41:6642 \
>     > "
>     >
>     > Here is the cluster status.
>     >
>     > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
>     > db6a
>     > Name: OVN_Northbound
>     > Cluster ID: 5502 (5502d208-61dc-4eee-bd15-dc0dc52bf379)
>     > Server ID: db6a (db6a618a-bf77-4f46-b08d-ebf15d538ee5)
>     > Address: ssl:172.16.60.42:6643
>     > Status: cluster member
>     > Role: leader
>     > Term: 12
>     > Leader: self
>     > Vote: self
>     >
>     > Last Election started 3584828 ms ago, reason: leadership_transfer
>     > Last Election won: 3584825 ms ago
>     > Election timer: 1000
>     > Log: [2, 17]
>     > Entries not yet committed: 0
>     > Entries not yet applied: 0
>     > Connections: ->f588 ->1902 <-f588 <-1902
>     > Disconnections: 3
>     > Servers:
>     >     f588 (f588 at ssl:172.16.60.40:6643) next_index=17 match_index=16 last msg 75 ms ago
>     >     db6a (db6a at ssl:172.16.60.42:6643) (self) next_index=15 match_index=16
>     >     1902 (1902 at ssl:172.16.60.41:6643) next_index=17 match_index=16 last msg 75 ms ago
>     >
>     > The issue is I am seeing below logs in follower instances continuously.
>     >
>     > 2024-10-16T13:05:51.106Z|03078|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:05:51.106Z|03079|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     > 2024-10-16T13:05:51.106Z|03080|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:05:51.107Z|03081|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     > 2024-10-16T13:05:59.116Z|03082|reconnect|INFO|ssl:172.16.60.41:6641: connected
>     > 2024-10-16T13:05:59.118Z|03083|reconnect|INFO|ssl:172.16.60.41:6642: connected
>     > 2024-10-16T13:05:59.118Z|03084|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:05:59.119Z|03085|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     > 2024-10-16T13:05:59.119Z|03086|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:05:59.119Z|03087|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     > 2024-10-16T13:06:07.130Z|03088|reconnect|INFO|ssl:172.16.60.41:6641: connected
>     > 2024-10-16T13:06:07.131Z|03089|reconnect|INFO|ssl:172.16.60.41:6642: connected
>     > 2024-10-16T13:06:07.132Z|03090|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:06:07.132Z|03091|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     > 2024-10-16T13:06:07.133Z|03092|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
>     > 2024-10-16T13:06:07.133Z|03093|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >
>     > These logs are being received on the second and third nodes, which are 
>     > followers. When I reboot node 1, RAFT elects a new leader (in my case 
>     > node 3 was selected) and these logs disappear from node 3. When node 1 
>     > comes back as a follower, it does not have these logs.
>     >
>     > Is there anything to be concerned about, or is it normal?
> 
>     These logs are from ovn-northd, and it looks like you're running 3
>     instances of ovn-northd.  All 3 instances connect to the leaders of
>     both the NB and SB DB clusters, and only one will be active while
>     the other two stay on standby.  All 3 ovn-northd instances try to
>     acquire an OVSDB lock from the SB DB cluster leader, and only one
>     will get it.  I think the logs you're seeing are normal.
> 
>     Thanks
>     Numan
> 
>     >
>     > --
>     > Regards,
>     >
>     > Ammad
>     > _______________________________________________
>     > discuss mailing list
>     > disc...@openvswitch.org
>     > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> 
> 
> 
> -- 
> Regards,
> 
> 
> Syed Ammad Ali
> 

