Thanks Numan. In my production setup with OpenStack, I am running
ovn-northd and the NB and SB DBs on a single VM (OVN 22.03.3). The reason
for evaluating this cluster-based OVN setup is the 100% CPU utilization we
sometimes see in production, which causes connectivity issues between
Neutron and ovn-controller.

2024-10-17T11:00:53.934Z|114596|poll_loop|INFO|wakeup due to 1-ms timeout at northd/inc-proc-northd.c:279 (59% CPU usage)
2024-10-17T11:00:53.942Z|114597|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
2024-10-17T11:00:54.254Z|114598|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
2024-10-17T11:00:54.572Z|114599|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (59% CPU usage)
2024-10-17T11:05:20.624Z|114600|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (66% CPU usage)
2024-10-17T11:05:20.954Z|114601|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (66% CPU usage)
2024-10-17T11:06:00.607Z|114602|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (86% CPU usage)
2024-10-17T11:06:02.607Z|114603|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (86% CPU usage)
2024-10-17T11:16:00.629Z|114604|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
2024-10-17T11:16:01.629Z|114605|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
2024-10-17T11:16:01.954Z|114606|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
2024-10-17T11:16:02.282Z|114607|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (<->/var/run/ovn/ovnsb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
2024-10-17T11:16:02.282Z|114608|poll_loop|INFO|wakeup due to [POLLIN] on fd 3 (<->/var/run/ovn/ovnnb_db.sock) at lib/stream-fd.c:157 (77% CPU usage)
2024-10-17T11:16:02.282Z|114609|poll_loop|INFO|wakeup due to 0-ms timeout at northd/inc-proc-northd.c:279 (77% CPU usage)
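
To see which source dominates these wakeups, the poll_loop entries can be tallied with a short script. The following is an illustrative Python sketch (not part of OVN) that parses log lines in the format shown above, one entry per line, and counts wakeups per source:

```python
import re
from collections import Counter

# Illustrative sketch: tally ovn-northd poll_loop wakeups by source.
# The log format matches the excerpt above (one entry per line).
LOG_RE = re.compile(
    r"\|poll_loop\|INFO\|wakeup due to "
    r"(?:\[POLLIN\] on fd \d+ \((?P<target>[^)]+)\)|\d+-ms timeout)"
)

def tally_wakeups(lines):
    """Count wakeups per source: a socket target, or 'timeout'."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            counts[m.group("target") or "timeout"] += 1
    return counts
```

Run over the full ovn-northd.log, this shows at a glance whether the SB DB socket, the NB DB socket, or internal timeouts account for most of the wakeups.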

Does moving to a clustered setup help with such issues?

On Wed, Oct 16, 2024 at 7:53 PM Numan Siddique <num...@ovn.org> wrote:

> On Wed, Oct 16, 2024 at 9:12 AM Ammad Syed via discuss
> <ovs-discuss@openvswitch.org> wrote:
> >
> > Hi,
> >
> > I am testing a 3-node OVN cluster with SSL on OVN 24.04.2.
> >
> > These are the OVN options I have set on node 1.
> >
> > OVN_CTL_OPTS=" \
> >   --db-nb-create-insecure-remote=no \
> >   --db-sb-create-insecure-remote=no \
> >   --db-nb-addr=172.16.60.40 \
> >   --db-sb-addr=172.16.60.40 \
> >   --db-nb-cluster-local-addr=172.16.60.40 \
> >   --db-nb-cluster-local-proto=ssl \
> >   --db-sb-cluster-local-addr=172.16.60.40 \
> >   --db-sb-cluster-local-proto=ssl \
> >   --ovn-northd-nb-db=ssl:172.16.60.40:6641 \
> >   --ovn-northd-sb-db=ssl:172.16.60.40:6642 \
> >   --ovn-northd-nb-db=ssl:172.16.60.40:6641,ssl:172.16.60.41:6641,ssl:172.16.60.42:6641 \
> >   --ovn-northd-sb-db=ssl:172.16.60.40:6642,ssl:172.16.60.41:6642,ssl:172.16.60.42:6642 \
> >   --ovn-nb-db-ssl-key=/etc/ovn/ovn-cert/ovnnb-privkey.pem \
> >   --ovn-nb-db-ssl-cert=/etc/ovn/ovn-cert/ovnnb-cert.pem \
> >   --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> >   --ovn-sb-db-ssl-key=/etc/ovn/ovn-cert/ovnsb-privkey.pem \
> >   --ovn-sb-db-ssl-cert=/etc/ovn/ovn-cert/ovnsb-cert.pem \
> >   --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> >   --ovn-northd-ssl-key=/etc/ovn/ovn-cert/ovnnorthd-privkey.pem \
> >   --ovn-northd-ssl-cert=/etc/ovn/ovn-cert/ovnnorthd-cert.pem \
> >   --ovn-northd-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> > "
> >
> > On the second and third nodes I have used the options below.
> >
> > OVN_CTL_OPTS=" \
> >   --db-nb-create-insecure-remote=no \
> >   --db-sb-create-insecure-remote=no \
> >   --db-nb-addr=172.16.60.41 \
> >   --db-sb-addr=172.16.60.41 \
> >   --db-nb-cluster-local-addr=172.16.60.41 \
> >   --db-nb-cluster-local-proto=ssl \
> >   --db-sb-cluster-local-addr=172.16.60.41 \
> >   --db-sb-cluster-local-proto=ssl \
> >   --db-nb-cluster-remote-addr=172.16.60.40 \
> >   --db-nb-cluster-remote-proto=ssl \
> >   --db-sb-cluster-remote-addr=172.16.60.40 \
> >   --db-sb-cluster-remote-proto=ssl \
> >   --ovn-northd-nb-db=ssl:172.16.60.40:6641,ssl:172.16.60.41:6641,ssl:172.16.60.42:6641 \
> >   --ovn-northd-sb-db=ssl:172.16.60.40:6642,ssl:172.16.60.41:6642,ssl:172.16.60.42:6642 \
> >   --ovn-nb-db-ssl-key=/etc/ovn/ovn-cert/ovnnb-privkey.pem \
> >   --ovn-nb-db-ssl-cert=/etc/ovn/ovn-cert/ovnnb-cert.pem \
> >   --ovn-nb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> >   --ovn-sb-db-ssl-key=/etc/ovn/ovn-cert/ovnsb-privkey.pem \
> >   --ovn-sb-db-ssl-cert=/etc/ovn/ovn-cert/ovnsb-cert.pem \
> >   --ovn-sb-db-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> >   --ovn-northd-ssl-key=/etc/ovn/ovn-cert/ovnnorthd-privkey.pem \
> >   --ovn-northd-ssl-cert=/etc/ovn/ovn-cert/ovnnorthd-cert.pem \
> >   --ovn-northd-ssl-ca-cert=/etc/ovn/ovn-cert/cacert.pem \
> >   --ovn-northd-nb-db=ssl:172.16.60.41:6641 \
> >   --ovn-northd-sb-db=ssl:172.16.60.41:6642 \
> > "
> >
> > Here is the cluster status.
> >
> > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > db6a
> > Name: OVN_Northbound
> > Cluster ID: 5502 (5502d208-61dc-4eee-bd15-dc0dc52bf379)
> > Server ID: db6a (db6a618a-bf77-4f46-b08d-ebf15d538ee5)
> > Address: ssl:172.16.60.42:6643
> > Status: cluster member
> > Role: leader
> > Term: 12
> > Leader: self
> > Vote: self
> >
> > Last Election started 3584828 ms ago, reason: leadership_transfer
> > Last Election won: 3584825 ms ago
> > Election timer: 1000
> > Log: [2, 17]
> > Entries not yet committed: 0
> > Entries not yet applied: 0
> > Connections: ->f588 ->1902 <-f588 <-1902
> > Disconnections: 3
> > Servers:
> >     f588 (f588 at ssl:172.16.60.40:6643) next_index=17 match_index=16 last msg 75 ms ago
> >     db6a (db6a at ssl:172.16.60.42:6643) (self) next_index=15 match_index=16
> >     1902 (1902 at ssl:172.16.60.41:6643) next_index=17 match_index=16 last msg 75 ms ago
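
As an aside, the match_index values in the Servers section can be checked programmatically. The following is a small illustrative Python sketch (not an OVN tool) that parses cluster/status output in the format shown above and reports whether all members agree on match_index, i.e. whether the Raft log is fully replicated:

```python
import re

# Illustrative sketch: parse the Servers lines of `cluster/status`
# output (format as shown above) and compare match_index across members.
SERVER_RE = re.compile(
    r"(?P<sid>\w+) \(\w+ at (?P<addr>[^)]+)\)(?: \(self\))?"
    r".*?match_index=(?P<match>\d+)"
)

def match_indexes(status_text):
    """Map each server ID to its reported match_index."""
    return {m.group("sid"): int(m.group("match"))
            for m in SERVER_RE.finditer(status_text)}

def fully_replicated(status_text):
    """True if all servers report the same match_index."""
    return len(set(match_indexes(status_text).values())) == 1
```

If the indexes stay divergent for a long time, a member is falling behind on replication.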
> >
> > The issue is that I am continuously seeing the logs below on the follower instances.
> >
> > 2024-10-16T13:05:51.106Z|03078|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:05:51.106Z|03079|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2024-10-16T13:05:51.106Z|03080|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:05:51.107Z|03081|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> > 2024-10-16T13:05:59.116Z|03082|reconnect|INFO|ssl:172.16.60.41:6641: connected
> > 2024-10-16T13:05:59.118Z|03083|reconnect|INFO|ssl:172.16.60.41:6642: connected
> > 2024-10-16T13:05:59.118Z|03084|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:05:59.119Z|03085|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2024-10-16T13:05:59.119Z|03086|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:05:59.119Z|03087|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> > 2024-10-16T13:06:07.130Z|03088|reconnect|INFO|ssl:172.16.60.41:6641: connected
> > 2024-10-16T13:06:07.131Z|03089|reconnect|INFO|ssl:172.16.60.41:6642: connected
> > 2024-10-16T13:06:07.132Z|03090|ovsdb_cs|INFO|ssl:172.16.60.41:6641: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:06:07.132Z|03091|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
> > 2024-10-16T13:06:07.133Z|03092|ovsdb_cs|INFO|ssl:172.16.60.41:6642: clustered database server is not cluster leader; trying another server
> > 2024-10-16T13:06:07.133Z|03093|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
> >
> > These logs appear on the second and third nodes, which are the followers.
> > When I reboot node 1, Raft elects a new leader (in my case node 3 was
> > selected) and these logs disappear from node 3. When node 1 comes back as
> > a follower, it does not show these logs.
> >
> > Is there anything to be concerned about, or is this normal?
>
> These logs are from ovn-northd, and it looks like you're running 3
> instances of ovn-northd. All 3 instances connect to the leaders of both
> the NB and SB DB clusters, and only one will be active while the other
> two remain on standby. All 3 ovn-northd instances try to acquire an
> OVSDB lock from the SB DB cluster leader, and only one gets it. I think
> the logs you're seeing are normal.
>
> Thanks
> Numan
>
> >
> > --
> > Regards,
> >
> > Ammad
> > _______________________________________________
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>


-- 
Regards,


Syed Ammad Ali
