Re: [ovs-discuss] OVN ovsdb NB using 100% CPU

Alin Serdean via discuss Wed, 26 Feb 2025 10:42:42 -0800

AFAIR, that also verifies the integrity of the DB.

The only commands I can think of that might trim the content are backup
and restore.
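
As a side note on reading the cluster/status output quoted below: here is a
minimal Python sketch, not an official OVS tool, that parses that text and
derives two numbers from it. The function names (parse_cluster_status,
heartbeat_interval_ms, silent_followers) are made up for illustration, the
sample is the leader's output from this thread, abridged and with the wrapped
lines rejoined, and the election_timer/3 heartbeat interval is only Felix's
reading of the code quoted below, so treat it as an assumption.

```python
import re

# Leader output from this thread, abridged, with wrapped lines rejoined.
SAMPLE = """
Name: OVN_Northbound
Role: leader
Term: 9
Election timer: 60000
Log: [66, 67]
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
    6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago
"""


def parse_cluster_status(text):
    """Turn cluster/status text into a dict; server rows go under 'servers'."""
    status = {"servers": []}
    in_servers = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Servers:":
            in_servers = True
            continue
        if in_servers:
            # Example row: "6aee (6aee at ssl:...:6643) ... last msg 7857 ms ago"
            row = re.match(r"(\w{4}) \(\w{4} at (\S+?)\)", line)
            if row:
                age = re.search(r"last msg (\d+) ms ago", line)
                status["servers"].append({
                    "sid": row.group(1),
                    "address": row.group(2),
                    "self": "(self)" in line,
                    "last_msg_ms": int(age.group(1)) if age else None,
                })
            continue
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def heartbeat_interval_ms(status):
    """Assumed heartbeat period: election timer divided by three."""
    return int(status["Election timer"]) // 3


def silent_followers(status):
    """On a leader, list servers silent for longer than the election timer."""
    if status.get("Role") != "leader":
        return []
    timer_ms = int(status["Election timer"])
    return [s["sid"] for s in status["servers"]
            if not s["self"]
            and s["last_msg_ms"] is not None
            and s["last_msg_ms"] > timer_ms]


if __name__ == "__main__":
    parsed = parse_cluster_status(SAMPLE)
    print("heartbeat every", heartbeat_interval_ms(parsed), "ms")
    print("silent followers:", silent_followers(parsed))
```

The check is only meant for the leader's output. The follower outputs in this
thread show very large "last msg" ages for the other follower, presumably
because followers do not normally exchange messages with each other, so
applying the same threshold there would be misleading.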



On Wed, Feb 26, 2025 at 7:32 PM Tiago Pires <[email protected]>
wrote:

> On Wed, Feb 26, 2025 at 3:23 PM Alin Serdean <[email protected]>
> wrote:
> >
> > Hi Tiago,
> >
> > If the cluster is accessible you can use something like the following:
> >
> > ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
>
> Alin,
>
> I mean, how can I check whether the NB database content itself is what is
> breaking the NB ovsdb-server and making it use 100% CPU?
>
> Regards,
>
> Tiago Pires
>
>
> > Please note that the control socket may be in a different location in
> > your case.
> >
> > Alin.
> >
> > On Wed, Feb 26, 2025 at 6:48 PM Tiago Pires via discuss <
> [email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> I have an update: if I start in standalone mode using this database, it
> >> comes up in less than 30 seconds.
> >> But if I try to use the database in cluster mode, it takes more than 20
> >> minutes to come up, and after that the leader stays at 100% load.
> >> In cluster mode, ovnnb_db.ctl and ovnnb_db.sock are not created at the
> >> moment the NB ovsdb-server is started; they are created only after 5
> >> minutes.
> >> It seems something in the DB is wrong and makes the NB ovsdb-server take
> >> this long to start.
> >>
> >> Is there a way to check if the database is healthy?
> >>
> >> Regards,
> >>
> >> Tiago Pires
> >>
> >> On Tue, Feb 25, 2025 at 12:17 PM Tiago Pires <[email protected]>
> wrote:
> >> >
> >> > On Tue, Feb 25, 2025 at 11:43 AM Felix Huettner
> >> > <[email protected]> wrote:
> >> > >
> >> > > On Tue, Feb 25, 2025 at 11:24:34AM -0300, Tiago Pires via discuss
> wrote:
> >> > > > Hi Felix,
> >> > >
> >> > > Hi Tiago,
> >> > >
> >> > > >
> >> > > > The local leader logs these append messages before sending the reply
> >> > > > (the next log message below):
> >> > > > 2025-02-25T14:03:47.776Z|00764|jsonrpc|DBG|ssl:170.168.0.X:39452: send
> >> > > > notification, method="append_request",
> >> > > > params=[{"cluster":"56b3aab6-476f-4ce1-96b9-1588dd4176c9","comment":"heartbeat","from":"11a8329d-bb6f-4e76-849b-090be09c030d","leader_commit":57,"log":[],"prev_log_index":57,"prev_log_term":2,"term":2,"to":"3967f0d3-ed57-4433-861b-e1548d78639f"}]
> >> > >
> >> > > So if I read the code correctly, this is a regular Raft heartbeat,
> >> > > which is sent every "election_timer/3". Based on the election timer
> >> > > you provided below (60000 ms), that should then happen every 20
> >> > > seconds.
> >> > >
> >> > > >
> >> > > > Here is the beginning of the reply message, which contains the whole
> >> > > > database:
> >> > > > 2025-02-25T14:03:53.469Z|00765|jsonrpc|DBG|ssl:170.168.0.X:51420: send
> >> > > > reply, result=[false,"0e044970-54c2-4918-8496-0ad6bb3d5f45",{"ACL":{"001cc1e8-2b1c-4935-addf-6ca20ad45e21":{"initial":{"action":"allow-related",
> >> > >
> >> > > I assume the IP is the same as the one in the log above? So the same
> >> > > system, just with different ports?
> >> > >
> >> > > If yes, then maybe this is some other process on the remote host that,
> >> > > for whatever reason, dumps the whole northbound database?
> >> > >
> >> > > At least with the "initial" string in there, it looks like the
> >> > > response to a monitor request that has just been sent. If the monitor
> >> > > has no filtering, it would get the whole database.
> >> > >
> >> >
> >> > Hi Felix,
> >> >
> >> > The IP belongs to a non-leader member of the cluster that is running
> >> > the regular OVN cluster processes; the log output is from an
> >> > ovn-fake-multinode setup that I built to reproduce the issue.
> >> > So it is a fresh setup using the DBs that show this strange behavior. I
> >> > have already destroyed and recreated this setup a few times, with the
> >> > same behavior.
> >> > If you have time, I can share the DB with you so you can take a look.
> >> > The DB is from a lab environment (non-sensitive data), but I'm afraid
> >> > this could happen in a production environment without us knowing what
> >> > is happening.
> >> >
> >> > Let me know if you agree with that.
> >> >
> >> > Thanks for your time.
> >> >
> >> > Tiago Pires
> >> >
> >> > > So I would propose checking whether this IP and port are the same
> >> > > every time, and then finding out what process that actually is.
> >> > >
> >> > > >
> >> > > > Could it be something to investigate?
> >> > > >
> >> > > > Regards,
> >> > > >
> >> > > > Tiago Pires
> >> > > >
> >> > > > On Tue, Feb 25, 2025 at 10:26 AM Tiago Pires <
> [email protected]> wrote:
> >> > > > >
> >> > > > > On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner
> >> > > > > <[email protected]> wrote:
> >> > > > > >
> >> > > > > > On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via
> discuss wrote:
> >> > > > > > > Hi all,
> >> > > > > > >
> >> > > > > > > I have an OVN Central cluster where the NB ovsdb-server leader
> >> > > > > > > started to use 100% CPU most of the time:
> >> > > > > > >
> >> > > > > > > 206 root      20   0   11.6g   4.7g   7172 R 106.7   0.3   2059:59
> >> > > > > > > ovsdb-server -vconsole:off -vfile:info
> >> > > > > > > --log-file=/var/log/ovn/ovsdb-server-nb.log
> >> > > > > > >
> >> > > > > > > While it is at 100% CPU, read and write operations on the NB
> >> > > > > > > cluster are impacted. Debugging while the CPU load is elevated, I
> >> > > > > > > can see a jsonrpc reply to a member of the cluster with a size of
> >> > > > > > > 460 MB, almost the same size as the NB database. I set up an
> >> > > > > > > ovn-fake-multinode cluster and imported this database there, and
> >> > > > > > > the behavior is still the same.
> >> > > > > > > At least the leader is not changing frequently, since the election
> >> > > > > > > timer is set to 60 seconds.
> >> > > > > > > I have already tested with OVN 24.03 with no luck: same behavior.
> >> > > > > >
> >> > > > > > Hi Tiago,
> >> > > > > >
> >> > > > > > So if I understand correctly, a non-leader member of the Raft
> >> > > > > > cluster regularly requests the whole database content.
> >> > > > > > How often does that happen, and can you correlate it with anything
> >> > > > > > on that non-leader member? Maybe that member crashes or gets
> >> > > > > > restarted for some reason?
> >> > > > > >
> >> > > > > > Note that the OVN version does not necessarily say anything about
> >> > > > > > the OVS version, and the OVS version is what provides the
> >> > > > > > ovsdb-server code. So that version would be interesting as well.
> >> > > > > >
> >> > > > >
> >> > > > > Hi Felix,
> >> > > > >
> >> > > > > You got it right; in this scenario it is both non-leader members of
> >> > > > > the Raft cluster.
> >> > > > > On the leader, the jsonrpc reply can go to either non-leader, and it
> >> > > > > happens about every 10 seconds.
> >> > > > > I checked the non-leaders, and their ovsdb processes are not
> >> > > > > crashing or getting restarted.
> >> > > > > The OVS version tested is 3.3.4.
> >> > > > >
> >> > > > > > >
> >> > > > > > > The coverage figures are not very clear to me:
> >> > > > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> >> > > > > > > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=6087dcfb:
> >> > > > > > > raft_entry_serialize         0.0/sec       0.000/sec        0.0000/sec   total: 59
> >> > > > > > > hmap_pathological            5.8/sec       3.667/sec        3.5750/sec   total: 585411
> >> > > > > > > hmap_expand              79729.0/sec   53153.200/sec    51825.3172/sec   total: 8484601546
> >> > > > > > > hmap_reserve                 0.0/sec       0.000/sec        0.0000/sec   total: 48
> >> > > > > > > lockfile_lock                0.0/sec       0.000/sec        0.0000/sec   total: 1
> >> > > > > > > poll_create_node             3.6/sec       4.317/sec        4.4372/sec   total: 3587083
> >> > > > > > > poll_zero_timeout            0.6/sec       0.150/sec        0.1286/sec   total: 105735
> >> > > > > > > seq_change                   0.6/sec       0.417/sec        0.4158/sec   total: 375960
> >> > > > > > > pstream_open                 0.0/sec       0.000/sec        0.0000/sec   total: 4
> >> > > > > > > stream_open                  0.0/sec       0.000/sec        0.0000/sec   total: 3
> >> > > > > > > unixctl_received             0.0/sec       0.017/sec        0.0003/sec   total: 11
> >> > > > > > > unixctl_replied              0.0/sec       0.017/sec        0.0003/sec   total: 11
> >> > > > > > > util_xalloc            3427998.6/sec 2285349.950/sec  1035236.3394/sec   total: 364876387809
> >> > > > > > > 100 events never hit
> >> > > > > > >
> >> > > > > > > Do you guys have any other way to debug it?
> >> > > > > >
> >> > > > > > Can you share the cluster status of both the leader and the node
> >> > > > > > that always requests the database? Maybe that helps.
> >> > > > > >
> >> > > > > Below is the cluster status from each node:
> >> > > > >
> >> > > > > #leader
> >> > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 9944
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
> >> > > > > Address: ssl:170.168.0.4:6643
> >> > > > > Status: cluster member
> >> > > > > Role: leader
> >> > > > > Term: 9
> >> > > > > Leader: self
> >> > > > > Vote: self
> >> > > > >
> >> > > > > Last Election started 66169932 ms ago, reason: leadership_transfer
> >> > > > > Last Election won: 66169930 ms ago
> >> > > > > Election timer: 60000
> >> > > > > Log: [66, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: ->0000 <-6aee <-7b89 ->7b89
> >> > > > > Disconnections: 1
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago
> >> > > > >
> >> > > > > #non-leader 1
> >> > > > > # ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 6aee
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
> >> > > > > Address: ssl:170.168.0.2:6643
> >> > > > > Status: cluster member
> >> > > > > Role: follower
> >> > > > > Term: 9
> >> > > > > Leader: 9944
> >> > > > > Vote: 9944
> >> > > > >
> >> > > > > Last Election started 66336770 ms ago, reason: leadership_transfer
> >> > > > > Last Election won: 66336767 ms ago
> >> > > > > Election timer: 60000
> >> > > > > Log: [67, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: <-7b89 ->7b89 <-9944 ->9944
> >> > > > > Disconnections: 0
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) (self)
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago
> >> > > > >
> >> > > > > #non-leader 2
> >> > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 7b89
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
> >> > > > > Address: ssl:170.168.0.3:6643
> >> > > > > Status: cluster member
> >> > > > > Role: follower
> >> > > > > Term: 9
> >> > > > > Leader: 9944
> >> > > > > Vote: 9944
> >> > > > >
> >> > > > > Election timer: 60000
> >> > > > > Log: [66, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: ->0000 <-6aee ->9944 <-9944
> >> > > > > Disconnections: 1
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) (self)
> >> > >
> >> > > From my perspective, all of these look like a normal, healthy cluster.
> >> > >
> >> > > Let's see if the above helps in any way.
> >> > >
> >> > > Thanks,
> >> > > Felix
> >> > >
> >> > > > >
> >> > > > > Thank you
> >> > > > >
> >> > > > > Regards,
> >> > > > >
> >> > > > > Tiago Pires
> >> > > > >
> >> > > > > >
> >> > > > > > Thanks a lot,
> >> > > > > > Felix
> >> > > > > >
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > >
> >> > > > > > > Tiago Pires
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > _‘This message is intended only for the addresses listed in the
> >> > > > > > > initial header. If you are not listed among the addresses in the
> >> > > > > > > header, we ask that you completely disregard the content of this
> >> > > > > > > message; copying, forwarding, and/or carrying out the actions it
> >> > > > > > > mentions are immediately void and prohibited’._
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > * **‘Although Magazine Luiza takes
> >> > > > > > > all reasonable precautions to ensure that no virus is present in
> >> > > > > > > this e-mail, the company cannot accept responsibility for any
> >> > > > > > > loss or damage caused by this e-mail or by its attachments’.*
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > _______________________________________________
> >> > > > > > > discuss mailing list
> >> > > > > > > [email protected]
> >> > > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >> > > >
> >>
>

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
