On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner
<felix.huettner@stackit.cloud> wrote:
>
> On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> > Hi all,
> >
> > I have an OVN Central cluster where the leader of the ovsdb NB started
> > to use 100% of CPU load most of the time:
> >
> > 206 root      20   0   11.6g   4.7g   7172 R 106.7   0.3   2059:59
> > ovsdb-server -vconsole:off -vfile:info
> > --log-file=/var/log/ovn/ovsdb-server-nb.log
> >
> > While the CPU is at 100%, read and write operations on the NB cluster
> > are impacted. Debugging during one of these CPU spikes, I can see a
> > jsonrpc reply to a member of the cluster with a size of 460 MB, almost
> > the same size as the NB database. I set up an ovn-fake-multinode
> > cluster and imported this database there, and the behavior is the same.
> > At least the leader is not changing frequently, since the election
> > timer is set to 60 seconds.
> > I have also tested with OVN 24.03 and no luck, same behavior.
>
> Hi Tiago,
>
> So, if I understand correctly, a non-leader member of the raft cluster
> regularly requests the whole database content.
> How often does that happen, and can you correlate it with anything on
> that non-leader member? Maybe that member crashes or gets restarted for
> some reason?
>
> Note that the OVN version does not necessarily say anything about the
> OVS version; it is the OVS version that provides the ovsdb-server code.
> So that version would be interesting as well.
>

Hi Felix,

You understood it correctly: in this scenario both non-leader members of
the raft cluster request the whole database content. On the leader, the
large jsonrpc reply can go to either non-leader, and it happens roughly
every 10 seconds.
I checked the non-leaders and their ovsdb-server processes are not
crashing or getting restarted.
The OVS version tested is 3.3.4.
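
In case it is useful to others following along, this is roughly how I
raised the log verbosity to spot the large jsonrpc replies. The `jsonrpc`
and `raft` vlog module names are standard in ovsdb-server; the socket and
log paths match our deployment and may differ in yours:

```shell
# Raise log verbosity for the jsonrpc and raft modules on the NB leader
# (ctl socket path as in our deployment; adjust to yours).
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set jsonrpc:file:dbg
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set raft:file:dbg

# Watch the NB log for large replies / raft traffic:
tail -f /var/log/ovn/ovsdb-server-nb.log | grep -Ei 'jsonrpc|raft'

# Restore the default level afterwards, so the log file does not
# grow unbounded at debug verbosity:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set jsonrpc:file:info
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set raft:file:info
```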

> >
> > The coverage figures are not very clear to me:
> > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=6087dcfb:
> > raft_entry_serialize       0.0/sec     0.000/sec        0.0000/sec   total: 59
> > hmap_pathological          5.8/sec     3.667/sec        3.5750/sec   total: 585411
> > hmap_expand              79729.0/sec 53153.200/sec    51825.3172/sec   total: 8484601546
> > hmap_reserve               0.0/sec     0.000/sec        0.0000/sec   total: 48
> > lockfile_lock              0.0/sec     0.000/sec        0.0000/sec   total: 1
> > poll_create_node           3.6/sec     4.317/sec        4.4372/sec   total: 3587083
> > poll_zero_timeout          0.6/sec     0.150/sec        0.1286/sec   total: 105735
> > seq_change                 0.6/sec     0.417/sec        0.4158/sec   total: 375960
> > pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 4
> > stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 3
> > unixctl_received           0.0/sec     0.017/sec        0.0003/sec   total: 11
> > unixctl_replied            0.0/sec     0.017/sec        0.0003/sec   total: 11
> > util_xalloc              3427998.6/sec 2285349.950/sec  1035236.3394/sec   total: 364876387809
> > 100 events never hit
> >
> > Do you guys have any other way to debug it?
>
> Can you share the cluster status of both the leader and the node that
> always requests the database? maybe that helps.
>
Below is the cluster status from each node:

#leader
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9944
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
Address: ssl:170.168.0.4:6643
Status: cluster member
Role: leader
Term: 9
Leader: self
Vote: self

Last Election started 66169932 ms ago, reason: leadership_transfer
Last Election won: 66169930 ms ago
Election timer: 60000
Log: [66, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-6aee <-7b89 ->7b89
Disconnections: 1
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
    6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago

#non-leader 1
# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
6aee
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
Address: ssl:170.168.0.2:6643
Status: cluster member
Role: follower
Term: 9
Leader: 9944
Vote: 9944

Last Election started 66336770 ms ago, reason: leadership_transfer
Last Election won: 66336767 ms ago
Election timer: 60000
Log: [67, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-7b89 ->7b89 <-9944 ->9944
Disconnections: 0
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
    6aee (6aee at ssl:170.168.0.2:6643) (self)
    7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago

#non-leader 2
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
7b89
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
Address: ssl:170.168.0.3:6643
Status: cluster member
Role: follower
Term: 9
Leader: 9944
Vote: 9944

Election timer: 60000
Log: [66, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-6aee ->9944 <-9944
Disconnections: 1
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
    6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) (self)
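
If it helps, I can also collect memory and compaction data from the
leader; something along these lines (socket path as in our deployment;
`memory/show` and `ovsdb-server/compact` are standard ovsdb-server
unixctl commands):

```shell
# Memory usage of the NB ovsdb-server (cells, monitors, sessions, etc.):
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show

# Manually compact (snapshot) the NB database to rule out an oversized
# on-disk log as a factor:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl ovsdb-server/compact OVN_Northbound
```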

Thank you

Regards,

Tiago Pires

>
> Thanks a lot,
> Felix
>
> >
> > Regards,
> >
> > Tiago Pires
> >
> > _______________________________________________
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
