On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner <felix.huettner@stackit.cloud> wrote:
>
> On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> > Hi all,
> >
> > I have an OVN Central cluster where the leader of the ovsdb NB started
> > to use 100% of CPU most of the time:
> >
> > 206 root 20 0 11.6g 4.7g 7172 R 106.7 0.3 2059:59
> > ovsdb-server -vconsole:off -vfile:info
> > --log-file=/var/log/ovn/ovsdb-server-nb.log
> >
> > While the CPU is at 100%, read and write operations on the NB cluster
> > are impacted. Debugging during one of these CPU spikes, I can see a
> > jsonrpc reply to a member of the cluster with a size of 460MB, almost
> > the same size as the NB database. I set up an ovn-fake-multinode
> > cluster and imported this database there, and the behavior is still
> > the same.
> > At least the leader is not changing frequently, since the election
> > timer is set to 60 secs.
> > I have already tested with OVN 24.03 and no luck, same behavior.
>
> Hi Tiago,
>
> so if I get that correctly, a non-leader member of the raft cluster
> regularly requests the whole database content.
> How often does that happen, and can you correlate that with anything on
> that non-leader member? Maybe that member crashes or gets restarted for
> some reason?
>
> Note that the OVN version does not necessarily say anything about the
> OVS version. And the OVS version is what provides the code of the ovsdb
> server. So that version would be interesting as well.
>
Hi Felix,

You understood correctly. In this scenario it is both non-leader members of
the raft cluster. On the leader, the jsonrpc reply goes to both non-leaders,
and it happens roughly every 10 seconds. I checked the non-leaders and their
ovsdb processes are not crashing or being restarted.
The OVS version tested is 3.3.4.

>
> >
> > The coverage figures are not very clear to me:
> > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> > Event coverage, avg rate over last: 5 seconds, last minute, last hour,
> > hash=6087dcfb:
> > raft_entry_serialize        0.0/sec        0.000/sec        0.0000/sec   total: 59
> > hmap_pathological           5.8/sec        3.667/sec        3.5750/sec   total: 585411
> > hmap_expand             79729.0/sec    53153.200/sec    51825.3172/sec   total: 8484601546
> > hmap_reserve                0.0/sec        0.000/sec        0.0000/sec   total: 48
> > lockfile_lock               0.0/sec        0.000/sec        0.0000/sec   total: 1
> > poll_create_node            3.6/sec        4.317/sec        4.4372/sec   total: 3587083
> > poll_zero_timeout           0.6/sec        0.150/sec        0.1286/sec   total: 105735
> > seq_change                  0.6/sec        0.417/sec        0.4158/sec   total: 375960
> > pstream_open                0.0/sec        0.000/sec        0.0000/sec   total: 4
> > stream_open                 0.0/sec        0.000/sec        0.0000/sec   total: 3
> > unixctl_received            0.0/sec        0.017/sec        0.0003/sec   total: 11
> > unixctl_replied             0.0/sec        0.017/sec        0.0003/sec   total: 11
> > util_xalloc           3427998.6/sec  2285349.950/sec  1035236.3394/sec   total: 364876387809
> > 100 events never hit
> >
> > Do you guys have any other way to debug it?
>
> Can you share the cluster status of both the leader and the node that
> always requests the database? Maybe that helps.
>

Below is the cluster status from each node:

#leader
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9944
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
Address: ssl:170.168.0.4:6643
Status: cluster member
Role: leader
Term: 9
Leader: self
Vote: self

Last Election started 66169932 ms ago, reason: leadership_transfer
Last Election won: 66169930 ms ago
Election timer: 60000
Log: [66, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-6aee <-7b89 ->7b89
Disconnections: 1
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
    6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago

#non-leader 1
# ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
6aee
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
Address: ssl:170.168.0.2:6643
Status: cluster member
Role: follower
Term: 9
Leader: 9944
Vote: 9944

Last Election started 66336770 ms ago, reason: leadership_transfer
Last Election won: 66336767 ms ago
Election timer: 60000
Log: [67, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-7b89 ->7b89 <-9944 ->9944
Disconnections: 0
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
    6aee (6aee at ssl:170.168.0.2:6643) (self)
    7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago

#non-leader 2
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
7b89
Name: OVN_Northbound
Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
Address: ssl:170.168.0.3:6643
Status: cluster member
Role: follower
Term: 9
Leader: 9944
Vote: 9944

Election timer: 60000
Log: [66, 67]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-6aee ->9944 <-9944
Disconnections: 1
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
    6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) (self)

Thank you

Regards,
Tiago Pires

>
> Thanks a lot,
> Felix
>
> >
> > Regards,
> >
> > Tiago Pires
> >
> > --
> >
> > 'This message is intended only for the addresses listed in the initial
> > header. If you are not listed among the addresses in the header, we ask
> > that you completely disregard the content of this message; copying,
> > forwarding and/or carrying out the actions mentioned in it are
> > immediately void and prohibited.'
> >
> > 'Although Magazine Luiza takes all reasonable precautions to ensure
> > that no virus is present in this e-mail, the company cannot accept
> > responsibility for any loss or damage caused by this e-mail or its
> > attachments.'
> >
> > _______________________________________________
> > discuss mailing list
> > disc...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
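
For anyone following along, the coverage/show output quoted earlier in this thread is easier to read when sorted by rate. Below is a minimal sketch, assuming the output has been saved as text with one counter per line in the form `name X/sec Y/sec Z/sec total: N`. The function names `parse_coverage` and `busiest` are hypothetical helpers written for this illustration and are not part of OVS; the sample lines are copied from the output above.

```python
import re

# Matches one counter line of `coverage/show` output, for example:
#   hmap_expand 79729.0/sec 53153.200/sec 51825.3172/sec total: 8484601546
# Header lines and "N events never hit" do not match and are skipped.
LINE_RE = re.compile(
    r"^\s*(?P<name>\w+)\s+"
    r"(?P<s5>[\d.]+)/sec\s+(?P<m1>[\d.]+)/sec\s+(?P<h1>[\d.]+)/sec\s+"
    r"total:\s+(?P<total>\d+)\s*$"
)


def parse_coverage(text):
    """Return one dict per counter line found in `text` (hypothetical helper)."""
    counters = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counters.append({
                "name": m["name"],
                "per_sec_5s": float(m["s5"]),
                "per_sec_1m": float(m["m1"]),
                "per_sec_1h": float(m["h1"]),
                "total": int(m["total"]),
            })
    return counters


def busiest(counters, key="per_sec_5s", top=3):
    """Return the `top` counters with the highest value for `key`."""
    return sorted(counters, key=lambda c: c[key], reverse=True)[:top]


# A few lines taken from the coverage/show output in this thread.
SAMPLE = """\
raft_entry_serialize 0.0/sec 0.000/sec 0.0000/sec total: 59
hmap_pathological 5.8/sec 3.667/sec 3.5750/sec total: 585411
hmap_expand 79729.0/sec 53153.200/sec 51825.3172/sec total: 8484601546
util_xalloc 3427998.6/sec 2285349.950/sec 1035236.3394/sec total: 364876387809
100 events never hit
"""

if __name__ == "__main__":
    for c in busiest(parse_coverage(SAMPLE)):
        print(f"{c['name']:<24} {c['per_sec_5s']:>14.1f}/sec  total={c['total']}")
```

On the sample above this ranks `util_xalloc` and `hmap_expand` first, which only says that the process is allocating memory and growing hash maps heavily; it does not by itself identify the cause.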