On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> Hi all,
>
> I have an OVN Central cluster where the leader of the ovsdb NB started
> to use 100% of CPU load most of the time:
>
>   206 root  20  0  11.6g  4.7g  7172 R 106.7  0.3  2059:59
>   ovsdb-server -vconsole:off -vfile:info
>   --log-file=/var/log/ovn/ovsdb-server-nb.log
>
> While at 100% CPU, the read and write operations of the NB cluster are
> impacted. Debugging during one of these CPU spikes, I can see a jsonrpc
> reply to a member of the cluster with a size of 460MB, almost the same
> size as the NB database. I set up an ovn-fake-multinode cluster and
> imported this database there, and the behavior is still the same.
> At least the leader is not changing frequently, since the election
> timer is set to 60 seconds.
> I have also tested with OVN 24.03, and no luck: same behavior.
Hi Tiago,

So if I understand correctly, a non-leader member of the raft cluster
regularly requests the whole database content. How often does that
happen, and can you correlate it with anything on that non-leader
member? Maybe that member crashes or gets restarted for some reason?

Note that the OVN version does not necessarily say anything about the
OVS version, and it is the OVS version that provides the ovsdb-server
code. So that version would be interesting as well.

> The coverage figures are not so clear to me:
>
> # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=6087dcfb:
> raft_entry_serialize        0.0/sec        0.000/sec        0.0000/sec  total: 59
> hmap_pathological           5.8/sec        3.667/sec        3.5750/sec  total: 585411
> hmap_expand             79729.0/sec    53153.200/sec    51825.3172/sec  total: 8484601546
> hmap_reserve                0.0/sec        0.000/sec        0.0000/sec  total: 48
> lockfile_lock               0.0/sec        0.000/sec        0.0000/sec  total: 1
> poll_create_node            3.6/sec        4.317/sec        4.4372/sec  total: 3587083
> poll_zero_timeout           0.6/sec        0.150/sec        0.1286/sec  total: 105735
> seq_change                  0.6/sec        0.417/sec        0.4158/sec  total: 375960
> pstream_open                0.0/sec        0.000/sec        0.0000/sec  total: 4
> stream_open                 0.0/sec        0.000/sec        0.0000/sec  total: 3
> unixctl_received            0.0/sec        0.017/sec        0.0003/sec  total: 11
> unixctl_replied             0.0/sec        0.017/sec        0.0003/sec  total: 11
> util_xalloc           3427998.6/sec  2285349.950/sec  1035236.3394/sec  total: 364876387809
> 100 events never hit
>
> Do you guys have any other way to debug it?

Can you share the cluster status of both the leader and the node that
always requests the database? Maybe that helps.

Thanks a lot,
Felix

> Regards,
>
> Tiago Pires
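P.S. For completeness: the raft status can be dumped on each member over the
same control socket you used for coverage/show above, and the ovsdb-server
version is printed by the server binary itself. The snippet below lists those
commands (they need a live cluster, so they are shown as comments) plus an
illustrative way to pull the interesting fields out of a saved status dump;
the sample text is made up, not real output from your cluster:

```shell
# Commands to run on each cluster member (require a live deployment):
#   ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
#   ovsdb-server --version

# Illustrative: extract the fields worth comparing across members from a
# saved dump. The sample below is made up; the field names (Role, Term,
# Leader) follow the cluster/status output.
status='Role: follower
Term: 4
Leader: 1b2a'
printf '%s\n' "$status" \
    | awk -F': ' '/^(Role|Term|Leader):/ {printf "%s=%s\n", $1, $2}'
```

Comparing Role, Term, and Leader across all three members at roughly the same
time should show whether one member keeps falling behind and re-joining.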
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss