On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> Hi all,
> 
> I have an OVN Central cluster where the leader of the ovsdb NB started
> to use 100% of CPU load most of the time:
> 
> 206 root      20   0   11.6g   4.7g   7172 R 106.7   0.3   2059:59
> ovsdb-server -vconsole:off -vfile:info
> --log-file=/var/log/ovn/ovsdb-server-nb.log
> 
> While it is at 100% CPU, read and write operations on the NB cluster
> are impacted. Debugging during one of these CPU spikes, I can see a
> jsonrpc reply to a member of the cluster with a size of 460 MB,
> almost the same size as the NB database. I set up an
> ovn-fake-multinode cluster and imported this database there, and the
> behavior is still the same.
> At least the leader is not changing frequently, since the election
> timer is set to 60 seconds.
> I have also already tested with OVN 24.03 with no luck; same behavior.

Hi Tiago,

So if I understand correctly, a non-leader member of the raft cluster
regularly requests the whole database content.
How often does that happen, and can you correlate it with anything on
that non-leader member? Maybe that member crashes or gets restarted for
some reason?
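
One way to look for a correlation (a sketch; the control socket path is
an assumption based on the log paths earlier in this thread) is to
temporarily raise the log level for the raft and jsonrpc modules on the
leader and watch the log for snapshot installs or very large replies:

```shell
# Hypothetical debugging step: bump the raft and jsonrpc vlog modules to
# debug level in the log file on the NB leader. Prints a note instead if
# the daemon is not reachable on this host.
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl vlog/set raft:file:dbg jsonrpc:file:dbg 2>/dev/null \
  || echo "ovnnb_db.ctl not reachable on this host"
```

Remember to set the levels back to info afterwards; debug logging on a
busy ovsdb-server is expensive.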

Note that the OVN version does not necessarily tell us anything about
the OVS version, and the OVS version is what provides the ovsdb-server
code. So that version would be interesting as well.
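
For reference, you can read the version straight from the running
database server (socket path assumed from the log paths above; adjust to
your deployment):

```shell
# Query the running ovsdb-server for its version via its control socket;
# fall back to the binary's --version, then to a note if neither works.
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl version 2>/dev/null \
  || ovsdb-server --version 2>/dev/null \
  || echo "ovsdb-server not reachable on this host"
```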

> 
> The coverage figures are not so clear to me:
> # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=6087dcfb:
> raft_entry_serialize       0.0/sec     0.000/sec        0.0000/sec   total: 59
> hmap_pathological          5.8/sec     3.667/sec        3.5750/sec   total: 585411
> hmap_expand              79729.0/sec 53153.200/sec    51825.3172/sec   total: 8484601546
> hmap_reserve               0.0/sec     0.000/sec        0.0000/sec   total: 48
> lockfile_lock              0.0/sec     0.000/sec        0.0000/sec   total: 1
> poll_create_node           3.6/sec     4.317/sec        4.4372/sec   total: 3587083
> poll_zero_timeout          0.6/sec     0.150/sec        0.1286/sec   total: 105735
> seq_change                 0.6/sec     0.417/sec        0.4158/sec   total: 375960
> pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 4
> stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 3
> unixctl_received           0.0/sec     0.017/sec        0.0003/sec   total: 11
> unixctl_replied            0.0/sec     0.017/sec        0.0003/sec   total: 11
> util_xalloc              3427998.6/sec 2285349.950/sec 1035236.3394/sec   total: 364876387809
> 100 events never hit
> 
> Do you guys have any other way to debug it?

Can you share the cluster status of both the leader and the node that
always requests the database? Maybe that helps.
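
For example (socket path assumed from the log paths above; run this on
each member):

```shell
# Dump the raft cluster status for the NB database on this member; it
# shows the role, term, current leader, and connections to the other
# servers. Prints a note instead if the daemon is not reachable here.
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound 2>/dev/null \
  || echo "ovnnb_db.ctl not reachable on this host"
```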


Thanks a lot,
Felix

> 
> Regards,
> 
> Tiago Pires
> 
> _______________________________________________
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
