AFAIR, that also verifies the integrity of the DB. The only commands I can think of that might trim the content are backup and restore.
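For a quick first look at Raft health, the `cluster/status` output quoted further down can be checked with a small script. This is only a minimal sketch and not an official tool: the names `parse_cluster_status` and `summarize` and the `stale_peers` rule are my own assumptions, the sample is the leader output from this thread with the wrapped lines re-joined, and the "heartbeat every election_timer/3" figure is Felix's reading of the code as quoted below.

```python
import re

# Leader output from this thread, trimmed to the fields used here and with
# mail-client line wraps re-joined.
SAMPLE = """\
Name: OVN_Northbound
Role: leader
Term: 9
Election timer: 60000
Entries not yet committed: 0
Entries not yet applied: 0
Servers:
    9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
    6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
    7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago
"""

# "Key: value" lines we care about.
FIELD_RE = re.compile(
    r"^(Role|Term|Election timer|Entries not yet committed|"
    r"Entries not yet applied): (.+)$"
)
# Server lines that report when the last message from that peer arrived.
PEER_RE = re.compile(r"^(\w{4}) \(\w{4} at \S+\).*last msg (\d+) ms ago$")


def parse_cluster_status(text):
    """Collect selected fields and per-peer 'last msg' ages (hypothetical helper)."""
    info = {"peers_last_msg_ms": {}}
    for raw in text.splitlines():
        line = raw.strip()
        field = FIELD_RE.match(line)
        if field:
            info[field.group(1)] = field.group(2)
            continue
        peer = PEER_RE.match(line)
        if peer:
            info["peers_last_msg_ms"][peer.group(1)] = int(peer.group(2))
    return info


def summarize(info):
    """Derive a few health hints; the thresholds are assumptions, not OVS rules."""
    timer_ms = int(info["Election timer"])
    return {
        "role": info["Role"],
        # Heartbeat interval as described in the thread: election_timer / 3.
        "heartbeat_ms": timer_ms // 3,
        "uncommitted": int(info["Entries not yet committed"]),
        "unapplied": int(info["Entries not yet applied"]),
        # Assumption: a peer silent for longer than the election timer is suspicious.
        "stale_peers": sorted(
            peer
            for peer, age in info["peers_last_msg_ms"].items()
            if age > timer_ms
        ),
    }


if __name__ == "__main__":
    print(summarize(parse_cluster_status(SAMPLE)))
```

On a live node the text would come from the `ovs-appctl ... cluster/status OVN_Northbound` command quoted below. The follower outputs in this thread show very old `last msg` values for the other follower and were still judged healthy, so the `stale_peers` hint is probably only meaningful on the leader. None of this inspects the database content itself, so it cannot explain the 100% CPU on its own.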
On Wed, Feb 26, 2025 at 7:32 PM Tiago Pires <tiago.pi...@luizalabs.com> wrote:

> On Wed, Feb 26, 2025 at 3:23 PM Alin Serdean <alinserd...@gmail.com> wrote:
> >
> > Hi Tiago,
> >
> > If the cluster is accessible you can use something like the following:
> >
> > ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
>
> Alin,
>
> I mean, how to check if the NB database content is not breaking the
> ovsdb-server NB, making it to use 100% CPU.
>
> Regards,
>
> Tiago Pires
> >
> > please note that the location might be in a different location in your case.
> >
> > Alin.
> >
> > On Wed, Feb 26, 2025 at 6:48 PM Tiago Pires via discuss <ovs-discuss@openvswitch.org> wrote:
> >>
> >> Hi all,
> >>
> >> I have an update, If I start in standalone mode using this database,
> >> that comes up in less than 30 seconds.
> >> But if I try to use the database in cluster mode it takes more than 20
> >> minutes to come up and after that the leader will remain in 100% of
> >> load.
> >> In cluster mode, the ovnnb_db.ctl and ovnnb_db.sock are not created at
> >> the moment that ovsdb-server NB is started, it is created only after 5
> >> minutes.
> >> It seems something on the DB is wrong and makes the OVSDB NB take this
> >> long time to start.
> >>
> >> Is there a way to check if the database is healthy?
> >>
> >> Regards,
> >>
> >> Tiago Pires
> >>
> >> On Tue, Feb 25, 2025 at 12:17 PM Tiago Pires <tiago.pi...@luizalabs.com> wrote:
> >> >
> >> > On Tue, Feb 25, 2025 at 11:43 AM Felix Huettner
> >> > <felix.huettner@stackit.cloud> wrote:
> >> > >
> >> > > On Tue, Feb 25, 2025 at 11:24:34AM -0300, Tiago Pires via discuss wrote:
> >> > > > Hi Felix,
> >> > >
> >> > > Hi Tiago,
> >> > >
> >> > > >
> >> > > > The local leader has these append messages before send the reply(next
> >> > > > below log message):
> >> > > > 2025-02-25T14:03:47.776Z|00764|jsonrpc|DBG|ssl:170.168.0.X:39452: send
> >> > > > notification, method="append_request",
> >> > > > params=[{"cluster":"56b3aab6-476f-4ce1-96b9-1588dd4176c9","comment":"heartbeat","from":"11a8329d-bb6f-4e76-849b-090be09c030d","leader_commit":57,"log":[],"prev_log_index":57,"prev_log_term":2,"term":2,"to":"3967f0d3-ed57-4433-861b-e1548d78639f"}]
> >> > >
> >> > > So if i read the code correctly this is a regular raft heartbeat. That
> >> > > is send every "election_timer/3". Based on the election timer you
> >> > > provided below that should then happen every 20 seconds.
> >> > >
> >> > > >
> >> > > > Here is the initial of the message with the whole database as answer:
> >> > > > 2025-02-25T14:03:53.469Z|00765|jsonrpc|DBG|ssl:170.168.0.X:51420: send
> >> > > > reply, result=[false,"0e044970-54c2-4918-8496-0ad6bb3d5f45",{"ACL":{"001cc1e8-2b1c-4935-addf-6ca20ad45e21":{"initial":{"action":"allow-related",
> >> > >
> >> > > I assume the IP is the same as the one in the log above? So same system
> >> > > just with different ports?
> >> > >
> >> > > If yes then maybe this is some other process on the remote host that for
> >> > > whatever reason dumps the whole northbound database?
> >> > >
> >> > > At least with the "initial" string in there it looks like a monitor
> >> > > request that has just been sent. If it has no filtering it would get the
> >> > > whole database.
> >> > >
> >> >
> >> > Hi Felix,
> >> >
> >> > The IP is from a non-leader member of the cluster that is running the
> >> > regular OVN cluster processes, the output log is from an
> >> > ovn-fake-multinode setup that I set up to reproduce the issue.
> >> > So it is a fresh setup using the DBs that has this strange behavior. I
> >> > already destroyed and recreated this setup a few times with the same
> >> > behavior.
> >> > Maybe if you have time I can share the DB with you for you to take a
> >> > look, the DB is from a Lab env(non-sensitive data) but I'm afraid that
> >> > it can happen in a production env without knowing what is happening.
> >> >
> >> > Let me know if you agree with that.
> >> >
> >> > Thanks for your time.
> >> >
> >> > Tiago Pires
> >> >
> >> > > So i would propose to search if this IP and port are the same all the
> >> > > time and then find out what process that actually is.
> >> > >
> >> > > >
> >> > > > Could it be something to investigate?
> >> > > >
> >> > > > Regards,
> >> > > >
> >> > > > Tiago Pires
> >> > > >
> >> > > > On Tue, Feb 25, 2025 at 10:26 AM Tiago Pires <tiago.pi...@luizalabs.com> wrote:
> >> > > > >
> >> > > > > On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner
> >> > > > > <felix.huettner@stackit.cloud> wrote:
> >> > > > > >
> >> > > > > > On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> >> > > > > > > Hi all,
> >> > > > > > >
> >> > > > > > > I have an OVN Central cluster where the leader of the ovsdb NB started
> >> > > > > > > to use 100% of CPU load most of the time:
> >> > > > > > >
> >> > > > > > > 206 root 20 0 11.6g 4.7g 7172 R 106.7 0.3 2059:59
> >> > > > > > > ovsdb-server -vconsole:off -vfile:info
> >> > > > > > > --log-file=/var/log/ovn/ovsdb-server-nb.log
> >> > > > > > >
> >> > > > > > > While in 100% of CPU the read and write operations of the NB cluster
> >> > > > > > > is impacted.
> >> > > > > > > Doing a debug when there is this increase of CPU load, I
> >> > > > > > > can see a jsonrpc reply to a member of the cluster with the size of
> >> > > > > > > 460MB, almost the same size as the NB database. I set up an
> >> > > > > > > ovn-fake-multinode cluster and imported this database there and the
> >> > > > > > > behavior is still the same.
> >> > > > > > > At least the leader is not changing frequently since the election
> >> > > > > > > timer is in 60secs.
> >> > > > > > > And I have already tested with OVN 24.03 and no luck, same behavior.
> >> > > > > >
> >> > > > > > Hi Tiago,
> >> > > > > >
> >> > > > > > so if i get that correctly a non-leader member of the raft cluster
> >> > > > > > regularly requests the whole database content.
> >> > > > > > How often does that happen and can you correlate that with anything on
> >> > > > > > that non-leader member? Maybe that member crashes or gets restarted for
> >> > > > > > some reason?
> >> > > > > >
> >> > > > > > Note that the OVN version does not necessarily say anything about the
> >> > > > > > OVS version. And the ovs version is what provides the code of the ovsdb
> >> > > > > > server. So that version would be interesting as well.
> >> > > > > >
> >> > > > >
> >> > > > > Hi Felix,
> >> > > > >
> >> > > > > You got well, in this scenario both non-leader of the raft cluster.
> >> > > > > In the leader the jsonrpc reply can happen to both non-leader and it
> >> > > > > happens around each 10secs.
> >> > > > > I checked the non-leaders and their ovsdb processes are not crashing
> >> > > > > or getting restarted.
> >> > > > > The OVS version tested is 3.3.4.
> >> > > > >
> >> > > > > > >
> >> > > > > > > The coverage figures are not so well clear to me:
> >> > > > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> >> > > > > > > Event coverage, avg rate over last: 5 seconds, last minute, last hour,
> >> > > > > > > hash=6087dcfb:
> >> > > > > > > raft_entry_serialize        0.0/sec        0.000/sec         0.0000/sec   total: 59
> >> > > > > > > hmap_pathological           5.8/sec        3.667/sec         3.5750/sec   total: 585411
> >> > > > > > > hmap_expand             79729.0/sec    53153.200/sec     51825.3172/sec   total: 8484601546
> >> > > > > > > hmap_reserve                0.0/sec        0.000/sec         0.0000/sec   total: 48
> >> > > > > > > lockfile_lock               0.0/sec        0.000/sec         0.0000/sec   total: 1
> >> > > > > > > poll_create_node            3.6/sec        4.317/sec         4.4372/sec   total: 3587083
> >> > > > > > > poll_zero_timeout           0.6/sec        0.150/sec         0.1286/sec   total: 105735
> >> > > > > > > seq_change                  0.6/sec        0.417/sec         0.4158/sec   total: 375960
> >> > > > > > > pstream_open                0.0/sec        0.000/sec         0.0000/sec   total: 4
> >> > > > > > > stream_open                 0.0/sec        0.000/sec         0.0000/sec   total: 3
> >> > > > > > > unixctl_received            0.0/sec        0.017/sec         0.0003/sec   total: 11
> >> > > > > > > unixctl_replied             0.0/sec        0.017/sec         0.0003/sec   total: 11
> >> > > > > > > util_xalloc           3427998.6/sec  2285349.950/sec   1035236.3394/sec   total: 364876387809
> >> > > > > > > 100 events never hit
> >> > > > > > >
> >> > > > > > > Do you guys have any other way to debug it?
> >> > > > > >
> >> > > > > > Can you share the cluster status of both the leader and the node that
> >> > > > > > always requests the database? maybe that helps.
> >> > > > > >
> >> > > > > Below are the cluster status from each node:
> >> > > > >
> >> > > > > #leader
> >> > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 9944
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
> >> > > > > Address: ssl:170.168.0.4:6643
> >> > > > > Status: cluster member
> >> > > > > Role: leader
> >> > > > > Term: 9
> >> > > > > Leader: self
> >> > > > > Vote: self
> >> > > > >
> >> > > > > Last Election started 66169932 ms ago, reason: leadership_transfer
> >> > > > > Last Election won: 66169930 ms ago
> >> > > > > Election timer: 60000
> >> > > > > Log: [66, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: ->0000 <-6aee <-7b89 ->7b89
> >> > > > > Disconnections: 1
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago
> >> > > > >
> >> > > > > #non-leader 1
> >> > > > > # ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 6aee
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
> >> > > > > Address: ssl:170.168.0.2:6643
> >> > > > > Status: cluster member
> >> > > > > Role: follower
> >> > > > > Term: 9
> >> > > > > Leader: 9944
> >> > > > > Vote: 9944
> >> > > > >
> >> > > > > Last Election started 66336770 ms ago, reason: leadership_transfer
> >> > > > > Last Election won: 66336767 ms ago
> >> > > > > Election timer: 60000
> >> > > > > Log: [67, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: <-7b89 ->7b89 <-9944 ->9944
> >> > > > > Disconnections: 0
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) (self)
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago
> >> > > > >
> >> > > > > #non-leader 2
> >> > > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> >> > > > > 7b89
> >> > > > > Name: OVN_Northbound
> >> > > > > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> >> > > > > Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
> >> > > > > Address: ssl:170.168.0.3:6643
> >> > > > > Status: cluster member
> >> > > > > Role: follower
> >> > > > > Term: 9
> >> > > > > Leader: 9944
> >> > > > > Vote: 9944
> >> > > > >
> >> > > > > Election timer: 60000
> >> > > > > Log: [66, 67]
> >> > > > > Entries not yet committed: 0
> >> > > > > Entries not yet applied: 0
> >> > > > > Connections: ->0000 <-6aee ->9944 <-9944
> >> > > > > Disconnections: 1
> >> > > > > Servers:
> >> > > > >     9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
> >> > > > >     6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
> >> > > > >     7b89 (7b89 at ssl:170.168.0.3:6643) (self)
> >> > >
> >> > > All of these look from my perspective like a normal healthy cluster.
> >> > >
> >> > > Lets see if the above helps in any way.
> >> > >
> >> > > Thanks,
> >> > > Felix
> >> > >
> >> > > > >
> >> > > > > Thank you
> >> > > > >
> >> > > > > Regards,
> >> > > > >
> >> > > > > Tiago Pires
> >> > > > >
> >> > > > > >
> >> > > > > > Thanks a lot,
> >> > > > > > Felix
> >> > > > > >
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > >
> >> > > > > > > Tiago Pires
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > >
> >> > > > > > > _‘This message is intended only for the addresses listed in the
> >> > > > > > > initial header. If you are not listed among the addresses in the
> >> > > > > > > header, we ask that you completely disregard the content of this
> >> > > > > > > message; any copying, forwarding and/or execution of the actions
> >> > > > > > > mentioned in it are immediately void and prohibited’._
> >> > > > > > >
> >> > > > > > > *‘Although Magazine Luiza takes all reasonable precautions to
> >> > > > > > > ensure that no virus is present in this e-mail, the company cannot
> >> > > > > > > accept responsibility for any loss or damage caused by this e-mail
> >> > > > > > > or by its attachments’.*
> >> > > > > > >
> >> > > > > > > _______________________________________________
> >> > > > > > > discuss mailing list
> >> > > > > > > disc...@openvswitch.org
> >> > > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss