On Tue, Feb 25, 2025 at 11:24:34AM -0300, Tiago Pires via discuss wrote:
> Hi Felix,

Hi Tiago,

> 
> The local leader has these append messages before it sends the reply (next
> log message below):
> 2025-02-25T14:03:47.776Z|00764|jsonrpc|DBG|ssl:170.168.0.X:39452: send
> notification, method="append_request",
> params=[{"cluster":"56b3aab6-476f-4ce1-96b9-1588dd4176c9","comment":"heartbeat","from":"11a8329d-bb6f-4e76-849b-090be09c030d","leader_commit":57,"log":[],"prev_log_index":57,"prev_log_term":2,"term":2,"to":"3967f0d3-ed57-4433-861b-e1548d78639f"}]

So if I read the code correctly, this is a regular raft heartbeat. It is
sent every "election_timer/3". Based on the election timer you provided
below, that should happen every 20 seconds.
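
For reference, the arithmetic with your settings would roughly be:

    60000 ms (election timer) / 3 = 20000 ms

i.e. one heartbeat append_request per follower about every 20 seconds, so a
few of these per minute in the jsonrpc debug log are expected and should not
be related to the CPU spikes.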

> 
> Here is the beginning of the message with the whole database as the answer:
> 2025-02-25T14:03:53.469Z|00765|jsonrpc|DBG|ssl:170.168.0.X:51420: send
> reply, 
> result=[false,"0e044970-54c2-4918-8496-0ad6bb3d5f45",{"ACL":{"001cc1e8-2b1c-4935-addf-6ca20ad45e21":{"initial":{"action":"allow-related",

I assume the IP is the same as the one in the log above? So it is the same
system, just with different ports?

If so, then maybe this is some other process on the remote host that, for
whatever reason, dumps the whole northbound database?

At least with the "initial" string in there, it looks like the reply to a
monitor request that has just been sent. If that request has no filtering,
it would get the whole database.
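
Just as an illustrative sketch (the table and column names below are made up,
not taken from your logs), an OVSDB "monitor" request per RFC 7047 looks
roughly like this:

  {"id":2,"method":"monitor","params":["OVN_Northbound",null,{"ACL":{"columns":["action","match"]}}]}

Every table listed in the request is sent back in full in the first reply,
each row wrapped in "initial", restricted only by whatever columns the client
asked for. A client that registers all tables this way (which IDL-based
clients typically do on connect) therefore pulls the complete database each
time it (re)connects. As far as I know, the extended monitor_cond /
monitor_cond_since variants behave the same way when the client has no
condition or no known transaction id to resume from.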

So I would propose checking whether this IP and port are the same every
time, and then finding out what process that actually is.
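
Something along these lines on the remote host should reveal the owner of
that connection (the port number is just the one from your log excerpt and
will differ per connection):

  # ss -tnp | grep 51420

That should print the PID and program name holding the socket, which you can
then map back to the actual OVSDB client.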

> 
> Could it be something to investigate?
> 
> Regards,
> 
> Tiago Pires
> 
> On Tue, Feb 25, 2025 at 10:26 AM Tiago Pires <tiago.pi...@luizalabs.com> 
> wrote:
> >
> > On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner
> > <felix.huettner@stackit.cloud> wrote:
> > >
> > > On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> > > > Hi all,
> > > >
> > > > I have an OVN Central cluster where the leader of the ovsdb NB started
> > > > to run at 100% CPU most of the time:
> > > >
> > > > 206 root      20   0   11.6g   4.7g   7172 R 106.7   0.3   2059:59
> > > > ovsdb-server -vconsole:off -vfile:info
> > > > --log-file=/var/log/ovn/ovsdb-server-nb.log
> > > >
> > > > While the CPU is at 100%, the read and write operations of the NB cluster
> > > > are impacted. Debugging during one of these spikes in CPU load, I can
> > > > see a jsonrpc reply to a member of the cluster with a size of 460MB,
> > > > almost the same size as the NB database. I set up an ovn-fake-multinode
> > > > cluster and imported this database there, and the behavior is still
> > > > the same.
> > > > At least the leader is not changing frequently, since the election
> > > > timer is set to 60 seconds.
> > > > I have also already tested with OVN 24.03 with no luck; same behavior.
> > >
> > > Hi Tiago,
> > >
> > > So if I understand that correctly, a non-leader member of the raft cluster
> > > regularly requests the whole database content.
> > > How often does that happen, and can you correlate it with anything on
> > > that non-leader member? Maybe that member crashes or gets restarted for
> > > some reason?
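> > >
> > > (As a rough illustration, one way to rule out silent restarts would be to
> > > check the process age on that member, e.g.:
> > >
> > >   # ps -eo pid,etime,cmd | grep ovsdb-server
> > >
> > > If the elapsed time is much shorter than the host uptime, the process has
> > > been restarted at some point.)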
> > >
> > > Note that the OVN version does not necessarily say anything about the
> > > OVS version, and the OVS version is what provides the code of the ovsdb
> > > server. So that version would be interesting as well.
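> > >
> > > (For example, the exact ovsdb-server build can be checked with
> > > "ovsdb-server --version", or against the running NB instance with
> > > "ovs-appctl -t /var/run/ovn/ovnnb_db.ctl version".)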
> > >
> >
> > Hi Felix,
> >
> > You got it right; in this scenario it is both non-leaders of the raft cluster.
> > On the leader, the jsonrpc reply can go to either non-leader, and it
> > happens roughly every 10 seconds.
> > I checked the non-leaders and their ovsdb processes are not crashing
> > or getting restarted.
> > The OVS version tested is 3.3.4.
> >
> > > >
> > > > The coverage figures are not so clear to me:
> > > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> > > > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=6087dcfb:
> > > > raft_entry_serialize       0.0/sec     0.000/sec        0.0000/sec   total: 59
> > > > hmap_pathological          5.8/sec     3.667/sec        3.5750/sec   total: 585411
> > > > hmap_expand              79729.0/sec 53153.200/sec    51825.3172/sec   total: 8484601546
> > > > hmap_reserve               0.0/sec     0.000/sec        0.0000/sec   total: 48
> > > > lockfile_lock              0.0/sec     0.000/sec        0.0000/sec   total: 1
> > > > poll_create_node           3.6/sec     4.317/sec        4.4372/sec   total: 3587083
> > > > poll_zero_timeout          0.6/sec     0.150/sec        0.1286/sec   total: 105735
> > > > seq_change                 0.6/sec     0.417/sec        0.4158/sec   total: 375960
> > > > pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 4
> > > > stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 3
> > > > unixctl_received           0.0/sec     0.017/sec        0.0003/sec   total: 11
> > > > unixctl_replied            0.0/sec     0.017/sec        0.0003/sec   total: 11
> > > > util_xalloc              3427998.6/sec 2285349.950/sec  1035236.3394/sec   total: 364876387809
> > > > 100 events never hit
> > > >
> > > > Do you guys have any other way to debug it?
> > >
> > > Can you share the cluster status of both the leader and the node that
> > > always requests the database? Maybe that helps.
> > >
> > Below is the cluster status output from each node:
> >
> > #leader
> > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > 9944
> > Name: OVN_Northbound
> > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> > Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
> > Address: ssl:170.168.0.4:6643
> > Status: cluster member
> > Role: leader
> > Term: 9
> > Leader: self
> > Vote: self
> >
> > Last Election started 66169932 ms ago, reason: leadership_transfer
> > Last Election won: 66169930 ms ago
> > Election timer: 60000
> > Log: [66, 67]
> > Entries not yet committed: 0
> > Entries not yet applied: 0
> > Connections: ->0000 <-6aee <-7b89 ->7b89
> > Disconnections: 1
> > Servers:
> >     9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
> >     6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66
> > last msg 7857 ms ago
> >     7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66
> > last msg 7857 ms ago
> >
> > #non-leader 1
> > # ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > 6aee
> > Name: OVN_Northbound
> > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> > Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
> > Address: ssl:170.168.0.2:6643
> > Status: cluster member
> > Role: follower
> > Term: 9
> > Leader: 9944
> > Vote: 9944
> >
> > Last Election started 66336770 ms ago, reason: leadership_transfer
> > Last Election won: 66336767 ms ago
> > Election timer: 60000
> > Log: [67, 67]
> > Entries not yet committed: 0
> > Entries not yet applied: 0
> > Connections: <-7b89 ->7b89 <-9944 ->9944
> > Disconnections: 0
> > Servers:
> >     9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
> >     6aee (6aee at ssl:170.168.0.2:6643) (self)
> >     7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago
> >
> > #non-leader 2
> > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> > 7b89
> > Name: OVN_Northbound
> > Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> > Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
> > Address: ssl:170.168.0.3:6643
> > Status: cluster member
> > Role: follower
> > Term: 9
> > Leader: 9944
> > Vote: 9944
> >
> > Election timer: 60000
> > Log: [66, 67]
> > Entries not yet committed: 0
> > Entries not yet applied: 0
> > Connections: ->0000 <-6aee ->9944 <-9944
> > Disconnections: 1
> > Servers:
> >     9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
> >     6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
> >     7b89 (7b89 at ssl:170.168.0.3:6643) (self)

All of these look, from my perspective, like a normal healthy cluster.

Let's see if the above helps in any way.

Thanks,
Felix

> >
> > Thank you
> >
> > Regards,
> >
> > Tiago Pires
> >
> > >
> > > Thanks a lot,
> > > Felix
> > >
> > > >
> > > > Regards,
> > > >
> > > > Tiago Pires
> > > >