Hi Felix,

The local leader sends these append messages before sending the reply (the
next log message below):
2025-02-25T14:03:47.776Z|00764|jsonrpc|DBG|ssl:170.168.0.X:39452: send
notification, method="append_request",
params=[{"cluster":"56b3aab6-476f-4ce1-96b9-1588dd4176c9","comment":"heartbeat","from":"11a8329d-bb6f-4e76-849b-090be09c030d","leader_commit":57,"log":[],"prev_log_index":57,"prev_log_term":2,"term":2,"to":"3967f0d3-ed57-4433-861b-e1548d78639f"}]
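In case it is useful for anyone reproducing this, the notification lines above are easy to pull apart programmatically. A minimal sketch, assuming the jsonrpc debug log format shown above (the function name and field selection are my own, not from ovsdb-server):

```python
import json
import re

# Pulls the params=[...] payload out of an ovsdb-server jsonrpc DBG line.
PARAMS_RE = re.compile(r'params=(\[.*\])')

def parse_append_request(line):
    """Return the interesting raft append_request fields, or None."""
    match = PARAMS_RE.search(line)
    if match is None:
        return None
    params = json.loads(match.group(1))[0]
    return {
        'term': params['term'],
        'prev_log_index': params['prev_log_index'],
        'prev_log_term': params['prev_log_term'],
        'entries': len(params['log']),   # empty list for heartbeats
        'comment': params.get('comment'),
    }

# The heartbeat line quoted above, as one string.
sample = ('2025-02-25T14:03:47.776Z|00764|jsonrpc|DBG|ssl:170.168.0.X:39452:'
          ' send notification, method="append_request",'
          ' params=[{"cluster":"56b3aab6-476f-4ce1-96b9-1588dd4176c9",'
          '"comment":"heartbeat","from":"11a8329d-bb6f-4e76-849b-090be09c030d",'
          '"leader_commit":57,"log":[],"prev_log_index":57,"prev_log_term":2,'
          '"term":2,"to":"3967f0d3-ed57-4433-861b-e1548d78639f"}]')
print(parse_append_request(sample))
# -> {'term': 2, 'prev_log_index': 57, 'prev_log_term': 2, 'entries': 0, 'comment': 'heartbeat'}
```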

Here is the beginning of the reply message, with the whole database as the answer:
2025-02-25T14:03:53.469Z|00765|jsonrpc|DBG|ssl:170.168.0.X:51420: send
reply, 
result=[false,"0e044970-54c2-4918-8496-0ad6bb3d5f45",{"ACL":{"001cc1e8-2b1c-4935-addf-6ca20ad45e21":{"initial":{"action":"allow-related",
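To help correlate these, here is a rough way to triage the "send reply" lines from the same debug log: flag the ones whose result starts with false and use the logged line length as a crude size proxy. This is my own sketch (the function name is hypothetical, and the sample reply below is shortened to an empty object in place of the full database dump):

```python
def classify_reply(line):
    """Rough triage of a jsonrpc 'send reply' DBG line.

    Returns (timestamp, rejected, approx_size) or None.  'rejected' is
    True when the result array starts with false; approx_size is just
    the length of the logged line, a crude proxy that still makes a
    hundreds-of-MB reply stand out against small heartbeat replies.
    """
    if 'send reply' not in line:
        return None
    timestamp = line.split('|', 1)[0]
    rejected = 'result=[false' in line
    return timestamp, rejected, len(line)

# Shortened version of the reply quoted above ({} stands in for the
# full database payload, which is truncated in the log excerpt).
reply = ('2025-02-25T14:03:53.469Z|00765|jsonrpc|DBG|ssl:170.168.0.X:51420:'
         ' send reply, result=[false,"0e044970-54c2-4918-8496-0ad6bb3d5f45",{}]')
print(classify_reply(reply))
```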

Could this be something worth investigating?

Regards,

Tiago Pires

On Tue, Feb 25, 2025 at 10:26 AM Tiago Pires <tiago.pi...@luizalabs.com> wrote:
>
> On Tue, Feb 25, 2025 at 7:21 AM Felix Huettner
> <felix.huettner@stackit.cloud> wrote:
> >
> > On Mon, Feb 24, 2025 at 05:44:02PM -0300, Tiago Pires via discuss wrote:
> > > Hi all,
> > >
> > > I have an OVN Central cluster where the leader of the ovsdb NB started
> > > to use 100% of CPU load most of the time:
> > >
> > > 206 root      20   0   11.6g   4.7g   7172 R 106.7   0.3   2059:59 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log
> > >
> > > While at 100% CPU, the read and write operations of the NB cluster
> > > are impacted. Debugging during one of these CPU load spikes, I
> > > can see a jsonrpc reply to a member of the cluster with a size of
> > > 460MB, almost the same size as the NB database. I set up an
> > > ovn-fake-multinode cluster and imported this database there, and the
> > > behavior is still the same.
> > > At least the leader is not changing frequently, since the election
> > > timer is set to 60 seconds.
> > > I have also tested with OVN 24.03 with no luck; same behavior.
> >
> > Hi Tiago,
> >
> > So if I understand correctly, a non-leader member of the raft cluster
> > regularly requests the whole database content.
> > How often does that happen, and can you correlate it with anything on
> > that non-leader member? Maybe that member crashes or gets restarted for
> > some reason?
> >
> > Note that the OVN version does not necessarily say anything about the
> > OVS version, and the OVS version is what provides the code of the ovsdb
> > server. So that version would be interesting as well.
> >
>
> Hi Felix,
>
> You understood it correctly; in this scenario it is both non-leaders of
> the raft cluster. On the leader, the jsonrpc reply can go to either
> non-leader, and it happens about every 10 seconds.
> I checked the non-leaders and their ovsdb processes are not crashing
> or being restarted.
> The OVS version tested is 3.3.4.
>
> > >
> > > The coverage figures are not very clear to me:
> > > # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl coverage/show
> > > Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=6087dcfb:
> > > raft_entry_serialize       0.0/sec     0.000/sec        0.0000/sec   total: 59
> > > hmap_pathological          5.8/sec     3.667/sec        3.5750/sec   total: 585411
> > > hmap_expand              79729.0/sec 53153.200/sec    51825.3172/sec   total: 8484601546
> > > hmap_reserve               0.0/sec     0.000/sec        0.0000/sec   total: 48
> > > lockfile_lock              0.0/sec     0.000/sec        0.0000/sec   total: 1
> > > poll_create_node           3.6/sec     4.317/sec        4.4372/sec   total: 3587083
> > > poll_zero_timeout          0.6/sec     0.150/sec        0.1286/sec   total: 105735
> > > seq_change                 0.6/sec     0.417/sec        0.4158/sec   total: 375960
> > > pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 4
> > > stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 3
> > > unixctl_received           0.0/sec     0.017/sec        0.0003/sec   total: 11
> > > unixctl_replied            0.0/sec     0.017/sec        0.0003/sec   total: 11
> > > util_xalloc              3427998.6/sec 2285349.950/sec  1035236.3394/sec   total: 364876387809
> > > 100 events never hit
> > >
> > > Do you guys have any other way to debug it?
> >
> > Can you share the cluster status of both the leader and the node that
> > always requests the database? Maybe that helps.
> >
> Below is the cluster status from each node:
>
> #leader
> # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> 9944
> Name: OVN_Northbound
> Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> Server ID: 9944 (99443341-5656-464d-b242-85bb16338570)
> Address: ssl:170.168.0.4:6643
> Status: cluster member
> Role: leader
> Term: 9
> Leader: self
> Vote: self
>
> Last Election started 66169932 ms ago, reason: leadership_transfer
> Last Election won: 66169930 ms ago
> Election timer: 60000
> Log: [66, 67]
> Entries not yet committed: 0
> Entries not yet applied: 0
> Connections: ->0000 <-6aee <-7b89 ->7b89
> Disconnections: 1
> Servers:
>     9944 (9944 at ssl:170.168.0.4:6643) (self) next_index=66 match_index=66
>     6aee (6aee at ssl:170.168.0.2:6643) next_index=67 match_index=66 last msg 7857 ms ago
>     7b89 (7b89 at ssl:170.168.0.3:6643) next_index=67 match_index=66 last msg 7857 ms ago
>
> #non-leader 1
> # ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> 6aee
> Name: OVN_Northbound
> Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> Server ID: 6aee (6aee85c6-bd3e-45d5-896e-264ed7eaec00)
> Address: ssl:170.168.0.2:6643
> Status: cluster member
> Role: follower
> Term: 9
> Leader: 9944
> Vote: 9944
>
> Last Election started 66336770 ms ago, reason: leadership_transfer
> Last Election won: 66336767 ms ago
> Election timer: 60000
> Log: [67, 67]
> Entries not yet committed: 0
> Entries not yet applied: 0
> Connections: <-7b89 ->7b89 <-9944 ->9944
> Disconnections: 0
> Servers:
>     9944 (9944 at ssl:170.168.0.4:6643) last msg 11573 ms ago
>     6aee (6aee at ssl:170.168.0.2:6643) (self)
>     7b89 (7b89 at ssl:170.168.0.3:6643) last msg 66173010 ms ago
>
> #non-leader 2
> # ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
> 7b89
> Name: OVN_Northbound
> Cluster ID: a2dc (a2dcce53-a807-4708-bc9d-d0b2470c7ec5)
> Server ID: 7b89 (7b892543-9c2f-43bc-b62c-2941491dbe56)
> Address: ssl:170.168.0.3:6643
> Status: cluster member
> Role: follower
> Term: 9
> Leader: 9944
> Vote: 9944
>
> Election timer: 60000
> Log: [66, 67]
> Entries not yet committed: 0
> Entries not yet applied: 0
> Connections: ->0000 <-6aee ->9944 <-9944
> Disconnections: 1
> Servers:
>     9944 (9944 at ssl:170.168.0.4:6643) last msg 32288 ms ago
>     6aee (6aee at ssl:170.168.0.2:6643) last msg 66228156 ms ago
>     7b89 (7b89 at ssl:170.168.0.3:6643) (self)
>
> Thank you
>
> Regards,
>
> Tiago Pires
>
> >
> > Thanks a lot,
> > Felix
> >
> > >
> > > Regards,
> > >
> > > Tiago Pires
> > >
> > > --
> > >
> > >
> > >
> > >
> > > _'This message is intended only for the addresses listed in the
> > > initial header. If you are not listed in those addresses, we ask
> > > that you completely disregard the content of this message; the
> > > copying, forwarding and/or execution of the actions mentioned are
> > > immediately void and prohibited.'_
> > >
> > >
> > > _'Although Magazine Luiza takes all reasonable precautions to ensure
> > > that no virus is present in this e-mail, the company cannot accept
> > > responsibility for any loss or damage caused by this e-mail or its
> > > attachments.'_
> > >
> > >
> > >
> > > _______________________________________________
> > > discuss mailing list
> > > disc...@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
