Hi Numan,

This patch [0] proposed by Lucas aims to fix this behavior; could you please take a look?

[0] https://patchwork.ozlabs.org/project/ovn/patch/20250127192252.140404-1-lucas.vd...@luizalabs.com/

Thanks,

Tiago Pires
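For a rough sense of the fan-out described in the thread below: with ovn-monitor-all="true" every ovn-controller monitors every Chassis_Private row, so a single nb_cfg bump by the CMS turns into roughly one update3 notification per chassis on every controller. A back-of-envelope sketch in shell (the chassis count of 500 is an assumption for illustration, not a number from this thread):

# Assumed fleet size; every chassis writes its own Chassis_Private.nb_cfg
# and, with ovn-monitor-all=true, is notified about every other write.
$ n=500
$ echo "per-controller notifications per bump: ~$n, cluster-wide: ~$(( n * n ))"
per-controller notifications per bump: ~500, cluster-wide: ~250000

The per-chassis nb_cfg state that generates those updates can be inspected with something like (assuming ovn-sbctl can reach the SB database):

$ ovn-sbctl --columns=nb_cfg,nb_cfg_timestamp list Chassis_Private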
On Tue, Jan 21, 2025 at 3:04 PM Tiago Pires <tiago.pi...@luizalabs.com> wrote:
>
> Hi Numan,
>
> When having a few hundred chassis and using ovn-monitor-all="true", each
> individual chassis receives a flood of events like the ones below when
> nb_cfg is bumped:
>
> 2025-01-14T21:07:41.632Z|117984348|jsonrpc|DBG|ssl:1X.X5X.1XX.X4:6642: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","26b37ca4-040a-4a45-be92-16d4fb5be9c6"]]],"row":{"nb_cfg":97559,"nb_cfg_timestamp":1736888861632},"op":"update","table":"Chassis_Private"},{"comment":"ovn-controller","op":"comment"}], id=19859
> 2025-01-14T21:07:41.633Z|117984396|jsonrpc|DBG|ssl:1X.X5X.1XX.X4:6642: received notification, method="update3", params=[["monid","OVN_Southbound"],"b823ca4d-700c-45a1-b006-2992b469c0ca",{"Chassis_Private":{"1dd583d4-6eb8-42c7-a825-48f52fcf81e8":{"modify":{"nb_cfg":97559,"nb_cfg_timestamp":1736888861626}}}}]
> 2025-01-14T21:07:41.633Z|117984398|jsonrpc|DBG|ssl:1X.X5X.1XX.X4:6642: received notification, method="update3", params=[["monid","OVN_Southbound"],"3b403314-15da-4c55-99e7-887d20bf19af",{"Chassis_Private":{"a475654d-c3aa-42d7-952d-8467eb965d21":{"modify":{"nb_cfg":97559,"nb_cfg_timestamp":1736888861628}}}}]
>
> In our case we observed that, while these notification events from other
> chassis were arriving, CPU usage on the local socket to the SB started to
> increase:
>
> 2025-01-14T21:07:41|11193704|poll_loop|DBG|wakeup due to [POLLIN] on fd 3 (XX.X3.6X.1X:47382<->1X.X5X.1XX.X4:6642) at lib/stream-ssl.c:842 (100% CPU usage)
>
> Regards,
>
> Tiago Pires
>
> On Mon, Jan 20, 2025 at 2:25 PM Numan Siddique <nsiddi...@crusoe.ai> wrote:
> >
> > On Thu, Jan 16, 2025 at 3:50 PM Tiago Pires via discuss
> > <ovs-discuss@openvswitch.org> wrote:
> >>
> >> Hi all,
> >>
> >> Giving an update regarding this issue: our CMS was bumping nb_cfg and
> >> triggering the flood of updates to the ovn-controllers.
> >
> > Even if the CMS bumps nb_cfg, it should not result in ovn-controller
> > taking 100% CPU, and it should not trigger a full recompute, as we do
> > handle nb_cfg changes incrementally.
> >
> > Numan
> >
> >> Regards,
> >>
> >> Tiago Pires
> >>
> >> On Mon, Jan 13, 2025 at 8:43 PM Tiago Pires <tiago.pi...@luizalabs.com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > We have applied that patch series [0], but it did not solve our main
> >> > issue as we expected.
> >> > Digging more into the issue, the load spike is triggered while
> >> > creating an LRP and attaching it to the LR.
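The NB operation being described above can be reproduced with plain ovn-nbctl; a minimal sketch (the router name, port name, and metadata key are placeholders, while the MAC and networks are taken from the LRP record shown further down):

$ # Create the LRP and attach it to an existing logical router.
$ ovn-nbctl lrp-add lr-test lrp-test 00:01:11:00:21:01 169.254.0.2/24 fe80::2/64
$ # Our internal application then stores its metadata in external_ids via
$ # the OVSDB IDL; roughly equivalent to:
$ ovn-nbctl set Logical_Router_Port lrp-test external_ids:resource_type=logical_router_port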
> >> > When neutron creates a regular LRP, we get the output below without
> >> > triggering the load spike issue:
> >> >
> >> > _uuid : a7a90ddc-3017-4dbc-97fe-ae3272d35ceb
> >> > enabled : []
> >> > external_ids : {"neutron:network_name"=neutron-dadca30d-504f-4ab5-b4a0-ee36cf3616bc, "neutron:revision_number"="3", "neutron:router_name"="062fe63f-8d5e-44a4-b1ed-9d9a12b82f8f", "neutron:subnet_ids"="b2d0301a-c981-48d3-bbf8-ac314b95da5a"}
> >> > gateway_chassis : []
> >> > ha_chassis_group : []
> >> > ipv6_prefix : []
> >> > ipv6_ra_configs : {}
> >> > mac : "fa:16:3e:b8:7b:76"
> >> > name : lrp-49721805-986f-4cea-b011-cb742d58ec7d
> >> > networks : ["172.18.0.1/20"]
> >> > options : {}
> >> > peer : []
> >> >
> >> > While managing OVN interconnect resources, we have an internal
> >> > application that creates the resources via the OVSDB IDL and uses
> >> > external_ids to store some data. This is the output for an LRP that
> >> > triggers the load spike issue after it is created:
> >> >
> >> > _uuid : 8f4efa58-8f02-4875-b1e9-dae44ce79d58
> >> > enabled : []
> >> > external_ids : {"ovn-XX-app:project_type"=xxxxxx, "ovn-XX-app:resource_type"=logical_router_port, "ovn-XX-app:router_name"=neutron-xxxxx3f-8d5e-44a4-b1ed-9d9a12b82yyyy, "ovn-XX-app:tenant_id"="xxxxxef-fa63-4bf8-9ede-70dae1653yyyy"}
> >> > gateway_chassis : []
> >> > ha_chassis_group : []
> >> > ipv6_prefix : []
> >> > ipv6_ra_configs : {}
> >> > mac : "00:01:11:00:21:01"
> >> > name : "062fe63f-8d5e-44a4-b1ed-9d9a12b82f8f-47f1760b-9ed2-4528-8b0c-69e6xxxxxxxx"
> >> > networks : ["169.254.0.2/24", "fe80::2/64"]
> >> > options : {}
> >> > peer : []
> >> >
> >> > I don't recall whether anything is handled differently when the
> >> > external_ids field is used like this.
> >> >
> >> > Running perf, with frame pointers enabled in ovn-controller/ovn-common,
> >> > did not give a clear clue about where so much CPU was being spent:
> >> >
> >> > + 15.07% 8.72% ovn-controller ovn-controller [.] smap_find__
> >> > + 13.54% 13.31% ovn-controller libasan.so.6.0.0 [.] __asan::Allocator::Allocate
> >> > + 10.45% 0.00% ovn-controller libc.so.6 [.] __libc_start_call_main
> >> > + 10.43% 0.01% ovn-controller ovn-controller [.] main
> >> > + 9.88% 3.74% ovn-controller ovn-controller [.] ovsdb_datum_compare_3way
> >> > + 9.71% 0.00% ovn-controller [unknown] [.] 0xbebebebebebebebe
> >> > + 8.66% 0.00% ovn-controller [unknown] [.] 0x0000000000000001
> >> > + 7.05% 4.72% ovn-controller ovn-controller [.] pinctrl_run
> >> > + 6.98% 0.00% ovn-controller [unknown] [.] 0000000000000000
> >> > + 6.79% 0.00% ovn-controller [unknown] [.] 0x0000006110000000
> >> > + 5.93% 5.88% ovn-controller libasan.so.6.0.0 [.] __interceptor_strncmp.part.0
> >> > + 5.36% 5.14% ovn-controller libasan.so.6.0.0 [.] __interceptor_strcmp.part.0
> >> > + 4.96% 4.91% ovn-controller libasan.so.6.0.0 [.] __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle
> >> > + 4.87% 4.64% ovn-controller libasan.so.6.0.0 [.] __asan::QuickCheckForUnpoisonedRegion
> >> > + 4.44% 4.35% ovn-controller ovn-controller [.]
json_string > >> > > >> > The above was our output while running perf while the below output was > >> > saying the ovn-controller was busy receiving SB changes: > >> > > >> > poll_loop|DBG|wakeup due to [POLLIN] on fd 3 > >> > (1XX.XX.XX.X99:47382<->1XX.XX.XX.X16:6642) at lib/stream-ssl.c:842 > >> > (100% CPU usage) > >> > > >> > During the time that the ovn-controller is busy, we counted in just > >> > one minute 170 runs like below: > >> > > >> > 2025-01-13T20:39:36.221Z|6241378|inc_proc_eng|DBG|Initializing new run > >> > 2025-01-13T20:39:36.221Z|6241379|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_sb_global, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241380|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: northd_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241381|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_dhcp_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241382|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_dhcpv6_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241383|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: dhcp_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241384|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_address_set, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241385|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: addr_sets, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241386|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_port_group, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241387|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: ofctrl_is_connected, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241388|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_open_vswitch, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241389|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_bridge, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241390|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_qos, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241391|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_chassis, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241392|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_datapath_binding, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241393|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_port_binding, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241394|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: postponed_ports, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241395|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_port, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241396|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_interface, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.221Z|6241397|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: ovs_interface_shadow, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.222Z|6241398|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: runtime_data, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.222Z|6241399|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: port_groups, 
old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241400|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_chassis_template_var, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241401|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: template_vars, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241402|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: non_vif_data, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241403|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_multicast_group, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241404|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_flow_sample_collector_set, old_state Unchanged, new_state > >> > Stale > >> > 2025-01-13T20:39:36.223Z|6241405|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_mac_binding, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241406|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_static_mac_binding, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241407|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_logical_flow, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241408|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_logical_dp_group, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241409|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_dns, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241410|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_load_balancer, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241411|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: lb_data, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241412|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_fdb, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241413|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_meter, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241414|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: logical_flow_output, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241415|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: ct_zones, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241416|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: if_status_mgr, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241417|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: activated_ports, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241418|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_encap, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241419|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: mff_ovn_geneve, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241420|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: physical_flow_output, old_state Updated, new_state Stale > >> > 2025-01-13T20:39:36.223Z|6241421|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: flow_output, old_state Updated, new_state Stale > >> > > >> > In a few times with less nodes: > >> > > >> > 2025-01-13T20:39:38.433Z|6242415|inc_proc_eng|DBG|Initializing new run > >> > 2025-01-13T20:39:38.433Z|6242416|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_sb_global, old_state Unchanged, new_state Stale > >> 
> 2025-01-13T20:39:38.433Z|6242417|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: northd_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|06875|poll_loop(stopwatch2)|DBG|wakeup due to > >> > [POLLIN] on fd 23 (FIFO pipe:[449309164]) at lib/stopwatch.c:456 (0% > >> > CPU usage) > >> > 2025-01-13T20:39:38.433Z|6242418|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_dhcp_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242419|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_dhcpv6_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242420|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: dhcp_options, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242421|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_address_set, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242422|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: addr_sets, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242423|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_port_group, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242424|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: ofctrl_is_connected, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242425|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_open_vswitch, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242426|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_bridge, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242427|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_qos, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242428|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_chassis, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242429|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_datapath_binding, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242430|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: SB_port_binding, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242431|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: postponed_ports, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242432|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_port, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242433|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: OVS_interface, old_state Unchanged, new_state Stale > >> > 2025-01-13T20:39:38.433Z|6242434|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > node: ovs_interface_shadow, old_state Unchanged, new_state Stale > >> > > >> > Regards, > >> > > >> > Tiago Pires > >> > > >> > [0] > >> > https://github.com/ovn-org/ovn/commit/2a57b204459612a9a4edb49b9c5ebc44e101ee93#diff-97e16400e2bcbb4b65f7f3b8f2c05e9e8e56148df77719b71d60f235e3bcc0edR4379 > >> > > >> > On Tue, Jan 7, 2025 at 6:47 PM Tiago Pires <tiago.pi...@luizalabs.com> > >> > wrote: > >> > > > >> > > On Tue, Jan 7, 2025 at 11:55 AM Felix Huettner > >> > > <felix.huettner@stackit.cloud> wrote: > >> > > > > >> > > > On Tue, Jan 07, 2025 at 11:31:49AM -0300, Tiago Pires via discuss > >> > > > wrote: > >> > > > > Hi guys, > >> > > > > > >> > > > > We followed the debug steps and using perf we got the following > >> > > > > performance trace: > >> > > > > > >> > > > > - 84.56% 0.01% ovn-controller 
[.] main
> >> > > > > - 84.56% main
> >> > > > > - 78.52% pinctrl_run
> >> > > > > - 39.14% lport_lookup_by_name
> >> > > > > - 36.66% ovsdb_idl_index_find
> >> > > > > - skiplist_find
> >> > > > > - 36.36% skiplist_forward_to (inlined)
> >> > > > > - skiplist_forward_to_
> >> > > > > - 31.99% ovsdb_idl_index_generic_comparer.part.0
> >> > > > > - 24.68% ovsdb_datum_compare_3way
> >> > > > > - 13.18% ovsdb_atom_compare_3way
> >> > > > > 8.38% json_string
> >> > > > > 6.27% __strcmp_evex
> >> > > > > 6.28% ovsdb_idl_read
> >> > > > > + 1.07% sbrec_port_binding_index_set_logical_port
> >> > > > > 0.72% ovsdb_idl_index_destroy_row
> >> > > > > 0.60% ovsdb_idl_index_init_row
> >> > > > > - 15.04% extract_ip_addresses
> >> > > > > - parse_and_store_addresses
> >> > > > > - 9.90% add_ipv6_netaddr
> >> > > > > - 9.36% __GI_inet_ntop (inlined)
> >> > > > > + inet_ntop6 (inlined)
> >> > > > > + 2.78% ipv6_parse_cidr_len
> >> > > > > + 2.14% ip_parse_cidr_len
> >> > > > > + 8.35% smap_get
> >> > > > > + 4.93% smap_get_int
> >> > > > > + 3.00% shash_find_data
> >> > > > > + 2.00% ipv6_parse
> >> > > > > + 1.31% eth_addr_from_string
> >> > > > > - 4.46% bfd_run
> >> > > > > - 2.00% sset_add
> >> > > > > 1.43% hash_bytes
> >> > > > > 0.81% encaps_run
> >> > > >
> >> > > > Hi Tiago,
> >> > > >
> >> > > > What I find interesting is that pinctrl_run never calls
> >> > > > lport_lookup_by_name and extract_ip_addresses directly. So
> >> > > > unfortunately it's a little bit of guesswork where these calls come
> >> > > > from; maybe they are called in multiple different functions and perf
> >> > > > just added that up together.
> >> > > >
> >> > > > You could try compiling with frame pointers and using that instead
> >> > > > of dwarf for the call graph. But I am not sure if this helps.
> >> > > >
> >> > > > Just wanted to drop that as my first impression. Maybe it helps.
> >> > >
> >> > > Hi Felix,
> >> > >
> >> > > Yes, that is odd, as you pointed out; the loop in
> >> > > atom_arrays_compare_3way is constantly comparing the JSON values, and
> >> > > there seems to be a performance problem there.
> >> > > We are testing this patch [0] and checking whether it will work as a
> >> > > short-term solution; in the lab we got a reduction of the database
> >> > > update size from 1.1 MB to just 24 KB.
> >> > >
> >> > > [0] https://github.com/ovn-org/ovn/commit/2a57b204459612a9a4edb49b9c5ebc44e101ee93#diff-97e16400e2bcbb4b65f7f3b8f2c05e9e8e56148df77719b71d60f235e3bcc0edR4379
> >> > >
> >> > > Regards,
> >> > >
> >> > > Tiago Pires
> >> > >
> >> > > > > Just to recall: the load spike happens only when ovn-controller
> >> > > > > receives the update for OVN interconnect logical flows (it does
> >> > > > > not happen with regular logical flows).
> >> > > > > We are looking into the lport_lookup_by_name function to try to
> >> > > > > understand its relation to the OVN interconnect logical flows.
> >> > > > >
> >> > > > > Regards,
> >> > > > >
> >> > > > > Tiago Pires
> >> > > > >
> >> > > > > On Tue, Jan 7, 2025 at 5:54 AM Felix Huettner
> >> > > > > <felix.huettner@stackit.cloud> wrote:
> >> > > > > >
> >> > > > > > On Sat, Jan 04, 2025 at 09:59:28PM -0500, Satish Patel wrote:
> >> > > > > > > Hi Felix,
> >> > > > > > >
> >> > > > > > > Are there any documents on OVN troubleshooting the way you explain?
> >> > > > > > > Like what logs to look at, what tools to use to dump perf data
> >> > > > > > > and tech-support-style data to drill down into the issue before
> >> > > > > > > we report to developers. I have heard in the past about
> >> > > > > > > ovn-controller CPU usage at 100%, etc., but there was no better
> >> > > > > > > way to find out what causes those spikes and errors.
> >> > > > > >
> >> > > > > > Hi Satish,
> >> > > > > >
> >> > > > > > I do not know of any documents regarding that. The questions below
> >> > > > > > are based on my experience of looking at similar issues in the past.
> >> > > > > >
> >> > > > > > I also do not yet have a nice way to get perf information for
> >> > > > > > spikes, but most of the time running inc-engine/recompute converts
> >> > > > > > a spike into continuous 100% CPU usage, which I find easier to
> >> > > > > > analyse.
> >> > > > > >
> >> > > > > > To get the perf information you can do the following:
> >> > > > > > * Ensure you have ovn-controller debug symbols available (how to
> >> > > > > >   do that depends on your installation method)
> >> > > > > > * Install "perf". E.g., for Ubuntu it would be something like
> >> > > > > >   "apt install linux-tools-generic binutils libc-dbg"
> >> > > > > > * Run "perf record -p $(pidof ovn-controller) -g --call-graph dwarf"
> >> > > > > >   while the recomputes are running
> >> > > > > > * Press CTRL+C after a few recompute iterations
> >> > > > > > * Wait for "perf record" to finish. Do not press CTRL+C again until
> >> > > > > >   it is done. Depending on how long it ran, this can take some time.
> >> > > > > > * Run "perf report -g". This will parse the profile and output a
> >> > > > > >   nice UI
> >> > > > > > * You will see a list of functions and, before them, the percentage
> >> > > > > >   of CPU time that the process spent in there (once in the function
> >> > > > > >   only and once in the function and all subfunctions)
> >> > > > > > * Expand the tree using the "+" key and follow mostly the top 2
> >> > > > > >   percentages after expansion
> >> > > > > > * Probably somewhere in there there is something to optimize (or
> >> > > > > >   something that has already been optimized)
> >> > > > > >
> >> > > > > > You will see something similar to the "performance trace" in this
> >> > > > > > commit message:
> >> > > > > > https://github.com/ovn-org/ovn/commit/ea619bf64ce1ae00996928f7ed15309a690ee3b9
> >> > > > > > In that case bfd_calculate_chassis was the function that was optimized.
> >> > > > > >
> >> > > > > > I hope that helps.
> >> > > > > >
> >> > > > > > Felix
> >> > > > > >
> >> > > > > > > On Thu, Jan 2, 2025 at 9:48 AM Felix Huettner via discuss
> >> > > > > > > <ovs-discuss@openvswitch.org> wrote:
> >> > > > > > > > On Thu, Jan 02, 2025 at 08:47:34AM -0300, Tiago Pires via
> >> > > > > > > > discuss wrote:
> >> > > > > > > > > Hi all,
> >> > > > > > > > >
> >> > > > > > > > > While creating OVN interconnect logical flows (Transit
> >> > > > > > > > > switch, LRPs and LSPs) for a Logical Router in an AZ, there
> >> > > > > > > > > is a load spike to 100% in the ovn-controllers of all
> >> > > > > > > > > chassis.
> >> > > > > > > > > This lasts for about a minute and a half and returns to
> >> > > > > > > > > normal after the logical flows are installed.
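The debug excerpts quoted below were gathered with ovn-controller vlog levels raised to "dbg"; a rough sketch of how to toggle them (the module names come from the log excerpts themselves, and the "info" level used to revert is an assumption about the deployment's defaults):

$ # Raise the modules seen in the excerpts to debug level on ovn-controller.
$ ovn-appctl -t ovn-controller vlog/set jsonrpc:dbg
$ ovn-appctl -t ovn-controller vlog/set inc_proc_eng:dbg
$ # Revert once enough data has been captured.
$ ovn-appctl -t ovn-controller vlog/set jsonrpc:info
$ ovn-appctl -t ovn-controller vlog/set inc_proc_eng:info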
> >> > > > > > > > > Since this behavior only manifests itself while creating > >> > > > > > > > > ovn > >> > > > > > > > > interconnect logical flows, enabling the debug logs shows > >> > > > > > > > > that the > >> > > > > > > > > ovn-controller receives from SB the following change: > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:15.023Z|10562646|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: SB_encap, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:15.023Z|10562647|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: mff_ovn_geneve, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:15.023Z|10562648|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: physical_flow_output, old_state Unchanged, new_state > >> > > > > > > > > Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:15.023Z|10562649|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: flow_output, old_state Unchanged, new_state Stale > >> > > > > > > > > 2024-12-26T17:38:15.032Z|10562650|jsonrpc|DBG|ssl:1XX.XX.XX.X16:6642: > >> > > > > > > > > received notification, method="update3", > >> > > > > > > > > > >> > > > > > > > params=[["monid","OVN_Southbound"],"9c54ec46-83a4-4deb-877d-3089a99ae482",{"SB_Global > >> > > > > > > > > > >> > > > > > > > ":{"019c4225-e637-4881-b382-36fdfcb9e0c9":{"modify":{"nb_cfg":34448}}},"Port_Binding":{"4230d38a-75cd-4f1c-9fde-62e753958656":{"insert":{"up":false,"tunnel_key":7,"mac":"16:00:01:f4:03:cb > >> > > > > > > > > 169.254 .30.3/24 > >> > > > > > > > > > >> > > > > > > > fe80:30::3/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-793b8","type":"patch"}},"f4d71615-88b8-4274-87ab-448cf8 > >> > > > > > > > > c6a6ee":{"insert":{"up":false,"tunnel_key":6,"mac":"16:00:01:75:f8:87 > >> > > > > > > > > 169.254.10.1/24 > >> > > > > > > > > > >> > > > > > > > fe80:10::1/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc- > >> > > > > > > > > > >> > > > > > > > 9486-e66f9d7b25bf-lrp-ba453","type":"patch"}},"3eac57dd-386c-43b0-bf5b-8225941b52cb":{"insert":{"up":false,"tunnel_key":5,"mac":"16:00:01:9a:30:fa > >> > > > > > > > > 169.254.40.4/24 fe80:40::4/64","datapath":["uuid > >> > > > > > > > > > >> > > > > > > > ","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0","type":"patch"}},"7295d5c6-ff8d-4f4f-b1a5-eb1ca87fcb7d":{"insert":{"up":false,"tunnel_key > >> > > > > > > > > ":8,"mac":"16:00:01:de:df:b5 169.254.20.2/24 > >> > > > > > > > > > >> > > > > > > > fe80:20::2/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-c21c6","type":"patc > >> > > > > > > > > > >> > > > > > > > h"}}},"Logical_Flow":{"224852f2-9a46-4569-b51d-5d340ed6f4b4":{"modify":{"logical_dp_group":["uuid","bd825723-9a0e-451c-b5ad-b7d6f9cc75e4"]}},"05f5d515-bccf-47c1-9d55-8a4dfce88822":{"insert":{"pip > >> > > > > > > > > eline":"ingress","match":"inport == > >> > > > > > > > > \"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0\" && > >> > > > > > > > > ip6.dst == > >> > > > > > > > > > >> > > > > > > > fe80::/64","priority":194,"logical_datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"]," > >> > > > > > > > > table_id":13,"actions":"ip.ttl--; reg8[0..15] = 0; xxreg0 > >> > > > > > > > > = ip6.dst; > 
>> > > > > > > > > xxreg1 = fe80::1400:1ff:fe9a:30fa; eth.src = > >> > > > > > > > > 16:00:01:9a:30:fa; > >> > > > > > > > > outport = > >> > > > > > > > > \"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0\"; > >> > > > > > > > > flags.loopback = 1; > >> > > > > > > > > > >> > > > > > > > next;"}},"c67ec1ee-bfc1-47e0-b35c-95fcc2adc13b":{"modify":{"logical_dp_group": > >> > > > > > > > > > >> > > > > > > > > The addresses from the above change (16:00:01:75:f8:87 > >> > > > > > > > > 169.254.10.1/24 > >> > > > > > > > > fe80:10::1/64) are configured into the LRP connected to > >> > > > > > > > > the transit > >> > > > > > > > > switch. > >> > > > > > > > > > >> > > > > > > > > After while another incremental processing is started: > >> > > > > > > > > > >> > > > > > > > > 2024-12-26T17:38:18.035Z|11190795|coverage|INFO|Skipping > >> > > > > > > > > details of > >> > > > > > > > > duplicate event coverage for hash=db09e857 > >> > > > > > > > > 2024-12-26T17:38:18.035Z|11190796|poll_loop|DBG|wakeup due > >> > > > > > > > > to [POLLIN] > >> > > > > > > > > on fd 3 (1XX.XX.XX.X99:47382<->1XX.XX.XX.X16:6642) at > >> > > > > > > > > lib/stream-ssl.c:842 (100% CPU usage) > >> > > > > > > > > 2024-12-26T17:38:18.035Z|11190797|inc_proc_eng|DBG|Initializing > >> > > > > > > > > new > >> > > > > > > > > run > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190798|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: SB_sb_global, old_state Updated, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190799|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: northd_options, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190800|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: SB_dhcp_options, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190801|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: SB_dhcpv6_options, old_state Unchanged, new_state > >> > > > > > > > > Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190802|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: dhcp_options, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > 2024-12-26T17:38:18.035Z|11190803|inc_proc_eng|DBG|lib/inc-proc-eng.c:379: > >> > > > > > > > > node: SB_address_set, old_state Unchanged, new_state Stale > >> > > > > > > > > > >> > > > > > > > > In other chassis we observerd the recompute time > >> > > > > > > > > increasing to 18secs: > >> > > > > > > > > > >> > > > > > > > > 2024-12-26T17:38:46.463Z|40096697|poll_loop|INFO|Dropped > >> > > > > > > > > 72 log > >> > > > > > > > > messages in last 6 seconds (most recently, 0 seconds ago) > >> > > > > > > > > due to > >> > > > > > > > > excessive rate > >> > > > > > > > > 2024-12-26T17:38:46.463Z|40096698|poll_loop|INFO|wakeup > >> > > > > > > > > due to [POLLIN] on fd 26 > >> > > > > > > > > (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at > >> > > > > > > > > lib/stream-ssl.c:842 (99% CPU usage) > >> > > > > > > > > 2024-12-26T17:38:52.442Z|40096699|poll_loop|INFO|Dropped > >> > > > > > > > > 72 log > >> > > > > > > > > messages in last 6 seconds (most recently, 0 seconds ago) > >> > > > > > > > > due to > >> > > > > > > > > excessive rate > >> > > > > > > > > 2024-12-26T17:38:52.442Z|40096700|poll_loop|INFO|wakeup > >> > > > > > > > > due to [POLLIN] on fd 26 > >> > > > > > > > > 
(1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at > >> > > > > > > > > lib/stream-ssl.c:842 (102% CPU usage) > >> > > > > > > > > 2024-12-26T17:38:58.409Z|40096701|poll_loop|INFO|Dropped > >> > > > > > > > > 72 log > >> > > > > > > > > messages in last 6 seconds (most recently, 0 seconds ago) > >> > > > > > > > > due to > >> > > > > > > > > excessive rate > >> > > > > > > > > 2024-12-26T17:38:58.409Z|40096702|poll_loop|INFO|wakeup > >> > > > > > > > > due to [POLLIN] on fd 26 > >> > > > > > > > > (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at > >> > > > > > > > > lib/stream-ssl.c:842 (101% CPU usage) > >> > > > > > > > > 2024-12-26T17:39:04.477Z|40096703|poll_loop|INFO|Dropped > >> > > > > > > > > 72 log > >> > > > > > > > > messages in last 6 seconds (most recently, 0 seconds ago) > >> > > > > > > > > due to > >> > > > > > > > > excessive rate > >> > > > > > > > > 2024-12-26T17:39:04.477Z|40096704|poll_loop|INFO|wakeup > >> > > > > > > > > due to [POLLIN] on fd 26 > >> > > > > > > > > (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at > >> > > > > > > > > lib/stream-ssl.c:842 (98% CPU usage) > >> > > > > > > > > 2024-12-26T17:39:27.896Z|40096705|inc_proc_eng|INFO|node: > >> > > > > > > > > logical_flow_output, recompute (forced) took 18932ms > >> > > > > > > > > 2024-12-26T17:39:28.612Z|40096706|inc_proc_eng|INFO|node: > >> > > > > > > > > physical_flow_output, recompute (forced) took 670ms > >> > > > > > > > > 2024-12-26T17:39:30.614Z|40096707|timeval|WARN|Unreasonably > >> > > > > > > > > long > >> > > > > > > > > 22134ms poll interval (19780ms user, 2315ms system) > >> > > > > > > > > 2024-12-26T17:39:30.614Z|40096708|timeval|WARN|faults: > >> > > > > > > > > 2634558 minor, > >> > > > > > > > > 0 major > >> > > > > > > > > 2024-12-26T17:39:30.614Z|40096709|timeval|WARN|disk: 0 > >> > > > > > > > > reads, > >> > > > > > > > > 8 writes > >> > > > > > > > > 2024-12-26T17:39:30.614Z|40096710|timeval|WARN|context > >> > > > > > > > > switches: 0 voluntary, 2688 involuntary > >> > > > > > > > > > >> > > > > > > > > This behavior only happens with ovn interconnect logical > >> > > > > > > > > flows, do you > >> > > > > > > > > know any known issue regarding it? > >> > > > > > > > > >> > > > > > > > Hi Tiago, > >> > > > > > > > > >> > > > > > > > from the log above it seems like some part of the previous > >> > > > > > > > incremental > >> > > > > > > > processing run failed and caused a recompute afterwards. So > >> > > > > > > > the last > >> > > > > > > > logs are no longer incrementally processed. > >> > > > > > > > > >> > > > > > > > There should be some log line in between similar to > >> > > > > > > > "...forcing recompute > >> > > > > > > > next time...". That might help you find the cause of the > >> > > > > > > > recompute. > >> > > > > > > > > >> > > > > > > > Additionally you should be able to trigger a recompute > >> > > > > > > > manually with > >> > > > > > > > "ovn-appctl inc-engine/recompute". Then you will probably > >> > > > > > > > observe > >> > > > > > > > similar logs. > >> > > > > > > > You could then attach a perf to the ovn-controller for one > >> > > > > > > > or more > >> > > > > > > > recomputes and use "perf record" and "perf report" with call > >> > > > > > > > graphs to > >> > > > > > > > figure out what exactly takes long. That might be helpful > >> > > > > > > > information > >> > > > > > > > for further optimizations. 
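The recompute-plus-profile loop suggested above can be condensed into a few commands; this is only a sketch of the same steps Felix lists earlier in the thread (the -t target for ovn-appctl is assumed to be the local ovn-controller socket):

$ # In one shell, profile ovn-controller (stop with CTRL+C after a few runs):
$ perf record -p $(pidof ovn-controller) -g --call-graph dwarf
$ # In another shell, force recomputes so the expensive path keeps running:
$ ovn-appctl -t ovn-controller inc-engine/recompute
$ # Then inspect the call graph:
$ perf report -g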
> >> > > > > > > > >
> >> > > > > > > > Also ovn got significant improvements in incremental
> >> > > > > > > > processing since 23.03, so an upgrade might help as well.
> >> > > > > > > >
> >> > > > > > > > Hope that helps
> >> > > > > > > > Felix
> >> > > > > > > >
> >> > > > > > > > > #Versions in use
> >> > > > > > > > >
> >> > > > > > > > > $ ovn-nbctl -V
> >> > > > > > > > > ovn-nbctl 23.03.1
> >> > > > > > > > > Open vSwitch Library 3.1.3
> >> > > > > > > > > DB Schema 7.0.0
> >> > > > > > > > >
> >> > > > > > > > > $ ovs-vsctl -V
> >> > > > > > > > > ovs-vsctl (Open vSwitch) 3.1.3
> >> > > > > > > > > DB Schema 8.3.1
> >> > > > > > > > >
> >> > > > > > > > > external_ids: ovn-monitor-all="true",
> >> > > > > > > > > ovn-ofctrl-wait-before-clear="10000",
> >> > > > > > > > > ovn-openflow-probe-interval="0",
> >> > > > > > > > > ovn-remote-probe-interval="180000"
> >> > > > > > > > >
> >> > > > > > > > > Regards,
> >> > > > > > > > >
> >> > > > > > > > > Tiago Pires
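The external_ids shown above live in the local Open_vSwitch table of each chassis; a sketch of how they are typically set (values copied from the versions block above, "open_vswitch ." addressing the single local record):

$ ovs-vsctl set open_vswitch . \
      external_ids:ovn-monitor-all="true" \
      external_ids:ovn-ofctrl-wait-before-clear="10000" \
      external_ids:ovn-openflow-probe-interval="0" \
      external_ids:ovn-remote-probe-interval="180000"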