On Thu, Apr 10, 2025 at 12:25 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 4/9/25 5:58 PM, Numan Siddique wrote:
> > On Tue, Apr 8, 2025 at 5:57 PM Paulo Guilherme Da Silva via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> >
> >> Hi everyone,
> >
> > Hi all,
> >
> >>
> >> I am writing to share with the community a behavior we are observing
> >> in our infrastructure: high CPU usage by ovn-ic.
> >>
> >> We can reproduce the behavior using ovn-fake-multinode running in a
> >> sandbox. At the moment we are using OVN version 24.03.
> >>
> >> As you can see, we have 3 zones:
> >>
> >> root@vm-se1-paulo:~/ovn-fake-multinode# podman ps
> >> CONTAINER ID  IMAGE                                COMMAND         CREATED     STATUS         PORTS  NAMES
> >> 15bb7e2d21db  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az1-1
> >> 8c21baf990b8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az2-1
> >> 54fc243cbb3c  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az3-1
> >> aac92051d8a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-1
> >> c053e82326a7  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-2
> >> 25705f7b100f  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-3
> >> ebd07e74b2f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-4
> >> 72f8c45178f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-5
> >> 43ca78b73401  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-6
> >> b055c8d42860  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-7
> >> 7fea15004dd9  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-8
> >> 0349d294cc07  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-9
> >> 2fa3d537a506  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-10
> >> 26c07aff9b78  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-11
> >> 83210fb30a91  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-12
> >> b4dff8b37518  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-1
> >> 606655db8d8b  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-2
> >> d45da63d8713  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-3
> >> 4b960252e7a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-4
> >> 56ecfdbd4580  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-5
> >>
> >>
> >> We currently have 3000 routers deployed in each zone of our SDN. With
> >> this number we can see the load and the impact on ovn-ic daemon
> >> processing.

Could you describe your topology in more detail? Does each router in each
zone need to interconnect with its counterparts in the other 2 zones? If
that is the requirement, then yes, the current simple recompute loop of
ovn-ic may not scale. And I agree that incremental processing is the most
appropriate solution.

Best,
Han

> >>
> >> 1. Even when no new resources are being processed, the CPU load
> >> fluctuates between 80% and 99% of CPU time, all the time.
> >>
> >> 2. When we create new resources, the load stays close to 99% of CPU
> >> time until the new deployments finish.
> >>
> >> Our concern is that ovn-ic will not be able to scale to future demand,
> >> since the number of routers is expected to grow in the coming months.
> >>
> >> We built a version with symbols and frame pointers enabled and used it
> >> together with the perf tool to understand the situation.
> >> # perf record -p $(pidof ovn-ic) -g --call-graph dwarf
> >>
> >> While a script was creating new resources, we captured the profile and
> >> obtained the following result:
> >> # perf report -g
> >>
> >> Samples: 53K of event 'cpu-clock:pppH', Event count (approx.): 13339250000
> >>   Children    Self  Command  Shared Object  Symbol
> >> +   99.95%   1.24%  ovn-ic   ovn-ic         [.] main
> >> +   99.93%   0.00%  ovn-ic   ovn-ic         [.] _start
> >> +   99.93%   0.00%  ovn-ic   libc.so.6      [.] __libc_start_main
> >> +   99.93%   0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2cebd8f
> >> +   58.40%   2.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_index_generic_comparer.part.0
> >> +   58.34%   0.04%  ovn-ic   ovn-ic         [.] skiplist_find
> >> +   57.82%   4.93%  ovn-ic   ovn-ic         [.] skiplist_forward_to_
> >> +   57.82%   0.00%  ovn-ic   ovn-ic         [.] skiplist_forward_to (inlined)
> >> +   46.84%  10.29%  ovn-ic   ovn-ic         [.] ovsdb_datum_compare_3way
> >> +   38.25%   0.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_index_find
> >> +   37.93%   1.25%  ovn-ic   ovn-ic         [.] port_binding_run
> >> +   20.33%   6.87%  ovn-ic   ovn-ic         [.] ovsdb_atom_compare_3way
> >> +   20.10%   0.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_cursor_first_eq
> >> +   15.92%   0.02%  ovn-ic   ovn-ic         [.] get_lrp_name_by_ts_port_name
> >> +   13.44%  13.38%  ovn-ic   ovn-ic         [.] json_string
> >> +    9.97%   0.20%  ovn-ic   ovn-ic         [.] ip46_parse_cidr
> >> +    9.55%   9.49%  ovn-ic   ovn-ic         [.] ovsdb_idl_read
> >> +    8.40%   0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2e73806
> >> +    8.37%   8.37%  ovn-ic   libc.so.6      [.] 0x00000000001b1806
> >> +    7.53%   0.19%  ovn-ic   ovn-ic         [.] ip_parse_masked_len
> >> +    7.32%   0.05%  ovn-ic   ovn-ic         [.] ip_parse_cidr
> >> +    6.88%   4.64%  ovn-ic   ovn-ic         [.] smap_find__
> >> +    6.79%   0.32%  ovn-ic   ovn-ic         [.] ovs_scan_len
> >> +    6.46%   4.75%  ovn-ic   ovn-ic         [.] ovs_scan__
> >> +    6.35%   0.03%  ovn-ic   ovn-ic         [.] ovsdb_idl_cursor_next_eq
> >> +    3.71%   0.09%  ovn-ic   ovn-ic         [.] smap_get
> >> +    2.59%   0.04%  ovn-ic   ovn-ic         [.] smap_get_uuid
> >> +    2.26%   0.06%  ovn-ic   ovn-ic         [.] ipv6_parse_cidr
> >> +    2.16%   0.10%  ovn-ic   ovn-ic         [.] ipv6_parse_masked_len
> >> +    2.16%   0.05%  ovn-ic   ovn-ic         [.] xasprintf
> >> +    2.11%   0.16%  ovn-ic   ovn-ic         [.] xvasprintf
> >> +    2.08%   0.12%  ovn-ic   ovn-ic         [.] ts_run
> >> +    1.88%   0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2e73b7e
> >> +    1.87%   1.87%  ovn-ic   libc.so.6      [.] 0x00000000001b1b7e
> >> +    1.87%   1.78%  ovn-ic   ovn-ic         [.] hash_bytes
> >> +    1.66%   0.00%  ovn-ic   ovn-ic         [.] extract_lsp_addresses
> >> +    1.66%   0.01%  ovn-ic   ovn-ic         [.] parse_and_store_addresses
> >>
> >> Attached is the result, zoomed in on the functions that consume the
> >> most CPU time.
> >>
> >> In each iteration of the main loop, ovn-ic calls these 4 main
> >> functions, which in turn iterate over the main tables of ovnsb_idl,
> >> ovnnb_idl, ovnisb_idl and ovninb_idl. In Big O terms, the larger the
> >> tables, the greater the processing cost. We believe this is what we
> >> are seeing here.
> >>
> >> static void
> >> ovn_db_run(struct ic_context *ctx,
> >>            const struct icsbrec_availability_zone *az)
> >> {
> >>     ts_run(ctx);
> >>     gateway_run(ctx, az);
> >>     port_binding_run(ctx, az);
> >>     route_run(ctx, az);
> >> }
> >>
> >> To address the first behavior, we worked on improving the performance
> >> of the event loop in the main function of the process. We added a
> >> check on the state_change_idl->last_ovnsb_seqno attribute, comparing
> >> the current value with the last recorded one, so that the loop body
> >> runs only when something has changed. This approach proved to be
> >> efficient.
> >>
> >> Now, regarding the second behavior described above, and keeping in
> >> mind that the ovn-ic process is currently single-threaded, the
> >> solution is more complex. I think the correct way to solve this
> >> scalability issue would be to implement incremental processing before
> >> proposing a multi-threaded system.
> >>
> >
> > I think adding incremental processing (I-P) support seems to be the right
> > way to go. Adding I-P should address the first concern too IMO. But you
> > can definitely submit a patch to address it and we can discuss it in the
> > patch.
> >
>
> I agree, it seems better to me to try to improve the processing step
> instead of trying to throw threads at the problem.
>
> > For the OVN community I think adding I-P for ovn-ic was not a priority.
> > Probably that's the case with many of the deployments. If you want to add
> > I-P to ovn-ic, I have no objections. You have to do the heavy lifting
> > though :)
> >
> > @Dumitru Ceara <dce...@redhat.com> @Mark Michelson <mmich...@redhat.com> @Han
> > Zhou <hz...@ovn.org> Thoughts?
> >
>
> Indeed, the performance of the ovn-ic daemon wasn't really a priority
> until now. That being said, I'm available to try to answer questions or
> troubleshoot issues that might arise while implementing incremental
> processing for ovn-ic.
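
As a rough illustration of the sequence-number check Paulo describes above, here is a minimal sketch in plain C. It is not taken from any patch: `struct ic_seqnos`, `struct ic_run_gate` and `ic_run_gate_should_run()` are hypothetical names, and the four counters stand in for whatever per-database sequence numbers the real loop would read from its IDL connections. The only idea shown is that the expensive body (`ovn_db_run()` in the quoted code) can be skipped when none of the counters moved since the previous iteration.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one sequence number per database
 * (NB, SB, IC-NB, IC-SB). How these are read is an assumption. */
struct ic_seqnos {
    unsigned int nb;
    unsigned int sb;
    unsigned int inb;
    unsigned int isb;
};

/* Hypothetical gate that remembers the snapshot seen on the last call. */
struct ic_run_gate {
    bool primed;              /* false until the first call. */
    struct ic_seqnos last;    /* Snapshot recorded by the previous call. */
};

/* Returns true on the first call, or when any counter differs from the
 * previously recorded snapshot. Always records 'now' for the next call. */
bool
ic_run_gate_should_run(struct ic_run_gate *gate, const struct ic_seqnos *now)
{
    bool changed = !gate->primed
                   || gate->last.nb != now->nb
                   || gate->last.sb != now->sb
                   || gate->last.inb != now->inb
                   || gate->last.isb != now->isb;

    gate->primed = true;
    gate->last = *now;
    return changed;
}
```

A main loop would call `ic_run_gate_should_run()` once per iteration and invoke the heavy processing only when it returns true. The sketch deliberately ignores any other reason the real daemon may have to run, so it only addresses the idle case (behavior 1), not the cost of each run while resources are being created (behavior 2).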
> >
> > Thanks
> > Numan
> >
> >> We would like to hear your thoughts on this matter and whether we are
> >> approaching the topic correctly. Please let us know if there are any other
> >> debugging commands that would help us with this investigation.
> >>
> >> Thank you in advance
> >>
> >> --
> >> *Paulo Guilherme da Silva*
> >> IaaS - Networking
> >> guilherme.pa...@luizalabs.com
> >>
>
> Regards,
> Dumitru
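
To make the incremental processing idea from this thread a bit more concrete, here is a toy sketch in plain C that contrasts a full recompute with handling only the rows reported as changed. Everything in it is made up for illustration: `struct toy_router`, `toy_derive()`, `toy_recompute_all()` and `toy_process_changed()` are hypothetical, the "derived" value is a stand-in for whatever state would be computed per router, and how the list of changed rows is obtained is simply assumed. It is not ovn-ic code.

```c
#include <stddef.h>

/* Hypothetical row: 'id' is the input, 'derived' is state computed from it. */
struct toy_router {
    int id;
    int derived;
};

/* Stand-in for the real per-row work (lookups, parsing, and so on). */
static int
toy_derive(int id)
{
    return id * 2;
}

/* Full recompute: touches every row on every run, so the cost grows
 * with the table size. Returns the number of rows visited. */
size_t
toy_recompute_all(struct toy_router *rows, size_t n_rows)
{
    for (size_t i = 0; i < n_rows; i++) {
        rows[i].derived = toy_derive(rows[i].id);
    }
    return n_rows;
}

/* Incremental variant: touches only the rows whose indexes appear in
 * 'changed', so the cost grows with the number of changes instead.
 * Returns the number of rows visited. */
size_t
toy_process_changed(struct toy_router *rows,
                    const size_t *changed, size_t n_changed)
{
    for (size_t i = 0; i < n_changed; i++) {
        struct toy_router *r = &rows[changed[i]];
        r->derived = toy_derive(r->id);
    }
    return n_changed;
}
```

With 3000 routers per zone and a single router updated, the first function would visit 3000 rows and the second would visit one. The hard part in practice, which this sketch skips entirely, is tracking which rows changed and which derived state depends on them.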