On 4/9/25 5:58 PM, Numan Siddique wrote:
> On Tue, Apr 8, 2025 at 5:57 PM Paulo Guilherme Da Silva via discuss <
> ovs-discuss@openvswitch.org> wrote:
> 
>> Hi everyone,

Hi all,

>>
>> I am writing to share with the community a behavior we are observing in
>> our infrastructure: high CPU usage by the ovn-ic daemon.
>>
>> We can simulate the behavior using ovn-fake-multinode running in a
>> sandbox. At the moment we're using OVN version 24.03.
>>
>> As you can see, we have 3 zones:
>>
>> root@vm-se1-paulo:~/ovn-fake-multinode# podman ps
>> CONTAINER ID  IMAGE                                COMMAND         CREATED     STATUS         PORTS       NAMES
>> 15bb7e2d21db  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-central-az1-1
>> 8c21baf990b8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-central-az2-1
>> 54fc243cbb3c  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-central-az3-1
>> aac92051d8a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-1
>> c053e82326a7  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-2
>> 25705f7b100f  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-3
>> ebd07e74b2f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-4
>> 72f8c45178f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-5
>> 43ca78b73401  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-6
>> b055c8d42860  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-7
>> 7fea15004dd9  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-8
>> 0349d294cc07  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-9
>> 2fa3d537a506  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-10
>> 26c07aff9b78  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-11
>> 83210fb30a91  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-gw-12
>> b4dff8b37518  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-chassis-1
>> 606655db8d8b  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-chassis-2
>> d45da63d8713  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-chassis-3
>> 4b960252e7a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-chassis-4
>> 56ecfdbd4580  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago              ovn-chassis-5
>>
>>
>> We currently have 3000 routers deployed in each zone of our SDN. At this
>> scale we can already see the load and its impact on ovn-ic daemon processing.
>>
>> 1. Even when no new resources are being processed, the CPU load
>> fluctuates between 80% and 99% of CPU time, all the time.
>>
>> 2. When we create new resources, the load stays close to 99% of CPU time
>> until the new deployments finish.
>>
>> Our concern is that ovn-ic will not be able to scale to future demand,
>> since the number of routers is expected to grow in the coming months.
>>
>> We built a version with symbols and frame pointers enabled and used it
>> together with the perf tool to understand the situation.
>> # perf record -p $(pidof ovn-ic) -g --call-graph dwarf
>>
>> While a script was creating new resources, we captured a profile and
>> obtained the following result:
>> # perf report -g
>>
>> Samples: 53K of event 'cpu-clock:pppH', Event count (approx.): 13339250000
>>   Children      Self  Command  Shared Object      Symbol
>> +   99.95%     1.24%  ovn-ic   ovn-ic             [.] main
>> +   99.93%     0.00%  ovn-ic   ovn-ic             [.] _start
>> +   99.93%     0.00%  ovn-ic   libc.so.6          [.] __libc_start_main
>> +   99.93%     0.00%  ovn-ic   libc.so.6          [.] 0x00007f6ba2cebd8f
>> +   58.40%     2.01%  ovn-ic   ovn-ic             [.] ovsdb_idl_index_generic_comparer.part.0
>> +   58.34%     0.04%  ovn-ic   ovn-ic             [.] skiplist_find
>> +   57.82%     4.93%  ovn-ic   ovn-ic             [.] skiplist_forward_to_
>> +   57.82%     0.00%  ovn-ic   ovn-ic             [.] skiplist_forward_to (inlined)
>> +   46.84%    10.29%  ovn-ic   ovn-ic             [.] ovsdb_datum_compare_3way
>> +   38.25%     0.01%  ovn-ic   ovn-ic             [.] ovsdb_idl_index_find
>> +   37.93%     1.25%  ovn-ic   ovn-ic             [.] port_binding_run
>> +   20.33%     6.87%  ovn-ic   ovn-ic             [.] ovsdb_atom_compare_3way
>> +   20.10%     0.01%  ovn-ic   ovn-ic             [.] ovsdb_idl_cursor_first_eq
>> +   15.92%     0.02%  ovn-ic   ovn-ic             [.] get_lrp_name_by_ts_port_name
>> +   13.44%    13.38%  ovn-ic   ovn-ic             [.] json_string
>> +    9.97%     0.20%  ovn-ic   ovn-ic             [.] ip46_parse_cidr
>> +    9.55%     9.49%  ovn-ic   ovn-ic             [.] ovsdb_idl_read
>> +    8.40%     0.00%  ovn-ic   libc.so.6          [.] 0x00007f6ba2e73806
>> +    8.37%     8.37%  ovn-ic   libc.so.6          [.] 0x00000000001b1806
>> +    7.53%     0.19%  ovn-ic   ovn-ic             [.] ip_parse_masked_len
>> +    7.32%     0.05%  ovn-ic   ovn-ic             [.] ip_parse_cidr
>> +    6.88%     4.64%  ovn-ic   ovn-ic             [.] smap_find__
>> +    6.79%     0.32%  ovn-ic   ovn-ic             [.] ovs_scan_len
>> +    6.46%     4.75%  ovn-ic   ovn-ic             [.] ovs_scan__
>> +    6.35%     0.03%  ovn-ic   ovn-ic             [.] ovsdb_idl_cursor_next_eq
>> +    3.71%     0.09%  ovn-ic   ovn-ic             [.] smap_get
>> +    2.59%     0.04%  ovn-ic   ovn-ic             [.] smap_get_uuid
>> +    2.26%     0.06%  ovn-ic   ovn-ic             [.] ipv6_parse_cidr
>> +    2.16%     0.10%  ovn-ic   ovn-ic             [.] ipv6_parse_masked_len
>> +    2.16%     0.05%  ovn-ic   ovn-ic             [.] xasprintf
>> +    2.11%     0.16%  ovn-ic   ovn-ic             [.] xvasprintf
>> +    2.08%     0.12%  ovn-ic   ovn-ic             [.] ts_run
>> +    1.88%     0.00%  ovn-ic   libc.so.6          [.] 0x00007f6ba2e73b7e
>> +    1.87%     1.87%  ovn-ic   libc.so.6          [.] 0x00000000001b1b7e
>> +    1.87%     1.78%  ovn-ic   ovn-ic             [.] hash_bytes
>> +    1.66%     0.00%  ovn-ic   ovn-ic             [.] extract_lsp_addresses
>> +    1.66%     0.01%  ovn-ic   ovn-ic             [.] parse_and_store_addresses
>>
>> In the attachment I share the result, zoomed in on the functions that
>> consume the most CPU time.
>>
>> In each iteration of the main loop, ovn-ic runs these 4 main functions,
>> which in turn iterate over the main tables of the ovnsb_idl, ovnnb_idl,
>> ovnisb_idl and ovninb_idl. In Big O terms, the larger the tables, the
>> more processing each iteration costs. We believe that this is what we
>> are seeing here.
>>
>> static void
>> ovn_db_run(struct ic_context *ctx,
>>            const struct icsbrec_availability_zone *az)
>> {
>>     ts_run(ctx);
>>     gateway_run(ctx, az);
>>     port_binding_run(ctx, az);
>>     route_run(ctx, az);
>> }
>>
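
To put rough numbers on the scaling argument, here is a toy cost model in C.
It is only a sketch under stated assumptions: a sorted array with binary
search stands in for the IDL skiplist index (both have O(log n) lookups), and
the function names are made up for illustration. They are not ovn-ic code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an indexed table: 'keys' must be sorted and distinct.
 * Returns the number of key comparisons needed to find 'target'. */
static size_t
lookup_comparisons(const int *keys, size_t n, int target)
{
    size_t lo = 0, hi = n, cmps = 0;

    assert(keys != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        cmps++;
        if (keys[mid] == target) {
            break;
        } else if (keys[mid] < target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return cmps;
}

/* Models one full recompute: visit every row and do one index lookup per
 * row.  The total number of comparisons grows as O(n log n). */
static size_t
full_recompute_comparisons(const int *keys, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += lookup_comparisons(keys, n, keys[i]);
    }
    return total;
}
```

In this model, doubling the number of rows more than doubles the comparisons
per loop iteration, and that cost is paid on every iteration whether or not
any row changed.
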
>> To address the first behavior, we have been working on improving the
>> performance of this event loop in the main function of the process. We
>> added a check on the state_change_idl->last_ovnsb_seqno attribute that
>> compares the current value with the last one seen, so that the loop body
>> runs only when something has changed. This approach proved to be efficient.
>>
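
For readers following along, the sequence-number check described above could
look roughly like the sketch below. This is an assumption about the shape of
the change, not the actual patch: struct fake_db, struct change_tracker and
loop_iteration() are hypothetical names, and the sketch tracks several
databases although the description above only mentions the southbound seqno.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a database connection.  In OVS the counter
 * would come from the IDL; here it is a plain field so that the sketch
 * compiles on its own. */
struct fake_db {
    unsigned int seqno;     /* Bumped whenever the contents change. */
};

/* Remembers the last sequence number that was fully processed. */
struct change_tracker {
    unsigned int last_seqno;
    bool initialized;
};

/* Returns true if 'db' changed since the previous call (or on the first
 * call) and records the new sequence number. */
static bool
db_changed(struct change_tracker *t, const struct fake_db *db)
{
    assert(t && db);
    if (t->initialized && t->last_seqno == db->seqno) {
        return false;
    }
    t->last_seqno = db->seqno;
    t->initialized = true;
    return true;
}

/* One main-loop iteration: the expensive recompute runs only if at least
 * one input database changed.  'recompute_runs' counts those runs. */
static void
loop_iteration(struct change_tracker trackers[], const struct fake_db dbs[],
               int n_dbs, unsigned int *recompute_runs)
{
    bool changed = false;

    for (int i = 0; i < n_dbs; i++) {
        /* No short-circuit: every tracker must be brought up to date. */
        if (db_changed(&trackers[i], &dbs[i])) {
            changed = true;
        }
    }
    if (changed) {
        (*recompute_runs)++;    /* Stand-in for ovn_db_run(ctx, az). */
    }
}
```

A gate like this only helps while the databases are idle. Once anything
changes, the full recompute still runs, which is the second behavior
discussed next.
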
>> Now, regarding the second behavior described above, and keeping in mind
>> that the ovn-ic process is currently single-threaded, the solution is more
>> complex. I think the correct way to solve this scalability issue would be
>> to implement incremental processing before proposing a multi-threaded design.
>>
> 
> I think adding incremental processing (I-P) support seems to be the right
> way to go.  Adding I-P should address the first concern too IMO.  But you
> can definitely submit a patch to address it and we can discuss it in the
> patch.
> 

I agree, it seems better to me to try to improve the processing step
instead of trying to throw threads at the problem.
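
To make the idea concrete, here is a minimal sketch of the difference between
a full recompute and an incremental handler. Everything in it is hypothetical
(struct row, struct derived and both functions are illustration only, not
ovn-ic or IDL code), and it assumes the caller already knows which rows
changed, which is the role change tracking would play for real tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input row. */
struct row {
    int value;      /* Current input value. */
    int applied;    /* Value already reflected in the derived state. */
};

/* Derived state computed from all rows, plus a work counter. */
struct derived {
    long total;             /* Sum of all row values. */
    size_t rows_processed;  /* How many rows have been touched so far. */
};

/* Full recompute: rebuilds the derived state from every row. */
static void
recompute(struct derived *d, struct row rows[], size_t n)
{
    assert(d != NULL);
    d->total = 0;
    for (size_t i = 0; i < n; i++) {
        d->total += rows[i].value;
        rows[i].applied = rows[i].value;
        d->rows_processed++;
    }
}

/* Incremental handler: touches only the rows whose indexes are listed in
 * 'dirty' and applies the difference to the derived state. */
static void
handle_changes(struct derived *d, struct row rows[],
               const size_t dirty[], size_t n_dirty)
{
    assert(d != NULL);
    for (size_t i = 0; i < n_dirty; i++) {
        struct row *r = &rows[dirty[i]];

        d->total += r->value - r->applied;
        r->applied = r->value;
        d->rows_processed++;
    }
}
```

With 3000 rows and 2 changed rows, the handler touches 2 rows instead of 3000
and ends up with the same derived state as a fresh recompute. The hard part
in a real daemon is writing a correct handler for every kind of input change
and falling back to a full recompute when no handler applies.
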

> For the OVN community I think adding I-P for ovn-ic was not a priority.
> Probably that's the case with many of the deployments.  If you want to add
> I-P to ovn-ic,  I have no objections.  You have to do the heavy lifting
> though :)
> 
> @Dumitru Ceara <dce...@redhat.com> @Mark Michelson <mmich...@redhat.com>  @Han
> Zhou <hz...@ovn.org>   Thoughts ?
> 

Indeed, the performance of the ovn-ic daemon wasn't really a priority
until now.  That being said, I'm available to try to answer questions or
troubleshoot issues that might arise while implementing incremental
processing for ovn-ic.


> Thanks
> Numan
> 
>> We would like to hear your thoughts on this matter and whether we are
>> approaching the topic correctly. Please let us know if there are any other
>> debugging commands that would help us with this investigation.
>>
>> Thank you in advance
>>
>> --
>> *Paulo Guilherme da Silva*
>> IaaS - Networking
>> guilherme.pa...@luizalabs.com
>>

Regards,
Dumitru
