On Tue, Apr 8, 2025 at 5:57 PM Paulo Guilherme Da Silva via discuss <
[email protected]> wrote:
> Hi everyone,
>
> I'm writing to share with the community a behavior we are observing in
> our infrastructure: high CPU usage in the ovn-ic daemon.
>
> We can simulate the behavior using ovn-fake-multinode running in a
> sandbox. We are currently using OVN version 24.03.
>
> As you can see below, we have 3 zones:
>
> root@vm-se1-paulo:~/ovn-fake-multinode# podman ps
> CONTAINER ID  IMAGE                                COMMAND         CREATED     STATUS         PORTS  NAMES
> 15bb7e2d21db  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az1-1
> 8c21baf990b8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az2-1
> 54fc243cbb3c  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-central-az3-1
> aac92051d8a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-1
> c053e82326a7  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-2
> 25705f7b100f  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-3
> ebd07e74b2f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-4
> 72f8c45178f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-5
> 43ca78b73401  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-6
> b055c8d42860  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-7
> 7fea15004dd9  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-8
> 0349d294cc07  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-9
> 2fa3d537a506  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-10
> 26c07aff9b78  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-11
> 83210fb30a91  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-gw-12
> b4dff8b37518  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-1
> 606655db8d8b  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-2
> d45da63d8713  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-3
> 4b960252e7a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-4
> 56ecfdbd4580  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago         ovn-chassis-5
>
>
> We currently have 3000 routers deployed in each zone of our SDN, and at
> this scale we can already see the load and its impact on ovn-ic processing.
>
> 1. Even when no new resources are being processed, CPU usage constantly
> fluctuates between 80% and 99%.
>
> 2. When we create new resources, CPU usage stays close to 99% until the
> new deployments finish.
>
> Our concern is that ovn-ic will not be able to scale to future demand,
> since the number of routers is expected to grow in the coming months.
>
> We built a version with debug symbols and frame pointers enabled, and
> used it together with the perf tool to understand the situation:
> # perf record -p $(pidof ovn-ic) -g --call-graph dwarf
>
> While a script was creating new resources, we captured a profile; the
> report below shows the result:
> # perf report -g
>
> Samples: 53K of event 'cpu-clock:pppH', Event count (approx.): 13339250000
>   Children   Self   Command  Shared Object  Symbol
> + 99.95%   1.24%  ovn-ic  ovn-ic     [.] main
> + 99.93%   0.00%  ovn-ic  ovn-ic     [.] _start
> + 99.93%   0.00%  ovn-ic  libc.so.6  [.] __libc_start_main
> + 99.93%   0.00%  ovn-ic  libc.so.6  [.] 0x00007f6ba2cebd8f
> + 58.40%   2.01%  ovn-ic  ovn-ic     [.] ovsdb_idl_index_generic_comparer.part.0
> + 58.34%   0.04%  ovn-ic  ovn-ic     [.] skiplist_find
> + 57.82%   4.93%  ovn-ic  ovn-ic     [.] skiplist_forward_to_
> + 57.82%   0.00%  ovn-ic  ovn-ic     [.] skiplist_forward_to (inlined)
> + 46.84%  10.29%  ovn-ic  ovn-ic     [.] ovsdb_datum_compare_3way
> + 38.25%   0.01%  ovn-ic  ovn-ic     [.] ovsdb_idl_index_find
> + 37.93%   1.25%  ovn-ic  ovn-ic     [.] port_binding_run
> + 20.33%   6.87%  ovn-ic  ovn-ic     [.] ovsdb_atom_compare_3way
> + 20.10%   0.01%  ovn-ic  ovn-ic     [.] ovsdb_idl_cursor_first_eq
> + 15.92%   0.02%  ovn-ic  ovn-ic     [.] get_lrp_name_by_ts_port_name
> + 13.44%  13.38%  ovn-ic  ovn-ic     [.] json_string
> +  9.97%   0.20%  ovn-ic  ovn-ic     [.] ip46_parse_cidr
> +  9.55%   9.49%  ovn-ic  ovn-ic     [.] ovsdb_idl_read
> +  8.40%   0.00%  ovn-ic  libc.so.6  [.] 0x00007f6ba2e73806
> +  8.37%   8.37%  ovn-ic  libc.so.6  [.] 0x00000000001b1806
> +  7.53%   0.19%  ovn-ic  ovn-ic     [.] ip_parse_masked_len
> +  7.32%   0.05%  ovn-ic  ovn-ic     [.] ip_parse_cidr
> +  6.88%   4.64%  ovn-ic  ovn-ic     [.] smap_find__
> +  6.79%   0.32%  ovn-ic  ovn-ic     [.] ovs_scan_len
> +  6.46%   4.75%  ovn-ic  ovn-ic     [.] ovs_scan__
> +  6.35%   0.03%  ovn-ic  ovn-ic     [.] ovsdb_idl_cursor_next_eq
> +  3.71%   0.09%  ovn-ic  ovn-ic     [.] smap_get
> +  2.59%   0.04%  ovn-ic  ovn-ic     [.] smap_get_uuid
> +  2.26%   0.06%  ovn-ic  ovn-ic     [.] ipv6_parse_cidr
> +  2.16%   0.10%  ovn-ic  ovn-ic     [.] ipv6_parse_masked_len
> +  2.16%   0.05%  ovn-ic  ovn-ic     [.] xasprintf
> +  2.11%   0.16%  ovn-ic  ovn-ic     [.] xvasprintf
> +  2.08%   0.12%  ovn-ic  ovn-ic     [.] ts_run
> +  1.88%   0.00%  ovn-ic  libc.so.6  [.] 0x00007f6ba2e73b7e
> +  1.87%   1.87%  ovn-ic  libc.so.6  [.] 0x00000000001b1b7e
> +  1.87%   1.78%  ovn-ic  ovn-ic     [.] hash_bytes
> +  1.66%   0.00%  ovn-ic  ovn-ic     [.] extract_lsp_addresses
> +  1.66%   0.01%  ovn-ic  ovn-ic     [.] parse_and_store_addresses
>
> Attached, I share the result of zooming in on the functions that consume
> the most CPU time.
>
> On each cycle of the main loop, ovn-ic runs these four main functions,
> which in turn iterate over the main tables of ovnsb_idl, ovnnb_idl,
> ovnisb_idl, and ovninb_idl. In Big O terms, the larger these tables get,
> the greater the processing cost, and we believe that is what we are
> seeing here.
>
> static void
> ovn_db_run(struct ic_context *ctx,
>            const struct icsbrec_availability_zone *az)
> {
>     ts_run(ctx);
>     gateway_run(ctx, az);
>     port_binding_run(ctx, az);
>     route_run(ctx, az);
> }
>
> To address the first behavior, we worked on improving the performance of
> the event loop in the process's main function: we added a check on the
> state_change_idl->last_ovnsb_seqno attribute, comparing the current value
> against the previously seen one so that the loop body only executes when
> something has changed. This approach proved effective.
>
> Now, regarding the second behavior described above: keeping in mind that
> the ovn-ic process is currently single-threaded, the solution is more
> complex. I think the correct way to solve this scalability issue is to
> implement incremental processing before proposing a multi-threaded design.
>
I think adding incremental processing (I-P) support seems to be the right
way to go. Adding I-P should address the first concern too IMO. But you
can definitely submit a patch to address it and we can discuss it in the
patch.
For the OVN community, I think adding I-P for ovn-ic has not been a
priority; that's probably the case with many deployments. If you want to add
I-P to ovn-ic, I have no objections. You have to do the heavy lifting
though :)
@Dumitru Ceara <[email protected]> @Mark Michelson <[email protected]> @Han
Zhou <[email protected]> Thoughts ?
Thanks
Numan
> We would like to hear your thoughts on this matter and whether we are
> approaching the topic correctly. Please let us know if there are any other
> debugging commands that would help us with this investigation.
>
> Thank you in advance
>
> --
> *Paulo Guilherme da Silva*
> IaaS - Networking
> [email protected]
>
> _______________________________________________
> discuss mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>