Hi everyone,

I am writing to share with the community a behavior we are observing in our infrastructure: high CPU usage by the ovn-ic daemon.
We can reproduce the behavior using ovn-fake-multinode running in a sandbox. At the moment we are using OVN version 24.03. As you can see, we have 3 zones:

    root@vm-se1-paulo:~/ovn-fake-multinode# podman ps
    CONTAINER ID  IMAGE                                COMMAND         CREATED     STATUS          PORTS  NAMES
    15bb7e2d21db  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-central-az1-1
    8c21baf990b8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-central-az2-1
    54fc243cbb3c  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-central-az3-1
    aac92051d8a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-1
    c053e82326a7  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-2
    25705f7b100f  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-3
    ebd07e74b2f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-4
    72f8c45178f8  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-5
    43ca78b73401  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-6
    b055c8d42860  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-7
    7fea15004dd9  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-8
    0349d294cc07  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-9
    2fa3d537a506  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-10
    26c07aff9b78  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-11
    83210fb30a91  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-gw-12
    b4dff8b37518  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-chassis-1
    606655db8d8b  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-chassis-2
    d45da63d8713  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-chassis-3
    4b960252e7a3  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-chassis-4
    56ecfdbd4580  localhost/ovn/ovn-multi-node:latest  /usr/sbin/init  9 days ago  Up 9 days ago          ovn-chassis-5

We currently have 3000 routers deployed in each zone of our SDN. At this scale we can already see the load and its impact on ovn-ic daemon processing:

1. Even when no new resources are being processed, the CPU load fluctuates between 80% and 99% of CPU time, all the time.

2. When we create new resources, the load stays close to 99% of CPU time until the new deployments finish.

Our concern is that ovn-ic will not be able to scale to future demand, since the number of routers is expected to grow in the coming months.

We built a version with symbols and frame pointers enabled and used it together with the perf tool to understand the situation:

    # perf record -p $(pidof ovn-ic) -g --call-graph dwarf

We captured the profile while a script was creating new resources, and obtained the following result:

    # perf report -g
    Samples: 53K of event 'cpu-clock:pppH', Event count (approx.): 13339250000
      Children      Self  Command  Shared Object  Symbol
    +   99.95%     1.24%  ovn-ic   ovn-ic         [.] main
    +   99.93%     0.00%  ovn-ic   ovn-ic         [.] _start
    +   99.93%     0.00%  ovn-ic   libc.so.6      [.] __libc_start_main
    +   99.93%     0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2cebd8f
    +   58.40%     2.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_index_generic_comparer.part.0
    +   58.34%     0.04%  ovn-ic   ovn-ic         [.] skiplist_find
    +   57.82%     4.93%  ovn-ic   ovn-ic         [.] skiplist_forward_to_
    +   57.82%     0.00%  ovn-ic   ovn-ic         [.] skiplist_forward_to (inlined)
    +   46.84%    10.29%  ovn-ic   ovn-ic         [.] ovsdb_datum_compare_3way
    +   38.25%     0.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_index_find
    +   37.93%     1.25%  ovn-ic   ovn-ic         [.] port_binding_run
    +   20.33%     6.87%  ovn-ic   ovn-ic         [.] ovsdb_atom_compare_3way
    +   20.10%     0.01%  ovn-ic   ovn-ic         [.] ovsdb_idl_cursor_first_eq
    +   15.92%     0.02%  ovn-ic   ovn-ic         [.] get_lrp_name_by_ts_port_name
    +   13.44%    13.38%  ovn-ic   ovn-ic         [.] json_string
    +    9.97%     0.20%  ovn-ic   ovn-ic         [.] ip46_parse_cidr
    +    9.55%     9.49%  ovn-ic   ovn-ic         [.] ovsdb_idl_read
    +    8.40%     0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2e73806
    +    8.37%     8.37%  ovn-ic   libc.so.6      [.] 0x00000000001b1806
    +    7.53%     0.19%  ovn-ic   ovn-ic         [.] ip_parse_masked_len
    +    7.32%     0.05%  ovn-ic   ovn-ic         [.] ip_parse_cidr
    +    6.88%     4.64%  ovn-ic   ovn-ic         [.] smap_find__
    +    6.79%     0.32%  ovn-ic   ovn-ic         [.] ovs_scan_len
    +    6.46%     4.75%  ovn-ic   ovn-ic         [.] ovs_scan__
    +    6.35%     0.03%  ovn-ic   ovn-ic         [.] ovsdb_idl_cursor_next_eq
    +    3.71%     0.09%  ovn-ic   ovn-ic         [.] smap_get
    +    2.59%     0.04%  ovn-ic   ovn-ic         [.] smap_get_uuid
    +    2.26%     0.06%  ovn-ic   ovn-ic         [.] ipv6_parse_cidr
    +    2.16%     0.10%  ovn-ic   ovn-ic         [.] ipv6_parse_masked_len
    +    2.16%     0.05%  ovn-ic   ovn-ic         [.] xasprintf
    +    2.11%     0.16%  ovn-ic   ovn-ic         [.] xvasprintf
    +    2.08%     0.12%  ovn-ic   ovn-ic         [.] ts_run
    +    1.88%     0.00%  ovn-ic   libc.so.6      [.] 0x00007f6ba2e73b7e
    +    1.87%     1.87%  ovn-ic   libc.so.6      [.] 0x00000000001b1b7e
    +    1.87%     1.78%  ovn-ic   ovn-ic         [.] hash_bytes
    +    1.66%     0.00%  ovn-ic   ovn-ic         [.] extract_lsp_addresses
    +    1.66%     0.01%  ovn-ic   ovn-ic         [.] parse_and_store_addresses

In the attachment I share the same result, expanded to zoom in on the functions that consume the most CPU time.

In each cycle of its main loop, ovn-ic goes through the 4 main functions listed below, which in turn iterate over the main tables of ovnsb_idl, ovnnb_idl, ovnisb_idl and ovninb_idl. In Big O terms, the larger the tables, the more processing each cycle requires. We believe that this is what we are seeing here.

    static void
    ovn_db_run(struct ic_context *ctx,
               const struct icsbrec_availability_zone *az)
    {
        ts_run(ctx);
        gateway_run(ctx, az);
        port_binding_run(ctx, az);
        route_run(ctx, az);
    }

To address the first behavior, we have been working on improving the performance of the event loop in the main function of the process. We added a check on the state_change_idl->last_ovnsb_seqno attribute, comparing the current value with the last recorded one so that the loop body runs only when something has changed, and this approach proved to be efficient.

Regarding the second behavior described above, and keeping in mind that the ovn-ic process is currently single-threaded, the solution is more complex. I think the correct way to solve this scalability issue would be to implement incremental processing before proposing a multi-threaded design.

We would like to hear your thoughts on this matter and whether we are approaching the topic correctly. Please let us know if there are any other debugging commands that would help us with this investigation.

Thank you in advance

--
Paulo Guilherme da Silva
IaaS - Networking
guilherme.pa...@luizalabs.com

--
'This message is intended only for the addresses listed in the initial header. If you are not listed among those addresses, we ask that you disregard the content of this message entirely; copying it, forwarding it, and/or carrying out the actions it mentions are hereby voided and prohibited.'

'Although Magazine Luiza takes all reasonable precautions to ensure that no virus is present in this e-mail, the company cannot accept responsibility for any loss or damage caused by this e-mail or by its attachments.'
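
For reference, the sequence-number check described in the message above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: the struct ic_seqno_state, its primed flag and the helper ic_seqno_state_update() are hypothetical names, and only last_ovnsb_seqno is mentioned in the message; tracking the other three databases the same way is an assumption. In the real daemon the current values would presumably be read with ovsdb_idl_get_seqno() on each IDL after ovsdb_idl_run().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping (not the actual patch): the last change
 * sequence number seen on each database ovn-ic is connected to. */
struct ic_seqno_state {
    bool primed;                     /* False until the first update. */
    unsigned int last_ovnsb_seqno;
    unsigned int last_ovnnb_seqno;   /* Assumed, by analogy with ovnsb. */
    unsigned int last_ovnisb_seqno;  /* Assumed, by analogy with ovnsb. */
    unsigned int last_ovninb_seqno;  /* Assumed, by analogy with ovnsb. */
};

/* Records the current sequence numbers in 'state'.  Returns true on the
 * first call, or if any number differs from the previously recorded one,
 * meaning the expensive full recomputation should run in this iteration. */
static bool
ic_seqno_state_update(struct ic_seqno_state *state,
                      unsigned int ovnsb, unsigned int ovnnb,
                      unsigned int ovnisb, unsigned int ovninb)
{
    assert(state != NULL);

    bool changed = !state->primed
                   || state->last_ovnsb_seqno != ovnsb
                   || state->last_ovnnb_seqno != ovnnb
                   || state->last_ovnisb_seqno != ovnisb
                   || state->last_ovninb_seqno != ovninb;

    state->primed = true;
    state->last_ovnsb_seqno = ovnsb;
    state->last_ovnnb_seqno = ovnnb;
    state->last_ovnisb_seqno = ovnisb;
    state->last_ovninb_seqno = ovninb;

    return changed;
}
```

Under these assumptions, the main loop would call ovn_db_run() only when ic_seqno_state_update() returns true. This only helps with the idle case (behavior 1): while resources are being created the sequence numbers change on nearly every iteration, so every pass still pays the full cost.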

    -   10.29%  ovn-ic  ovn-ic     [.] ovsdb_datum_compare_3way
       - 36.55% ovsdb_datum_compare_3way
          + 20.33% ovsdb_atom_compare_3way
            7.92% 0x7f6ba2e73806
            1.83% 0x7f6ba2e73b7e
            1.03% 0x7f6ba2e73813
            0.77% 0x7f6ba2e73823
            0.68% 0x7f6ba2e7380c
            0.52% 0x7f6ba2e7381f
       - 10.29% _start
            __libc_start_main
            0x7f6ba2cebd8f
          - main
             + 4.79% port_binding_run
             + 2.11% get_lrp_name_by_ts_port_name
             + 1.38% ovsdb_idl_cursor_first_eq
             + 1.09% ovsdb_idl_index_find
             + 0.92% ovsdb_idl_cursor_next_eq
    +    0.01%  ovn-ic  ovn-ic     [.] ovsdb_idl_index_find
    +    1.25%  ovn-ic  ovn-ic     [.] port_binding_run
    +    6.87%  ovn-ic  ovn-ic     [.] ovsdb_atom_compare_3way
    +    0.01%  ovn-ic  ovn-ic     [.] ovsdb_idl_cursor_first_eq
    +    0.02%  ovn-ic  ovn-ic     [.] get_lrp_name_by_ts_port_name
    -   13.38%  ovn-ic  ovn-ic     [.] json_string
         13.38% _start
            __libc_start_main
            0x7f6ba2cebd8f
          - main
             + 6.86% port_binding_run
             + 2.31% get_lrp_name_by_ts_port_name
             + 2.02% ovsdb_idl_cursor_first_eq
             + 1.43% ovsdb_idl_index_find
             + 0.77% ovsdb_idl_cursor_next_eq
    +    0.20%  ovn-ic  ovn-ic     [.] ip46_parse_cidr
    -    9.49%  ovn-ic  ovn-ic     [.] ovsdb_idl_read
         9.49% _start
            __libc_start_main
            0x7f6ba2cebd8f
          - main
             + 4.31% port_binding_run
             + 2.04% get_lrp_name_by_ts_port_name
             + 1.20% ovsdb_idl_cursor_first_eq
             + 1.07% ovsdb_idl_index_find
             + 0.87% ovsdb_idl_cursor_next_eq
    +    0.00%  ovn-ic  libc.so.6  [.] 0x00007f6ba2e73806
    -    8.37%  ovn-ic  libc.so.6  [.] 0x00000000001b1806
         _start
            __libc_start_main
            0x7f6ba2cebd8f
          - main
             + 4.78% port_binding_run
             + 1.54% get_lrp_name_by_ts_port_name
             + 1.24% ovsdb_idl_cursor_first_eq
             + 0.75% ovsdb_idl_index_find
    -    4.64%  ovn-ic  ovn-ic     [.] smap_find__
       - 4.64% _start
            __libc_start_main
            0x7f6ba2cebd8f
          - main
             + 1.33% smap_get
             + 1.19% port_binding_run
             + 1.18% smap_get_uuid
       - 2.24% smap_find__
            1.48% 0x7f6ba2e74d79
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss