On Tue, Jan 7, 2025 at 11:55 AM Felix Huettner
<felix.huettner@stackit.cloud> wrote:
>
> On Tue, Jan 07, 2025 at 11:31:49AM -0300, Tiago Pires via discuss wrote:
> > Hi guys,
> >
> > We followed the debug steps and using perf we got the following
> > performance trace:
> >
> > -   84.56%     0.01%  ovn-controller  [.] main
> >    - 84.56% main
> >       - 78.52% pinctrl_run
> >          - 39.14% lport_lookup_by_name
> >             - 36.66% ovsdb_idl_index_find
> >                - skiplist_find
> >                   - 36.36% skiplist_forward_to (inlined)
> >                      - skiplist_forward_to_
> >                         - 31.99% ovsdb_idl_index_generic_comparer.part.0
> >                            - 24.68% ovsdb_datum_compare_3way
> >                               - 13.18% ovsdb_atom_compare_3way
> >                                    8.38% json_string
> >                                 6.27% __strcmp_evex
> >                              6.28% ovsdb_idl_read
> >             + 1.07% sbrec_port_binding_index_set_logical_port
> >               0.72% ovsdb_idl_index_destroy_row
> >               0.60% ovsdb_idl_index_init_row
> >          - 15.04% extract_ip_addresses
> >             - parse_and_store_addresses
> >                - 9.90% add_ipv6_netaddr
> >                   - 9.36% __GI_inet_ntop (inlined)
> >                      + inet_ntop6 (inlined)
> >                + 2.78% ipv6_parse_cidr_len
> >                + 2.14% ip_parse_cidr_len
> >          + 8.35% smap_get
> >          + 4.93% smap_get_int
> >          + 3.00% shash_find_data
> >          + 2.00% ipv6_parse
> >          + 1.31% eth_addr_from_string
> >       - 4.46% bfd_run
> >          - 2.00% sset_add
> >               1.43% hash_bytes
> >         0.81% encaps_run
>
> Hi Tiago,
>
> What I find interesting is that pinctrl_run never calls
> lport_lookup_by_name or extract_ip_addresses directly. So unfortunately it's a
> little bit of guesswork where these calls come from; maybe they are
> called in multiple different functions and perf just added them up
> together.
>
> You could try compiling with frame pointers and using those instead of
> dwarf for the call graph, but I am not sure if that helps.
>
> I just wanted to share this as my first impression. Maybe it helps.
>
Hi Felix,

Yes, that is odd, as you noticed. The loop in atom_arrays_compare_3way is
constantly checking the JSON messages, and it seems there is a performance
problem there.
We are testing this patch [0] as a possible short-term solution; in the lab
it reduced the database update size from 1.1 MB to just 24 KB.

[0] 
https://github.com/ovn-org/ovn/commit/2a57b204459612a9a4edb49b9c5ebc44e101ee93#diff-97e16400e2bcbb4b65f7f3b8f2c05e9e8e56148df77719b71d60f235e3bcc0edR4379

Regards,

Tiago Pires

> >
> > Just to recall: the load spike only happens when the ovn-controller
> > receives the update for OVN interconnect logical flows (it does not
> > happen with regular logical flows).
> > We are looking into the lport_lookup_by_name function to try to
> > understand its relation to the OVN interconnect logical flows.
> >
> > Regards,
> >
> > Tiago Pires
> >
> > On Tue, Jan 7, 2025 at 5:54 AM Felix Huettner
> > <felix.huettner@stackit.cloud> wrote:
> > >
> > > On Sat, Jan 04, 2025 at 09:59:28PM -0500, Satish Patel wrote:
> > > > Hi Felix,
> > > >
> > > > Are there any documents on OVN troubleshooting the way you explain it?
> > > > Like what logs to look at, and what tools to use to dump perf data and
> > > > tech-support-style data to drill down into an issue before we report it
> > > > to developers. I have heard in the past about ovn-controller CPU usage
> > > > hitting 100%, etc., but there is no good way to find out what causes
> > > > those spikes and errors.
> > >
> > > Hi Satish,
> > >
> > > I do not know of any documents regarding that. The suggestions below are
> > > based on my experience of looking at similar issues in the past.
> > >
> > > I also do not yet have a nice way to get perf information for spikes,
> > > but most of the time running inc-engine/recompute converts a spike into
> > > continuous 100% CPU usage, which I find easier to analyse.
> > >
> > > To get the perf information you can do the following:
> > > * Ensure you have ovn-controller debug symbols available (how to do
> > >   that depends on your installation method)
> > > * Install "perf". E.g. on Ubuntu it would be something like "apt
> > >   install linux-tools-generic binutils libc-dbg"
> > > * Run "perf record -p $(pidof ovn-controller) -g --call-graph dwarf"
> > >   while the recomputes are running
> > > * Press CTRL+C after a few recompute iterations
> > > * Wait for "perf record" to finish. Do not press CTRL+C again until it
> > >   is done. Depending on how long it ran, this can take some time.
> > > * Run "perf report -g". This will parse the profile and show a nice UI
> > > * You will see a list of functions, each preceded by the percentage of
> > >   CPU time that the process spent there (once in the function alone and
> > >   once in the function plus all subfunctions)
> > > * Expand the tree using the "+" key and mostly follow the top 2
> > >   percentages after each expansion
> > > * Probably somewhere in there is something to optimize (or something
> > >   that has already been optimized)
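For reference, the recording steps above boil down to the following shell sketch. The package names are Ubuntu-specific, as noted above, and this assumes a running ovn-controller:

```shell
# Install perf and debug symbols (Ubuntu package names, as in the steps above)
apt install linux-tools-generic binutils libc-dbg

# Record a call-graph profile of the running ovn-controller;
# stop with a single CTRL+C after a few recompute iterations
perf record -p "$(pidof ovn-controller)" -g --call-graph dwarf

# Parse the profile and open the interactive report UI
perf report -g
```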
> > >
> > > You will see something similar to the "performance trace" in this commit
> > > message
> > > https://github.com/ovn-org/ovn/commit/ea619bf64ce1ae00996928f7ed15309a690ee3b9
> > > In that case bfd_calculate_chassis was the function that was optimized.
> > >
> > > I hope that helps.
> > >
> > > Felix
> > >
> > >
> > > >
> > > > On Thu, Jan 2, 2025 at 9:48 AM Felix Huettner via discuss <
> > > > ovs-discuss@openvswitch.org> wrote:
> > > >
> > > > > On Thu, Jan 02, 2025 at 08:47:34AM -0300, Tiago Pires via discuss wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > While creating OVN interconnect logical flows (transit switch, LRPs,
> > > > > > and LSPs) for a logical router in an AZ, there is a load spike to
> > > > > > 100% in the ovn-controllers on all chassis.
> > > > > > This lasts for a minute and a half and returns to normal after the
> > > > > > logical flows are installed.
> > > > > > Since this behavior only manifests itself while creating OVN
> > > > > > interconnect logical flows, enabling the debug logs shows that the
> > > > > > ovn-controller receives the following change from the SB:
> > > > > >
> > > > > >
> > > > > 2024-12-26T17:38:15.023Z|10562646|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: SB_encap, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:15.023Z|10562647|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: mff_ovn_geneve, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:15.023Z|10562648|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: physical_flow_output, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:15.023Z|10562649|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: flow_output, old_state Unchanged, new_state Stale
> > > > > > 2024-12-26T17:38:15.032Z|10562650|jsonrpc|DBG|ssl:1XX.XX.XX.X16:6642:
> > > > > > received notification, method="update3",
> > > > > >
> > > > > params=[["monid","OVN_Southbound"],"9c54ec46-83a4-4deb-877d-3089a99ae482",{"SB_Global
> > > > > >
> > > > > ":{"019c4225-e637-4881-b382-36fdfcb9e0c9":{"modify":{"nb_cfg":34448}}},"Port_Binding":{"4230d38a-75cd-4f1c-9fde-62e753958656":{"insert":{"up":false,"tunnel_key":7,"mac":"16:00:01:f4:03:cb
> > > > > > 169.254 .30.3/24
> > > > > >
> > > > > fe80:30::3/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-793b8","type":"patch"}},"f4d71615-88b8-4274-87ab-448cf8
> > > > > > c6a6ee":{"insert":{"up":false,"tunnel_key":6,"mac":"16:00:01:75:f8:87
> > > > > > 169.254.10.1/24
> > > > > >
> > > > > fe80:10::1/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-
> > > > > >
> > > > > 9486-e66f9d7b25bf-lrp-ba453","type":"patch"}},"3eac57dd-386c-43b0-bf5b-8225941b52cb":{"insert":{"up":false,"tunnel_key":5,"mac":"16:00:01:9a:30:fa
> > > > > > 169.254.40.4/24 fe80:40::4/64","datapath":["uuid
> > > > > >
> > > > > ","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0","type":"patch"}},"7295d5c6-ff8d-4f4f-b1a5-eb1ca87fcb7d":{"insert":{"up":false,"tunnel_key
> > > > > > ":8,"mac":"16:00:01:de:df:b5 169.254.20.2/24
> > > > > >
> > > > > fe80:20::2/64","datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"logical_port":"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-c21c6","type":"patc
> > > > > >
> > > > > h"}}},"Logical_Flow":{"224852f2-9a46-4569-b51d-5d340ed6f4b4":{"modify":{"logical_dp_group":["uuid","bd825723-9a0e-451c-b5ad-b7d6f9cc75e4"]}},"05f5d515-bccf-47c1-9d55-8a4dfce88822":{"insert":{"pip
> > > > > > eline":"ingress","match":"inport ==
> > > > > > \"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0\" && ip6.dst ==
> > > > > >
> > > > > fe80::/64","priority":194,"logical_datapath":["uuid","e84addb8-efdd-48bc-8454-8dd76a4a5a19"],"
> > > > > > table_id":13,"actions":"ip.ttl--; reg8[0..15] = 0; xxreg0 = ip6.dst;
> > > > > > xxreg1 = fe80::1400:1ff:fe9a:30fa; eth.src = 16:00:01:9a:30:fa;
> > > > > > outport = \"bd7bae47-2482-41cc-9486-e66f9d7b25bf-lrp-36ac0\";
> > > > > > flags.loopback = 1;
> > > > > >
> > > > > next;"}},"c67ec1ee-bfc1-47e0-b35c-95fcc2adc13b":{"modify":{"logical_dp_group":
> > > > > >
> > > > > > The addresses from the above change (16:00:01:75:f8:87
> > > > > > 169.254.10.1/24
> > > > > > fe80:10::1/64) are configured on the LRP connected to the transit
> > > > > > switch.
> > > > > >
> > > > > > After a while another incremental processing run is started:
> > > > > >
> > > > > > 2024-12-26T17:38:18.035Z|11190795|coverage|INFO|Skipping details of
> > > > > > duplicate event coverage for hash=db09e857
> > > > > > 2024-12-26T17:38:18.035Z|11190796|poll_loop|DBG|wakeup due to 
> > > > > > [POLLIN]
> > > > > > on fd 3 (1XX.XX.XX.X99:47382<->1XX.XX.XX.X16:6642) at
> > > > > > lib/stream-ssl.c:842 (100% CPU usage)
> > > > > > 2024-12-26T17:38:18.035Z|11190797|inc_proc_eng|DBG|Initializing new
> > > > > > run
> > > > > 2024-12-26T17:38:18.035Z|11190798|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: SB_sb_global, old_state Updated, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:18.035Z|11190799|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: northd_options, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:18.035Z|11190800|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: SB_dhcp_options, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:18.035Z|11190801|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: SB_dhcpv6_options, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:18.035Z|11190802|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: dhcp_options, old_state Unchanged, new_state Stale
> > > > > >
> > > > > 2024-12-26T17:38:18.035Z|11190803|inc_proc_eng|DBG|lib/inc-proc-eng.c:379:
> > > > > > node: SB_address_set, old_state Unchanged, new_state Stale
> > > > > >
> > > > > > On other chassis we observed the recompute time increasing to 18
> > > > > > seconds:
> > > > > >
> > > > > > 2024-12-26T17:38:46.463Z|40096697|poll_loop|INFO|Dropped 72 log
> > > > > > messages in last 6 seconds (most recently, 0 seconds ago) due to
> > > > > > excessive rate 
> > > > > > 2024-12-26T17:38:46.463Z|40096698|poll_loop|INFO|wakeup
> > > > > > due to [POLLIN] on fd 26 (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at
> > > > > > lib/stream-ssl.c:842 (99% CPU usage)
> > > > > > 2024-12-26T17:38:52.442Z|40096699|poll_loop|INFO|Dropped 72 log
> > > > > > messages in last 6 seconds (most recently, 0 seconds ago) due to
> > > > > > excessive rate 
> > > > > > 2024-12-26T17:38:52.442Z|40096700|poll_loop|INFO|wakeup
> > > > > > due to [POLLIN] on fd 26 (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at
> > > > > > lib/stream-ssl.c:842 (102% CPU usage)
> > > > > > 2024-12-26T17:38:58.409Z|40096701|poll_loop|INFO|Dropped 72 log
> > > > > > messages in last 6 seconds (most recently, 0 seconds ago) due to
> > > > > > excessive rate 
> > > > > > 2024-12-26T17:38:58.409Z|40096702|poll_loop|INFO|wakeup
> > > > > > due to [POLLIN] on fd 26 (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at
> > > > > > lib/stream-ssl.c:842 (101% CPU usage)
> > > > > > 2024-12-26T17:39:04.477Z|40096703|poll_loop|INFO|Dropped 72 log
> > > > > > messages in last 6 seconds (most recently, 0 seconds ago) due to
> > > > > > excessive rate 
> > > > > > 2024-12-26T17:39:04.477Z|40096704|poll_loop|INFO|wakeup
> > > > > > due to [POLLIN] on fd 26 (1XX.XX.XX.6:56350<->1XX.XX.XX.X16:6642) at
> > > > > > lib/stream-ssl.c:842 (98% CPU usage)
> > > > > > 2024-12-26T17:39:27.896Z|40096705|inc_proc_eng|INFO|node:
> > > > > > logical_flow_output, recompute (forced) took 18932ms
> > > > > > 2024-12-26T17:39:28.612Z|40096706|inc_proc_eng|INFO|node:
> > > > > > physical_flow_output, recompute (forced) took 670ms
> > > > > > 2024-12-26T17:39:30.614Z|40096707|timeval|WARN|Unreasonably long
> > > > > > 22134ms poll interval (19780ms user, 2315ms system)
> > > > > > 2024-12-26T17:39:30.614Z|40096708|timeval|WARN|faults: 2634558 
> > > > > > minor,
> > > > > > 0 major 2024-12-26T17:39:30.614Z|40096709|timeval|WARN|disk: 0 
> > > > > > reads,
> > > > > > 8 writes 2024-12-26T17:39:30.614Z|40096710|timeval|WARN|context
> > > > > > switches: 0 voluntary, 2688 involuntary
> > > > > >
> > > > > > This behavior only happens with OVN interconnect logical flows; do
> > > > > > you know of any known issue regarding it?
> > > > >
> > > > > Hi Tiago,
> > > > >
> > > > > From the log above it seems like some part of the previous incremental
> > > > > processing run failed and caused a recompute afterwards, so the last
> > > > > logs are no longer incrementally processed.
> > > > >
> > > > > There should be a log line in between similar to "...forcing recompute
> > > > > next time...". That might help you find the cause of the recompute.
> > > > >
> > > > > Additionally, you should be able to trigger a recompute manually with
> > > > > "ovn-appctl inc-engine/recompute". Then you will probably observe
> > > > > similar logs.
> > > > > You could then attach perf to the ovn-controller for one or more
> > > > > recomputes and use "perf record" and "perf report" with call graphs
> > > > > to figure out what exactly takes so long. That might be helpful
> > > > > information for further optimizations.
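A condensed sketch of the manual-recompute profiling flow just described, assuming a running ovn-controller (the 30-second window is an arbitrary choice for this sketch, not something the thread specifies):

```shell
# Force a full recompute on the local ovn-controller
ovn-appctl -t ovn-controller inc-engine/recompute

# Profile the daemon while it recomputes (30 s is an arbitrary window)
perf record -p "$(pidof ovn-controller)" -g --call-graph dwarf -- sleep 30

# Inspect the resulting call graph
perf report -g
```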
> > > > >
> > > > > Also, OVN has received significant improvements to incremental
> > > > > processing since 23.03, so an upgrade might help as well.
> > > > >
> > > > > Hope that helps
> > > > > Felix
> > > > >
> > > > >
> > > > > >
> > > > > > #Versions in use
> > > > > >
> > > > > > $ ovn-nbctl -V
> > > > > > ovn-nbctl 23.03.1
> > > > > > Open vSwitch Library 3.1.3
> > > > > > DB Schema 7.0.0
> > > > > >
> > > > > > $ ovs-vsctl -V
> > > > > > ovs-vsctl (Open vSwitch) 3.1.3
> > > > > > DB Schema 8.3.1
> > > > > >
> > > > > > external_ids: ovn-monitor-all="true",
> > > > > > ovn-ofctrl-wait-before-clear="10000", 
> > > > > > ovn-openflow-probe-interval="0",
> > > > > > ovn-remote-probe-interval="180000"
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Tiago Pires
> > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > _'This message is intended only for the addresses listed in the
> > > > > > initial header. If you are not listed among those addresses, please
> > > > > > disregard the contents of this message entirely; copying, forwarding,
> > > > > > and/or acting on it are void and prohibited.'_
> > > > > >
> > > > > > _'Although Magazine Luiza takes all reasonable precautions to ensure
> > > > > > that no virus is present in this e-mail, the company cannot accept
> > > > > > responsibility for any loss or damage caused by this e-mail or its
> > > > > > attachments.'_
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > discuss mailing list
> > > > > > disc...@openvswitch.org
> > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

