Hello,
So the things go way deeper and it becomes way strange as i initially thought. I've migrated to L3 networking using ovn-bgp-agent, in order to reduce ARP packets flooded over all ports attached to br-int. However this didn't help at all and some VMs loses external connectivity (but VMs ports didn't get flooded by hundreds pps of ARP).

2025-04-01T01:03:51.368Z|00035|ofproto_dpif_xlate(handler12)|WARN|Dropped 131 log messages in last 55 seconds (most recently, 10 seconds ago) due to excessive rate 2025-04-01T01:03:51.368Z|00036|ofproto_dpif_xlate(handler12)|WARN|over 4096 resubmit actions on bridge br-int while processing udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678 2025-04-01T01:04:35.392Z|00054|ofproto_dpif_upcall(handler60)|WARN|Dropped 133 log messages in last 57 seconds (most recently, 13 seconds ago) due to excessive rate 2025-04-01T01:04:35.392Z|00055|ofproto_dpif_upcall(handler60)|WARN|Flow: udp,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137

bridge("br-int")
----------------
 0. in_port=277, priority 100, cookie 0xa70f8aad
    set_field:0x5d/0xffff->reg13
    set_field:0x2->reg11
    set_field:0x3->reg12
    set_field:0x2->metadata
    set_field:0x2ac->reg14
    set_field:0/0xffff0000->reg13
    resubmit(,8)
 8. metadata=0x2, priority 50, cookie 0xecab0c71
    set_field:0/0x1000->reg10
    resubmit(,73)
    73. reg14=0x2ac,metadata=0x2, priority 80, cookie 0xa70f8aad
            set_field:0x1000/0x1000->reg10
    move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
     -> NXM_NX_XXREG0[111] is now 0x1
    resubmit(,9)
 9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
    drop

Final flow: udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5d,reg14=0x2ac,metadata=0x2,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
Megaflow: recirc_id=0,eth,ip,in_port=277,dl_src=fa:16:3e:24:f1:f7,nw_frag=no
Datapath actions: drop
2025-04-01T01:04:48.434Z|00015|ofproto_dpif_xlate(handler58)|WARN|Dropped 7 log messages in last 55 seconds (most recently, 13 seconds ago) due to excessive rate 2025-04-01T01:04:48.435Z|00016|ofproto_dpif_xlate(handler58)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=487,vlan_tci=0x0000,dl_src=fa:16:3e:de:2a:ce,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=185.255.178.119,arp_tpa=185.255.178.1,arp_op=1,arp_sha=fa:16:3e:de:2a:ce,arp_tha=00:00:00:00:00:00 2025-04-01T01:05:51.377Z|00017|ofproto_dpif_xlate(handler46)|WARN|Dropped 7 log messages in last 61 seconds (most recently, 29 seconds ago) due to excessive rate 2025-04-01T01:05:51.377Z|00018|ofproto_dpif_xlate(handler46)|WARN|over 4096 resubmit actions on bridge br-int while processing udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678 2025-04-01T01:05:51.377Z|00019|ofproto_dpif_upcall(handler46)|WARN|Dropped 8 log messages in last 63 seconds (most recently, 29 seconds ago) due to excessive rate 2025-04-01T01:05:51.377Z|00020|ofproto_dpif_upcall(handler46)|WARN|Flow: udp,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678

bridge("br-int")
----------------
 0. in_port=278, priority 100, cookie 0xd191f4de
    set_field:0x5e/0xffff->reg13
    set_field:0x2->reg11
    set_field:0x3->reg12
    set_field:0x2->metadata
    set_field:0x33d->reg14
    set_field:0/0xffff0000->reg13
    resubmit(,8)
 8. metadata=0x2, priority 50, cookie 0xecab0c71
    set_field:0/0x1000->reg10
    resubmit(,73)
    73. reg14=0x33d,metadata=0x2, priority 80, cookie 0xd191f4de
            set_field:0x1000/0x1000->reg10
    move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
     -> NXM_NX_XXREG0[111] is now 0x1
    resubmit(,9)
 9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
    drop

Final flow: udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5e,reg14=0x33d,metadata=0x2,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
Megaflow: recirc_id=0,eth,ip,in_port=278,dl_src=fa:16:3e:66:0c:6e,nw_frag=no
Datapath actions: drop

When i'm seeing this issue on mentioned IPs i can see that VM tries to resolve default gateway mac address but with no success since ARP is dropped somewhere because of resubmits and drop action on datapath. I guess now this is not related to OVN controller. Meanwhile DHCP works pretty good :) Option broadcast-arps-to-all-routers is disabled at both of provider network Logical Switches.

Could there be any solution for this?

Regards,
Ilia Baikov

26.03.2025 12:34, Ilia Baikov пишет:
Got my hands on this, back to debugging. Seems like kernel runs stable
# uname -r
6.14.0-061400-generic
Meanwhile there is no unrecognized(27) related logs.
tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131 2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255 2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131 2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255 2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131 2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255 2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131 2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255 2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131 2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255 2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:50:22:c4 185.255.178.170 2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1 2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:18:ac:4a 89.169.15.224 2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1

And yes, there is logs about resubmit actions which are expected as you said.
->reg15
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8 priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9) OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8 priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9) OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80 priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8) OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43 priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x1->reg15,resubmit(,43

OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42 OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)

top - 09:29:33 up 16:58,  3 users,  load average: 37.70, 37.39, 36.95
Threads: 114 total,  13 running, 101 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.5 us, 11.2 sy,  0.0 ni, 60.3 id,  0.5 wa,  0.0 hi, 0.6 si,  0.0 st MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used,  26351.4 buff/cache MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345358.7 avail Mem

#top -H -p $(pidof ovs-vswitchd)
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+ COMMAND    6212 root      20   0 8388104 662016   7940 R  22.3   0.1 32:13.87 revalidator102    6200 root      20   0 8388104 662016   7940 R  21.6   0.1 73:01.76 revalidator90    6202 root      20   0 8388104 662016   7940 R  20.6   0.1 32:13.77 revalidator92    6217 root      20   0 8388104 662016   7940 R  20.3   0.1 26:07.50 revalidator107    6214 root      20   0 8388104 662016   7940 R  18.9   0.1 39:12.60 revalidator104    6219 root      20   0 8388104 662016   7940 R  18.6   0.1 25:57.96 revalidator109    6211 root      20   0 8388104 662016   7940 R  17.6   0.1 39:20.20 revalidator101    6207 root      20   0 8388104 662016   7940 R  13.6   0.1 17:53.31 revalidator97    6204 root      20   0 8388104 662016   7940 R  13.0   0.1 32:34.67 revalidator94    6209 root      20   0 8388104 662016   7940 R  12.6   0.1 35:55.88 revalidator100    6213 root      20   0 8388104 662016   7940 R  12.6   0.1 18:04.43 revalidator103    6220 root      20   0 8388104 662016   7940 R  12.3   0.1 5:52.71 revalidator110    6218 root      20   0 8388104 662016   7940 R  12.0   0.1 9:09.66 revalidator108    6215 root      20   0 8388104 662016   7940 S   8.6   0.1 22:22.27 revalidator105    6221 root      20   0 8388104 662016   7940 S   8.6   0.1 4:44.93 revalidator111    6208 root      20   0 8388104 662016   7940 S   8.0   0.1 22:55.17 revalidator98    6203 root      20   0 8388104 662016   7940 S   6.0   0.1 38:52.72 revalidator93    6206 root      20   0 8388104 662016   7940 S   6.0   0.1 22:11.55 revalidator96    6216 root      20   0 8388104 662016   7940 S   4.3   0.1 33:14.52 revalidator106

Sometimes revalidator processes drops to 5-10%, sometimes to 30%. I guess this behaviour is because of resubmit actions?. So far so good controller feeling fine, but there could be some sort of freezes on sending DHCPREPLY.

top -H -p $(pidof ovn-controller)
top - 09:30:55 up 16:59,  3 users,  load average: 35.73, 36.70, 36.73
Threads:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.7 us, 10.6 sy,  0.0 ni, 60.8 id,  0.3 wa,  0.0 hi, 0.6 si,  0.0 st MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used,  26357.8 buff/cache MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345438.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+ COMMAND    5045 root      20   0  396500  83836   4432 S   0.7   0.0 11:46.59 ovn-controller    5412 root      20   0  396500  83836   4432 S   0.0   0.0 0:06.23 ovn_pinctrl0    5413 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.00 urcu1    5414 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.20 ovn_statctrl2    6103 root      20   0  396500  83836   4432 S   0.0   0.0 0:04.10 stopwatch3


Trying to reproduce on a real-world environment. There is 300 instances running with about 300Mbps network traffic in total.
Is there more logs or debug i can provide?

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Reply via email to