On 4/1/25 3:15 AM, Ilia Baikov wrote:
> Hello,

Hi Ilia,

> So the things go way deeper and it becomes way strange as i initially
> thought.
> I've migrated to L3 networking using ovn-bgp-agent, in order to reduce
> ARP packets flooded over all ports attached to br-int. However this
> didn't help at all and some VMs loses external connectivity (but VMs
> ports didn't get flooded by hundreds pps of ARP).
> 
> 2025-04-01T01:03:51.368Z|00035|ofproto_dpif_xlate(handler12)|WARN|
> Dropped 131 log messages in last 55 seconds (most recently, 10 seconds
> ago) due to excessive rate
> 2025-04-01T01:03:51.368Z|00036|ofproto_dpif_xlate(handler12)|WARN|over
> 4096 resubmit actions on bridge br-int while processing
> udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
> 2025-04-01T01:04:35.392Z|00054|ofproto_dpif_upcall(handler60)|WARN|
> Dropped 133 log messages in last 57 seconds (most recently, 13 seconds
> ago) due to excessive rate
> 2025-04-01T01:04:35.392Z|00055|ofproto_dpif_upcall(handler60)|WARN|Flow:
> udp,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
> 
> bridge("br-int")
> ----------------
>  0. in_port=277, priority 100, cookie 0xa70f8aad
>     set_field:0x5d/0xffff->reg13
>     set_field:0x2->reg11
>     set_field:0x3->reg12
>     set_field:0x2->metadata
>     set_field:0x2ac->reg14
>     set_field:0/0xffff0000->reg13
>     resubmit(,8)
>  8. metadata=0x2, priority 50, cookie 0xecab0c71
>     set_field:0/0x1000->reg10
>     resubmit(,73)
>     73. reg14=0x2ac,metadata=0x2, priority 80, cookie 0xa70f8aad
>             set_field:0x1000/0x1000->reg10
>     move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>      -> NXM_NX_XXREG0[111] is now 0x1
>     resubmit(,9)
>  9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>     drop
> 
> Final flow:
> udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5d,reg14=0x2ac,metadata=0x2,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
> Megaflow:
> recirc_id=0,eth,ip,in_port=277,dl_src=fa:16:3e:24:f1:f7,nw_frag=no
> Datapath actions: drop
> 2025-04-01T01:04:48.434Z|00015|ofproto_dpif_xlate(handler58)|WARN|
> Dropped 7 log messages in last 55 seconds (most recently, 13 seconds
> ago) due to excessive rate
> 2025-04-01T01:04:48.435Z|00016|ofproto_dpif_xlate(handler58)|WARN|over
> 4096 resubmit actions on bridge br-int while processing
> arp,in_port=487,vlan_tci=0x0000,dl_src=fa:16:3e:de:2a:ce,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=185.255.178.119,arp_tpa=185.255.178.1,arp_op=1,arp_sha=fa:16:3e:de:2a:ce,arp_tha=00:00:00:00:00:00
> 2025-04-01T01:05:51.377Z|00017|ofproto_dpif_xlate(handler46)|WARN|
> Dropped 7 log messages in last 61 seconds (most recently, 29 seconds
> ago) due to excessive rate
> 2025-04-01T01:05:51.377Z|00018|ofproto_dpif_xlate(handler46)|WARN|over
> 4096 resubmit actions on bridge br-int while processing
> udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
> 2025-04-01T01:05:51.377Z|00019|ofproto_dpif_upcall(handler46)|WARN|
> Dropped 8 log messages in last 63 seconds (most recently, 29 seconds
> ago) due to excessive rate
> 2025-04-01T01:05:51.377Z|00020|ofproto_dpif_upcall(handler46)|WARN|Flow:
> udp,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
> 
> bridge("br-int")
> ----------------
>  0. in_port=278, priority 100, cookie 0xd191f4de
>     set_field:0x5e/0xffff->reg13
>     set_field:0x2->reg11
>     set_field:0x3->reg12
>     set_field:0x2->metadata
>     set_field:0x33d->reg14
>     set_field:0/0xffff0000->reg13
>     resubmit(,8)
>  8. metadata=0x2, priority 50, cookie 0xecab0c71
>     set_field:0/0x1000->reg10
>     resubmit(,73)
>     73. reg14=0x33d,metadata=0x2, priority 80, cookie 0xd191f4de
>             set_field:0x1000/0x1000->reg10
>     move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>      -> NXM_NX_XXREG0[111] is now 0x1
>     resubmit(,9)
>  9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>     drop
> 
> Final flow:
> udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5e,reg14=0x33d,metadata=0x2,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
> Megaflow:
> recirc_id=0,eth,ip,in_port=278,dl_src=fa:16:3e:66:0c:6e,nw_frag=no
> Datapath actions: drop
> 
> When i'm seeing this issue on mentioned IPs i can see that VM tries to
> resolve default gateway mac address but with no success since ARP is
> dropped somewhere because of resubmits and drop action on datapath. I
> guess now this is not related to OVN controller. Meanwhile DHCP works
> pretty good :)

Glad to hear about DHCP.

> Option broadcast-arps-to-all-routers is disabled at both of provider
> network Logical Switches.
> 
> Could there be any solution for this?

It's not a solution, more of a change to confirm whether this is causing
issues, but could you please try with this commit?

https://github.com/dceara/ovn/commit/fac23e2f6ef6effe3f9a2e0310e78d085750488b

The commit disables flooding of ARP requests that target OVN owned
router IPs to non-router ports.  It's not something that can be accepted
as is because it breaks some other things, e.g., OVN generated GARP
requests will not be forwarded properly.

The patch applies on top of the ovn main branch.  If you want to try it
out on older branches the easiest way is to just ignore the test changes.

Looking forward to hear how it went.

Thanks,
Dumitru

> 
> Regards,
> Ilia Baikov
> 
> 26.03.2025 12:34, Ilia Baikov пишет:
>> Got my hands on this, back to debugging. Seems like kernel runs stable
>> # uname -r
>> 6.14.0-061400-generic
>> Meanwhile there is no unrecognized(27) related logs.
>> tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
>> 2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:9c:f4:45 185.255.178.131
>> 2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>> 2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:9c:f4:45 185.255.178.131
>> 2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>> 2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:9c:f4:45 185.255.178.131
>> 2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>> 2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:9c:f4:45 185.255.178.131
>> 2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>> 2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:9c:f4:45 185.255.178.131
>> 2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>> 2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:50:22:c4 185.255.178.170
>> 2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-
>> mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1
>> 2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>> fa:16:3e:18:ac:4a 89.169.15.224
>> 2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>> OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-
>> mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1
>>
>> And yes, there is logs about resubmit actions which are expected as
>> you said.
>> ->reg15
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>> OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8
>> priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0
>>  cookie:0xd1588295 
>> actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>> OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8
>> priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4
>>  cookie:0xd1588295 
>> actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80
>> priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7
>> actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8)
>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43
>> priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7
>> actions=set_field:0x1->reg15,resubmit(,43
>>
>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER
>> actions=set_field:0x2->metadata,set_field:0x503-
>> >reg14,resubmit(CONTROLLER,8) data_len=42
>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER
>> actions=set_field:0x2->metadata,set_field:0x503-
>> >reg14,resubmit(CONTROLLER,8) data_len=42
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>  continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>
>> top - 09:29:33 up 16:58,  3 users,  load average: 37.70, 37.39, 36.95
>> Threads: 114 total,  13 running, 101 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 27.5 us, 11.2 sy,  0.0 ni, 60.3 id,  0.5 wa,  0.0 hi, 0.6
>> si,  0.0 st
>> MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used,  26351.4 buff/
>> cache
>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345358.7 avail
>> Mem
>>
>> #top -H -p $(pidof ovs-vswitchd)
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+
>> COMMAND
>>    6212 root      20   0 8388104 662016   7940 R  22.3   0.1 32:13.87
>> revalidator102
>>    6200 root      20   0 8388104 662016   7940 R  21.6   0.1 73:01.76
>> revalidator90
>>    6202 root      20   0 8388104 662016   7940 R  20.6   0.1 32:13.77
>> revalidator92
>>    6217 root      20   0 8388104 662016   7940 R  20.3   0.1 26:07.50
>> revalidator107
>>    6214 root      20   0 8388104 662016   7940 R  18.9   0.1 39:12.60
>> revalidator104
>>    6219 root      20   0 8388104 662016   7940 R  18.6   0.1 25:57.96
>> revalidator109
>>    6211 root      20   0 8388104 662016   7940 R  17.6   0.1 39:20.20
>> revalidator101
>>    6207 root      20   0 8388104 662016   7940 R  13.6   0.1 17:53.31
>> revalidator97
>>    6204 root      20   0 8388104 662016   7940 R  13.0   0.1 32:34.67
>> revalidator94
>>    6209 root      20   0 8388104 662016   7940 R  12.6   0.1 35:55.88
>> revalidator100
>>    6213 root      20   0 8388104 662016   7940 R  12.6   0.1 18:04.43
>> revalidator103
>>    6220 root      20   0 8388104 662016   7940 R  12.3   0.1 5:52.71
>> revalidator110
>>    6218 root      20   0 8388104 662016   7940 R  12.0   0.1 9:09.66
>> revalidator108
>>    6215 root      20   0 8388104 662016   7940 S   8.6   0.1 22:22.27
>> revalidator105
>>    6221 root      20   0 8388104 662016   7940 S   8.6   0.1 4:44.93
>> revalidator111
>>    6208 root      20   0 8388104 662016   7940 S   8.0   0.1 22:55.17
>> revalidator98
>>    6203 root      20   0 8388104 662016   7940 S   6.0   0.1 38:52.72
>> revalidator93
>>    6206 root      20   0 8388104 662016   7940 S   6.0   0.1 22:11.55
>> revalidator96
>>    6216 root      20   0 8388104 662016   7940 S   4.3   0.1 33:14.52
>> revalidator106
>>
>> Sometimes revalidator processes drops to 5-10%, sometimes to 30%. I
>> guess this behaviour is because of resubmit actions?. So far so good
>> controller feeling fine, but there could be some sort of freezes on
>> sending DHCPREPLY.
>>
>> top -H -p $(pidof ovn-controller)
>> top - 09:30:55 up 16:59,  3 users,  load average: 35.73, 36.70, 36.73
>> Threads:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 27.7 us, 10.6 sy,  0.0 ni, 60.8 id,  0.3 wa,  0.0 hi, 0.6
>> si,  0.0 st
>> MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used,  26357.8 buff/
>> cache
>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345438.0 avail
>> Mem
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+
>> COMMAND
>>    5045 root      20   0  396500  83836   4432 S   0.7   0.0 11:46.59
>> ovn-controller
>>    5412 root      20   0  396500  83836   4432 S   0.0   0.0 0:06.23
>> ovn_pinctrl0
>>    5413 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.00
>> urcu1
>>    5414 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.20
>> ovn_statctrl2
>>    6103 root      20   0  396500  83836   4432 S   0.0   0.0 0:04.10
>> stopwatch3
>>
>>
>> Trying to reproduce on a real-world environment. There is 300
>> instances running with about 300Mbps network traffic in total.
>> Is there more logs or debug i can provide?
>>
> 

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Reply via email to