On 4/1/25 7:30 PM, Ilia Baikov wrote:
> Hi Dumitru,

Hi Ilia,

> Sure, let's give it a try. Is it good idea to apply this patch on top of
> the patch you previously sent to try?

Yes, you're right, we should apply it on top of that.  I pushed both
patches here:

https://github.com/dceara/ovn/commits/refs/heads/tmp-branch-24.09-revert-mg-split/

Please ignore the CI failures, like mentioned yesterday the last commit
I did is known to break stuff, it's just to confirm that our problem in
this case is due to the MC_FLOOD_L2 output we do for ARP requests
targetting OVN router IPs.

> What topology for testing is more preferrable to try it out? L2 or L3
> (bgp-based networking)? In case of L3 there is actually small amount of
> ARP pps compared to L2 where uplink interface (like eno1) is member of
> br-ex bridge.
> 

Would it be possible to try both?

> Thank you for your involvement, really appreciate it!
> 

No problem, thanks for taking time to help us try things out!

Regards,
Dumitru

> 01.04.2025 15:06, Dumitru Ceara wrote:
>> On 4/1/25 3:15 AM, Ilia Baikov wrote:
>>> Hello,
>> Hi Ilia,
>>
>>> So the things go way deeper and it becomes way strange as i initially
>>> thought.
>>> I've migrated to L3 networking using ovn-bgp-agent, in order to reduce
>>> ARP packets flooded over all ports attached to br-int. However this
>>> didn't help at all and some VMs loses external connectivity (but VMs
>>> ports didn't get flooded by hundreds pps of ARP).
>>>
>>> 2025-04-01T01:03:51.368Z|00035|ofproto_dpif_xlate(handler12)|WARN|
>>> Dropped 131 log messages in last 55 seconds (most recently, 10 seconds
>>> ago) due to excessive rate
>>> 2025-04-01T01:03:51.368Z|00036|ofproto_dpif_xlate(handler12)|WARN|over
>>> 4096 resubmit actions on bridge br-int while processing
>>> udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> 2025-04-01T01:04:35.392Z|00054|ofproto_dpif_upcall(handler60)|WARN|
>>> Dropped 133 log messages in last 57 seconds (most recently, 13 seconds
>>> ago) due to excessive rate
>>> 2025-04-01T01:04:35.392Z|00055|ofproto_dpif_upcall(handler60)|WARN|Flow:
>>> udp,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
>>>
>>> bridge("br-int")
>>> ----------------
>>>   0. in_port=277, priority 100, cookie 0xa70f8aad
>>>      set_field:0x5d/0xffff->reg13
>>>      set_field:0x2->reg11
>>>      set_field:0x3->reg12
>>>      set_field:0x2->metadata
>>>      set_field:0x2ac->reg14
>>>      set_field:0/0xffff0000->reg13
>>>      resubmit(,8)
>>>   8. metadata=0x2, priority 50, cookie 0xecab0c71
>>>      set_field:0/0x1000->reg10
>>>      resubmit(,73)
>>>      73. reg14=0x2ac,metadata=0x2, priority 80, cookie 0xa70f8aad
>>>              set_field:0x1000/0x1000->reg10
>>>      move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>>>       -> NXM_NX_XXREG0[111] is now 0x1
>>>      resubmit(,9)
>>>   9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>>>      drop
>>>
>>> Final flow:
>>> udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5d,reg14=0x2ac,metadata=0x2,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
>>> Megaflow:
>>> recirc_id=0,eth,ip,in_port=277,dl_src=fa:16:3e:24:f1:f7,nw_frag=no
>>> Datapath actions: drop
>>> 2025-04-01T01:04:48.434Z|00015|ofproto_dpif_xlate(handler58)|WARN|
>>> Dropped 7 log messages in last 55 seconds (most recently, 13 seconds
>>> ago) due to excessive rate
>>> 2025-04-01T01:04:48.435Z|00016|ofproto_dpif_xlate(handler58)|WARN|over
>>> 4096 resubmit actions on bridge br-int while processing
>>> arp,in_port=487,vlan_tci=0x0000,dl_src=fa:16:3e:de:2a:ce,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=185.255.178.119,arp_tpa=185.255.178.1,arp_op=1,arp_sha=fa:16:3e:de:2a:ce,arp_tha=00:00:00:00:00:00
>>> 2025-04-01T01:05:51.377Z|00017|ofproto_dpif_xlate(handler46)|WARN|
>>> Dropped 7 log messages in last 61 seconds (most recently, 29 seconds
>>> ago) due to excessive rate
>>> 2025-04-01T01:05:51.377Z|00018|ofproto_dpif_xlate(handler46)|WARN|over
>>> 4096 resubmit actions on bridge br-int while processing
>>> udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> 2025-04-01T01:05:51.377Z|00019|ofproto_dpif_upcall(handler46)|WARN|
>>> Dropped 8 log messages in last 63 seconds (most recently, 29 seconds
>>> ago) due to excessive rate
>>> 2025-04-01T01:05:51.377Z|00020|ofproto_dpif_upcall(handler46)|WARN|Flow:
>>> udp,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>>
>>> bridge("br-int")
>>> ----------------
>>>   0. in_port=278, priority 100, cookie 0xd191f4de
>>>      set_field:0x5e/0xffff->reg13
>>>      set_field:0x2->reg11
>>>      set_field:0x3->reg12
>>>      set_field:0x2->metadata
>>>      set_field:0x33d->reg14
>>>      set_field:0/0xffff0000->reg13
>>>      resubmit(,8)
>>>   8. metadata=0x2, priority 50, cookie 0xecab0c71
>>>      set_field:0/0x1000->reg10
>>>      resubmit(,73)
>>>      73. reg14=0x33d,metadata=0x2, priority 80, cookie 0xd191f4de
>>>              set_field:0x1000/0x1000->reg10
>>>      move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>>>       -> NXM_NX_XXREG0[111] is now 0x1
>>>      resubmit(,9)
>>>   9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>>>      drop
>>>
>>> Final flow:
>>> udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5e,reg14=0x33d,metadata=0x2,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> Megaflow:
>>> recirc_id=0,eth,ip,in_port=278,dl_src=fa:16:3e:66:0c:6e,nw_frag=no
>>> Datapath actions: drop
>>>
>>> When i'm seeing this issue on mentioned IPs i can see that VM tries to
>>> resolve default gateway mac address but with no success since ARP is
>>> dropped somewhere because of resubmits and drop action on datapath. I
>>> guess now this is not related to OVN controller. Meanwhile DHCP works
>>> pretty good :)
>> Glad to hear about DHCP.
>>
>>> Option broadcast-arps-to-all-routers is disabled at both of provider
>>> network Logical Switches.
>>>
>>> Could there be any solution for this?
>> It's not a solution, more of a change to confirm whether this is causing
>> issues, but could you please try with this commit?
>>
>> https://github.com/dceara/ovn/commit/
>> fac23e2f6ef6effe3f9a2e0310e78d085750488b
>>
>> The commit disables flooding of ARP requests that target OVN owned
>> router IPs to non-router ports.  It's not something that can be accepted
>> as is because it breaks some other things, e.g., OVN generated GARP
>> requests will not be forwarded properly.
>>
>> The patch applies on top of the ovn main branch.  If you want to try it
>> out on older branches the easiest way is to just ignore the test changes.
>>
>> Looking forward to hear how it went.
>>
>> Thanks,
>> Dumitru
>>
>>> Regards,
>>> Ilia Baikov
>>>
>>> 26.03.2025 12:34, Ilia Baikov пишет:
>>>> Got my hands on this, back to debugging. Seems like kernel runs stable
>>>> # uname -r
>>>> 6.14.0-061400-generic
>>>> Meanwhile there is no unrecognized(27) related logs.
>>>> tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
>>>> 2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>>>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>>>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>>>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>>>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-
>>>> mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:50:22:c4 185.255.178.170
>>>> 2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-
>>>> mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1
>>>> 2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
>>>> fa:16:3e:18:ac:4a 89.169.15.224
>>>> 2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl
>>>> received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0|
>>>> OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-
>>>> mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1
>>>>
>>>> And yes, there is logs about resubmit actions which are expected as
>>>> you said.
>>>> ->reg15
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8
>>>> priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0
>>>>  cookie:0xd1588295 
>>>> actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8
>>>> priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4
>>>>  cookie:0xd1588295 
>>>> actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80
>>>> priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7
>>>> actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43
>>>> priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7
>>>> actions=set_field:0x1->reg15,resubmit(,43
>>>>
>>>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER
>>>> actions=set_field:0x2->metadata,set_field:0x503-
>>>>> reg14,resubmit(CONTROLLER,8) data_len=42
>>>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER
>>>> actions=set_field:0x2->metadata,set_field:0x503-
>>>>> reg14,resubmit(CONTROLLER,8) data_len=42
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>   continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>
>>>> top - 09:29:33 up 16:58,  3 users,  load average: 37.70, 37.39, 36.95
>>>> Threads: 114 total,  13 running, 101 sleeping,   0 stopped,   0 zombie
>>>> %Cpu(s): 27.5 us, 11.2 sy,  0.0 ni, 60.3 id,  0.5 wa,  0.0 hi, 0.6
>>>> si,  0.0 st
>>>> MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used,  26351.4 buff/
>>>> cache
>>>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345358.7 avail
>>>> Mem
>>>>
>>>> #top -H -p $(pidof ovs-vswitchd)
>>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+
>>>> COMMAND
>>>>     6212 root      20   0 8388104 662016   7940 R  22.3   0.1 32:13.87
>>>> revalidator102
>>>>     6200 root      20   0 8388104 662016   7940 R  21.6   0.1 73:01.76
>>>> revalidator90
>>>>     6202 root      20   0 8388104 662016   7940 R  20.6   0.1 32:13.77
>>>> revalidator92
>>>>     6217 root      20   0 8388104 662016   7940 R  20.3   0.1 26:07.50
>>>> revalidator107
>>>>     6214 root      20   0 8388104 662016   7940 R  18.9   0.1 39:12.60
>>>> revalidator104
>>>>     6219 root      20   0 8388104 662016   7940 R  18.6   0.1 25:57.96
>>>> revalidator109
>>>>     6211 root      20   0 8388104 662016   7940 R  17.6   0.1 39:20.20
>>>> revalidator101
>>>>     6207 root      20   0 8388104 662016   7940 R  13.6   0.1 17:53.31
>>>> revalidator97
>>>>     6204 root      20   0 8388104 662016   7940 R  13.0   0.1 32:34.67
>>>> revalidator94
>>>>     6209 root      20   0 8388104 662016   7940 R  12.6   0.1 35:55.88
>>>> revalidator100
>>>>     6213 root      20   0 8388104 662016   7940 R  12.6   0.1 18:04.43
>>>> revalidator103
>>>>     6220 root      20   0 8388104 662016   7940 R  12.3   0.1 5:52.71
>>>> revalidator110
>>>>     6218 root      20   0 8388104 662016   7940 R  12.0   0.1 9:09.66
>>>> revalidator108
>>>>     6215 root      20   0 8388104 662016   7940 S   8.6   0.1 22:22.27
>>>> revalidator105
>>>>     6221 root      20   0 8388104 662016   7940 S   8.6   0.1 4:44.93
>>>> revalidator111
>>>>     6208 root      20   0 8388104 662016   7940 S   8.0   0.1 22:55.17
>>>> revalidator98
>>>>     6203 root      20   0 8388104 662016   7940 S   6.0   0.1 38:52.72
>>>> revalidator93
>>>>     6206 root      20   0 8388104 662016   7940 S   6.0   0.1 22:11.55
>>>> revalidator96
>>>>     6216 root      20   0 8388104 662016   7940 S   4.3   0.1 33:14.52
>>>> revalidator106
>>>>
>>>> Sometimes revalidator processes drops to 5-10%, sometimes to 30%. I
>>>> guess this behaviour is because of resubmit actions?. So far so good
>>>> controller feeling fine, but there could be some sort of freezes on
>>>> sending DHCPREPLY.
>>>>
>>>> top -H -p $(pidof ovn-controller)
>>>> top - 09:30:55 up 16:59,  3 users,  load average: 35.73, 36.70, 36.73
>>>> Threads:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
>>>> %Cpu(s): 27.7 us, 10.6 sy,  0.0 ni, 60.8 id,  0.3 wa,  0.0 hi, 0.6
>>>> si,  0.0 st
>>>> MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used,  26357.8 buff/
>>>> cache
>>>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345438.0 avail
>>>> Mem
>>>>
>>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM TIME+
>>>> COMMAND
>>>>     5045 root      20   0  396500  83836   4432 S   0.7   0.0 11:46.59
>>>> ovn-controller
>>>>     5412 root      20   0  396500  83836   4432 S   0.0   0.0 0:06.23
>>>> ovn_pinctrl0
>>>>     5413 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.00
>>>> urcu1
>>>>     5414 root      20   0  396500  83836   4432 S   0.0   0.0 0:00.20
>>>> ovn_statctrl2
>>>>     6103 root      20   0  396500  83836   4432 S   0.0   0.0 0:04.10
>>>> stopwatch3
>>>>
>>>>
>>>> Trying to reproduce on a real-world environment. There is 300
>>>> instances running with about 300Mbps network traffic in total.
>>>> Is there more logs or debug i can provide?
>>>>

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Reply via email to