On 4/1/25 7:30 PM, Ilia Baikov wrote:
> Hi Dumitru,

Hi Ilia,
> Sure, let's give it a try. Is it a good idea to apply this patch on top
> of the patch you previously sent to try?

Yes, you're right, we should apply it on top of that. I pushed both
patches here:

https://github.com/dceara/ovn/commits/refs/heads/tmp-branch-24.09-revert-mg-split/

Please ignore the CI failures; as mentioned yesterday, the last commit I
did is known to break stuff. It is only there to confirm that our
problem in this case is due to the MC_FLOOD_L2 output we do for ARP
requests targeting OVN router IPs.

> What topology for testing is preferable to try it out? L2 or L3
> (bgp-based networking)? In the L3 case there is actually a small amount
> of ARP pps compared to L2, where the uplink interface (like eno1) is a
> member of the br-ex bridge.
>

Would it be possible to try both?

> Thank you for your involvement, really appreciate it!
>

No problem, thanks for taking the time to help us try things out!

Regards,
Dumitru

> 01.04.2025 15:06, Dumitru Ceara wrote:
>> On 4/1/25 3:15 AM, Ilia Baikov wrote:
>>> Hello,
>>
>> Hi Ilia,
>>
>>> So things go way deeper, and it gets much stranger than I initially
>>> thought.
>>> I've migrated to L3 networking using ovn-bgp-agent in order to reduce
>>> the ARP packets flooded over all ports attached to br-int. However,
>>> this didn't help at all and some VMs lose external connectivity (but
>>> the VM ports didn't get flooded by hundreds of pps of ARP).
>>>
>>> 2025-04-01T01:03:51.368Z|00035|ofproto_dpif_xlate(handler12)|WARN|Dropped 131 log messages in last 55 seconds (most recently, 10 seconds ago) due to excessive rate
>>> 2025-04-01T01:03:51.368Z|00036|ofproto_dpif_xlate(handler12)|WARN|over 4096 resubmit actions on bridge br-int while processing udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> 2025-04-01T01:04:35.392Z|00054|ofproto_dpif_upcall(handler60)|WARN|Dropped 133 log messages in last 57 seconds (most recently, 13 seconds ago) due to excessive rate
>>> 2025-04-01T01:04:35.392Z|00055|ofproto_dpif_upcall(handler60)|WARN|Flow: udp,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
>>>
>>> bridge("br-int")
>>> ----------------
>>>  0. in_port=277, priority 100, cookie 0xa70f8aad
>>>     set_field:0x5d/0xffff->reg13
>>>     set_field:0x2->reg11
>>>     set_field:0x3->reg12
>>>     set_field:0x2->metadata
>>>     set_field:0x2ac->reg14
>>>     set_field:0/0xffff0000->reg13
>>>     resubmit(,8)
>>>  8. metadata=0x2, priority 50, cookie 0xecab0c71
>>>     set_field:0/0x1000->reg10
>>>     resubmit(,73)
>>> 73. reg14=0x2ac,metadata=0x2, priority 80, cookie 0xa70f8aad
>>>     set_field:0x1000/0x1000->reg10
>>>     move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>>>      -> NXM_NX_XXREG0[111] is now 0x1
>>>     resubmit(,9)
>>>  9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>>>     drop
>>>
>>> Final flow: udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5d,reg14=0x2ac,metadata=0x2,in_port=277,vlan_tci=0x0000,dl_src=fa:16:3e:24:f1:f7,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.148,nw_dst=83.217.210.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=137,tp_dst=137
>>> Megaflow: recirc_id=0,eth,ip,in_port=277,dl_src=fa:16:3e:24:f1:f7,nw_frag=no
>>> Datapath actions: drop
>>>
>>> 2025-04-01T01:04:48.434Z|00015|ofproto_dpif_xlate(handler58)|WARN|Dropped 7 log messages in last 55 seconds (most recently, 13 seconds ago) due to excessive rate
>>> 2025-04-01T01:04:48.435Z|00016|ofproto_dpif_xlate(handler58)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=487,vlan_tci=0x0000,dl_src=fa:16:3e:de:2a:ce,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=185.255.178.119,arp_tpa=185.255.178.1,arp_op=1,arp_sha=fa:16:3e:de:2a:ce,arp_tha=00:00:00:00:00:00
>>> 2025-04-01T01:05:51.377Z|00017|ofproto_dpif_xlate(handler46)|WARN|Dropped 7 log messages in last 61 seconds (most recently, 29 seconds ago) due to excessive rate
>>> 2025-04-01T01:05:51.377Z|00018|ofproto_dpif_xlate(handler46)|WARN|over 4096 resubmit actions on bridge br-int while processing udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> 2025-04-01T01:05:51.377Z|00019|ofproto_dpif_upcall(handler46)|WARN|Dropped 8 log messages in last 63 seconds (most recently, 29 seconds ago) due to excessive rate
>>> 2025-04-01T01:05:51.377Z|00020|ofproto_dpif_upcall(handler46)|WARN|Flow: udp,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>>
>>> bridge("br-int")
>>> ----------------
>>>  0. in_port=278, priority 100, cookie 0xd191f4de
>>>     set_field:0x5e/0xffff->reg13
>>>     set_field:0x2->reg11
>>>     set_field:0x3->reg12
>>>     set_field:0x2->metadata
>>>     set_field:0x33d->reg14
>>>     set_field:0/0xffff0000->reg13
>>>     resubmit(,8)
>>>  8. metadata=0x2, priority 50, cookie 0xecab0c71
>>>     set_field:0/0x1000->reg10
>>>     resubmit(,73)
>>> 73. reg14=0x33d,metadata=0x2, priority 80, cookie 0xd191f4de
>>>     set_field:0x1000/0x1000->reg10
>>>     move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
>>>      -> NXM_NX_XXREG0[111] is now 0x1
>>>     resubmit(,9)
>>>  9. reg0=0x8000/0x8000,metadata=0x2, priority 50, cookie 0x1a158a65
>>>     drop
>>>
>>> Final flow: udp,reg0=0x8000,reg10=0x1000,reg11=0x2,reg12=0x3,reg13=0x5e,reg14=0x33d,metadata=0x2,in_port=278,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=5678,tp_dst=5678
>>> Megaflow: recirc_id=0,eth,ip,in_port=278,dl_src=fa:16:3e:66:0c:6e,nw_frag=no
>>> Datapath actions: drop
>>>
>>> When I see this issue on the mentioned IPs, I can see that the VM
>>> tries to resolve the default gateway MAC address, but with no
>>> success, since the ARP is dropped somewhere because of the resubmits
>>> and the drop action on the datapath. I guess now this is not related
>>> to ovn-controller. Meanwhile, DHCP works pretty well :)
>>
>> Glad to hear about DHCP.
>>
>>> The broadcast-arps-to-all-routers option is disabled on both
>>> provider network Logical Switches.
>>>
>>> Could there be any solution for this?
>>
>> It's not a solution, more of a change to confirm whether this is
>> causing issues, but could you please try with this commit?
>>
>> https://github.com/dceara/ovn/commit/fac23e2f6ef6effe3f9a2e0310e78d085750488b
>>
>> The commit disables flooding of ARP requests that target OVN-owned
>> router IPs to non-router ports.
It's not something that can be
>> accepted as is because it breaks some other things, e.g., OVN-generated
>> GARP requests will not be forwarded properly.
>>
>> The patch applies on top of the OVN main branch. If you want to try it
>> out on older branches, the easiest way is to just ignore the test
>> changes.
>>
>> Looking forward to hearing how it went.
>>
>> Thanks,
>> Dumitru
>>
>>> Regards,
>>> Ilia Baikov
>>>
>>> 26.03.2025 12:34, Ilia Baikov wrote:
>>>> Got my hands on this, back to debugging. Seems like the kernel runs
>>>> stable:
>>>> # uname -r
>>>> 6.14.0-061400-generic
>>>> Meanwhile, there are no unrecognized(27)-related logs.
>>>>
>>>> tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
>>>> 2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
>>>> 2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
>>>> 2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:50:22:c4 185.255.178.170
>>>> 2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1
>>>> 2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:18:ac:4a 89.169.15.224
>>>> 2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1
>>>>
>>>> And yes, there are logs about resubmit actions, which are expected
>>>> as you said.
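In case it helps correlate those rate-limited warnings with specific ports, here is a small helper (just an ad-hoc sketch of mine, not part of any OVS/OVN tooling) that pulls the (in_port, dl_src) pairs out of the "over 4096 resubmit" messages; the resulting ofport numbers can then be mapped to interfaces with e.g. `ovs-vsctl find Interface ofport=473`:

```python
import re

# Two of the rate-limited warnings quoted earlier in this thread
# (truncated for brevity).
LOG = """\
2025-04-01T01:04:48.435Z|00016|ofproto_dpif_xlate(handler58)|WARN|over 4096 resubmit actions on bridge br-int while processing arp,in_port=487,vlan_tci=0x0000,dl_src=fa:16:3e:de:2a:ce,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=185.255.178.119,arp_tpa=185.255.178.1,arp_op=1
2025-04-01T01:05:51.377Z|00018|ofproto_dpif_xlate(handler46)|WARN|over 4096 resubmit actions on bridge br-int while processing udp,in_port=473,vlan_tci=0x0000,dl_src=fa:16:3e:66:0c:6e,dl_dst=ff:ff:ff:ff:ff:ff,nw_src=83.217.210.206,nw_dst=255.255.255.255
"""

# Match the flow description that follows "while processing" and capture
# the ingress ofport and the source MAC.
RESUBMIT_RE = re.compile(
    r"over 4096 resubmit actions on bridge \S+ while processing "
    r"\w+,in_port=(\d+),\S*?dl_src=([0-9a-f:]+)")

def affected_flows(log_text):
    """Return sorted, de-duplicated (ofport, source MAC) pairs."""
    return sorted({(int(port), mac)
                   for port, mac in RESUBMIT_RE.findall(log_text)})

print(affected_flows(LOG))
# [(473, 'fa:16:3e:66:0c:6e'), (487, 'fa:16:3e:de:2a:ce')]
```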
>>>> ->reg15
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8 priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8 priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80 priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8)
>>>> OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43 priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x1->reg15,resubmit(,43
>>>>
>>>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
>>>> OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>> continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
>>>>
>>>> top - 09:29:33 up 16:58,  3 users,  load average: 37.70, 37.39, 36.95
>>>> Threads: 114 total,  13 running, 101 sleeping,   0 stopped,   0 zombie
>>>> %Cpu(s): 27.5 us, 11.2 sy,  0.0 ni, 60.3 id,  0.5 wa,  0.0 hi,  0.6 si,  0.0 st
>>>> MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used,  26351.4 buff/cache
>>>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345358.7 avail Mem
>>>>
>>>> # top -H -p $(pidof ovs-vswitchd)
>>>>  PID USER PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>> 6212 root 20  0 8388104 662016 7940 R 22.3  0.1 32:13.87 revalidator102
>>>> 6200 root 20  0 8388104 662016 7940 R 21.6  0.1 73:01.76 revalidator90
>>>> 6202 root 20  0 8388104 662016 7940 R 20.6  0.1 32:13.77 revalidator92
>>>> 6217 root 20  0 8388104 662016 7940 R 20.3  0.1 26:07.50 revalidator107
>>>> 6214 root 20  0 8388104 662016 7940 R 18.9  0.1 39:12.60 revalidator104
>>>> 6219 root 20  0 8388104 662016 7940 R 18.6  0.1 25:57.96 revalidator109
>>>> 6211 root 20  0 8388104 662016 7940 R 17.6  0.1 39:20.20 revalidator101
>>>> 6207 root 20  0 8388104 662016 7940 R 13.6  0.1 17:53.31 revalidator97
>>>> 6204 root 20  0 8388104 662016 7940 R 13.0  0.1 32:34.67 revalidator94
>>>> 6209 root 20  0 8388104 662016 7940 R 12.6  0.1 35:55.88 revalidator100
>>>> 6213 root 20  0 8388104 662016 7940 R 12.6  0.1 18:04.43 revalidator103
>>>> 6220 root 20  0 8388104 662016 7940 R 12.3  0.1  5:52.71 revalidator110
>>>> 6218 root 20  0 8388104 662016 7940 R 12.0  0.1  9:09.66 revalidator108
>>>> 6215 root 20  0 8388104 662016 7940 S  8.6  0.1 22:22.27 revalidator105
>>>> 6221 root 20  0 8388104 662016 7940 S  8.6  0.1  4:44.93 revalidator111
>>>> 6208 root 20  0 8388104 662016 7940 S  8.0  0.1 22:55.17 revalidator98
>>>> 6203 root 20  0 8388104 662016 7940 S  6.0  0.1 38:52.72 revalidator93
>>>> 6206 root 20  0 8388104 662016 7940 S  6.0  0.1 22:11.55 revalidator96
>>>> 6216 root 20  0 8388104 662016 7940 S  4.3  0.1 33:14.52 revalidator106
>>>>
>>>> Sometimes the revalidator threads drop to 5-10% CPU, sometimes to
>>>> 30%. I guess this behaviour is because of the resubmit actions? So
>>>> far the controller is feeling fine, but there could be some sort of
>>>> freezes when sending DHCPREPLY.
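To put a number on that, a tiny summary over the `top -H` output can help (again just an ad-hoc sketch; the two sample rows below are taken from the table above, and the column layout is the standard `top` one where %CPU is the ninth field):

```python
def revalidator_cpu(top_text):
    """Total and mean %CPU of revalidator threads in `top -H` output.

    Assumes the standard top thread-row layout:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    cpus = [float(line.split()[8])
            for line in top_text.splitlines()
            if line.split() and line.split()[-1].startswith("revalidator")]
    return sum(cpus), sum(cpus) / len(cpus)

# Two rows from the `top -H -p $(pidof ovs-vswitchd)` output above.
SAMPLE = """\
6212 root 20  0 8388104 662016 7940 R 22.3  0.1 32:13.87 revalidator102
6200 root 20  0 8388104 662016 7940 R 21.6  0.1 73:01.76 revalidator90
"""

total, mean = revalidator_cpu(SAMPLE)
print(round(total, 1), round(mean, 2))
# 43.9 21.95
```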
>>>>
>>>> top -H -p $(pidof ovn-controller)
>>>> top - 09:30:55 up 16:59,  3 users,  load average: 35.73, 36.70, 36.73
>>>> Threads:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
>>>> %Cpu(s): 27.7 us, 10.6 sy,  0.0 ni, 60.8 id,  0.3 wa,  0.0 hi,  0.6 si,  0.0 st
>>>> MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used,  26357.8 buff/cache
>>>> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345438.0 avail Mem
>>>>
>>>>  PID USER PR NI   VIRT   RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>> 5045 root 20  0 396500 83836 4432 S  0.7  0.0 11:46.59 ovn-controller
>>>> 5412 root 20  0 396500 83836 4432 S  0.0  0.0  0:06.23 ovn_pinctrl0
>>>> 5413 root 20  0 396500 83836 4432 S  0.0  0.0  0:00.00 urcu1
>>>> 5414 root 20  0 396500 83836 4432 S  0.0  0.0  0:00.20 ovn_statctrl2
>>>> 6103 root 20  0 396500 83836 4432 S  0.0  0.0  0:04.10 stopwatch3
>>>>
>>>> Trying to reproduce on a real-world environment. There are 300
>>>> instances running, with about 300 Mbps of network traffic in total.
>>>> Are there more logs or debug info I can provide?

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss