I got my hands on this and am back to debugging. The kernel seems to run stably:
# uname -r
6.14.0-061400-generic
Meanwhile, there are no unrecognized(27)-related logs.
tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:50:22:c4 185.255.178.170
2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1
2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:18:ac:4a 89.169.15.224
2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1
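To make repeated DHCPACKs for the same client easier to spot in a longer log, something like the rough Python sketch below could be used. The function names, the 30-second window and the threshold of 3 are my own assumptions for illustration, not anything defined by OVN; the sample only reuses the INFO lines quoted above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the INFO record that the pinctrl thread logs for a DHCPACK.
# \s+ also spans a line break, so mail-wrapped log lines still parse.
DHCPACK_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|\d+\|"
    r"pinctrl\([^)]*\)\|INFO\|DHCPACK\s+"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(?P<ip>\d+\.\d+\.\d+\.\d+)"
)


def dhcpack_times(log_text):
    """Return {mac: [timestamps]} for every DHCPACK record in log_text."""
    acks = defaultdict(list)
    for m in DHCPACK_RE.finditer(log_text):
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        acks[m.group("mac")].append(ts)
    return dict(acks)


def repeated_acks(acks, window_s=30.0, min_count=3):
    """Return {mac: n} for MACs with at least min_count ACKs within window_s.

    window_s and min_count are arbitrary illustrative thresholds.
    """
    flagged = {}
    for mac, times in acks.items():
        times = sorted(times)
        for i, start in enumerate(times):
            # Count ACKs falling inside the window that starts at times[i].
            n = sum(1 for t in times[i:] if (t - start).total_seconds() <= window_s)
            if n >= min_count:
                flagged[mac] = max(flagged.get(mac, 0), n)
    return flagged


SAMPLE = """\
2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:50:22:c4 185.255.178.170
2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK
fa:16:3e:18:ac:4a 89.169.15.224
"""

if __name__ == "__main__":
    for mac, count in repeated_acks(dhcpack_times(SAMPLE)).items():
        print(mac, count)
```

A MAC being flagged only shows that the same client was answered several times in a short period; on its own it does not say whether the replies were delayed or lost.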
And yes, there are logs about resubmit actions, which are expected, as you
said.
->reg15
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8 priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8 priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80 priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8)
OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43 priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x1->reg15,resubmit(,43
OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
top - 09:29:33 up 16:58, 3 users, load average: 37.70, 37.39, 36.95
Threads: 114 total, 13 running, 101 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.5 us, 11.2 sy, 0.0 ni, 60.3 id, 0.5 wa, 0.0 hi, 0.6 si, 0.0 st
MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used, 26351.4 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 345358.7 avail Mem
#top -H -p $(pidof ovs-vswitchd)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6212 root 20 0 8388104 662016 7940 R 22.3 0.1 32:13.87 revalidator102
6200 root 20 0 8388104 662016 7940 R 21.6 0.1 73:01.76 revalidator90
6202 root 20 0 8388104 662016 7940 R 20.6 0.1 32:13.77 revalidator92
6217 root 20 0 8388104 662016 7940 R 20.3 0.1 26:07.50 revalidator107
6214 root 20 0 8388104 662016 7940 R 18.9 0.1 39:12.60 revalidator104
6219 root 20 0 8388104 662016 7940 R 18.6 0.1 25:57.96 revalidator109
6211 root 20 0 8388104 662016 7940 R 17.6 0.1 39:20.20 revalidator101
6207 root 20 0 8388104 662016 7940 R 13.6 0.1 17:53.31 revalidator97
6204 root 20 0 8388104 662016 7940 R 13.0 0.1 32:34.67 revalidator94
6209 root 20 0 8388104 662016 7940 R 12.6 0.1 35:55.88 revalidator100
6213 root 20 0 8388104 662016 7940 R 12.6 0.1 18:04.43 revalidator103
6220 root 20 0 8388104 662016 7940 R 12.3 0.1 5:52.71 revalidator110
6218 root 20 0 8388104 662016 7940 R 12.0 0.1 9:09.66 revalidator108
6215 root 20 0 8388104 662016 7940 S 8.6 0.1 22:22.27 revalidator105
6221 root 20 0 8388104 662016 7940 S 8.6 0.1 4:44.93 revalidator111
6208 root 20 0 8388104 662016 7940 S 8.0 0.1 22:55.17 revalidator98
6203 root 20 0 8388104 662016 7940 S 6.0 0.1 38:52.72 revalidator93
6206 root 20 0 8388104 662016 7940 S 6.0 0.1 22:11.55 revalidator96
6216 root 20 0 8388104 662016 7940 S 4.3 0.1 33:14.52 revalidator106
Sometimes the CPU usage of the revalidator threads drops to 5-10%,
sometimes it is at 30%. I guess this behaviour is caused by the resubmit
actions? So far so good: the controller seems fine, but there could be
some sort of freezes when sending DHCP replies.
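To compare snapshots like the one above over time, the per-thread rows can be summed up. This is only a rough Python sketch: the function name and the returned keys are made up for illustration, and it assumes the default `top -H` column layout shown above (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND).

```python
import re

# One thread row of `top -H` in the default column layout.
# \s+ also matches a line break, so rows whose COMMAND was wrapped onto the
# next line (as in a mail-quoted paste) are still recognised.
ROW_RE = re.compile(
    r"^\s*(?P<pid>\d+)(?:\s+\S+){6}\s+(?P<state>[A-Z])\s+"
    r"(?P<cpu>\d+\.\d+)\s+\d+\.\d+\s+\S+\s+(?P<name>\S+)",
    re.MULTILINE,
)


def summarize_threads(top_text, prefix="revalidator"):
    """Summarise the threads whose name starts with prefix."""
    rows = [m for m in ROW_RE.finditer(top_text)
            if m.group("name").startswith(prefix)]
    return {
        "threads": len(rows),
        "running": sum(1 for m in rows if m.group("state") == "R"),
        "cpu_total": round(sum(float(m.group("cpu")) for m in rows), 1),
    }


SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6212 root 20 0 8388104 662016 7940 R 22.3 0.1 32:13.87
revalidator102
6200 root 20 0 8388104 662016 7940 R 21.6 0.1 73:01.76
revalidator90
6215 root 20 0 8388104 662016 7940 S 8.6 0.1 22:22.27
revalidator105
"""

if __name__ == "__main__":
    print(summarize_threads(SAMPLE))
```

The summary only aggregates what `top` reports; it does not tell whether the load comes from the resubmit actions or from something else.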
top -H -p $(pidof ovn-controller)
top - 09:30:55 up 16:59, 3 users, load average: 35.73, 36.70, 36.73
Threads: 5 total, 0 running, 5 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.7 us, 10.6 sy, 0.0 ni, 60.8 id, 0.3 wa, 0.0 hi, 0.6 si, 0.0 st
MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used, 26357.8 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 345438.0 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5045 root 20 0 396500 83836 4432 S 0.7 0.0 11:46.59 ovn-controller
5412 root 20 0 396500 83836 4432 S 0.0 0.0 0:06.23 ovn_pinctrl0
5413 root 20 0 396500 83836 4432 S 0.0 0.0 0:00.00 urcu1
5414 root 20 0 396500 83836 4432 S 0.0 0.0 0:00.20 ovn_statctrl2
6103 root 20 0 396500 83836 4432 S 0.0 0.0 0:04.10 stopwatch3
I am trying to reproduce this in a real-world environment. There are 300
instances running with about 300 Mbps of network traffic in total.
Are there more logs or debug output I can provide?
--
Regards,
Ilia Baikov
ilia.baikov@ib.systems
23.03.2025 11:15, Dumitru Ceara wrote:
On 3/22/25 1:59 AM, Ilia Baikov wrote:
Wow, I didn't think that resolving this would require patching a kernel
module. Really impressive.
I've previously compiled and deployed the forked version of
ovn-controller in which you reverted the commit. Is it a good idea to
test once a stable kernel revision containing this patch is available
(likely in the coming days in 6.14; at least 6.14-rc7 seems to contain
this patch)?
Indeed, 6.14-rc7 contains Ilya's fix. It's probably OK to test with it.
Regards,
Dumitru
20.03.2025 12:28, Dumitru Ceara wrote:
On 3/19/25 8:42 PM, Ilia Baikov wrote:
Hello,
Hi Ilia, Piotr,
Nice to hear that it is resolved for you. I got advice from a friend
about reducing ARP pps on Juniper devices (an ARP request is sent only
when a unicast packet arrives or on expiry), which reduced ARP pps by
about 2 to 3 times. It helped a bit; however, the issue still persists.
@Dumitru Ceara <dce...@redhat.com>
@Ales Musil <amu...@redhat.com>
Any chance this could be fixed in an upcoming release or backported to
previous ones? I could help to speed it up somehow.
I think we're waiting for Ilya Maximets' kernel fix here to trickle down
to stable kernels:
https://lore.kernel.org/all/20250308004609.2881861-1-i.maxim...@ovn.org/
Once that happens I'm guessing we can revert 325c7b203d8b ("controller:
split mg action in table 39 and 40 to fit kernel netlink buffer size").
https://github.com/ovn-org/ovn/commit/325c7b2
We'll still have the 4K resubmit limit problem for BUM packets but
other_config:broadcast-arps-to-all-routers=false should alleviate some
of that (at least for ARPs/NDs).
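As a rough illustration of applying that option, the Python sketch below only builds and prints the `ovn-nbctl set` commands; it does not run them. It assumes the option is set directly on the Logical_Switch rows in the OVN Northbound database, and the switch name and the helper function are hypothetical placeholders.

```python
import shlex

# Option named above; it lives in the other_config column of Logical_Switch.
OPTION = "other_config:broadcast-arps-to-all-routers=false"


def nbctl_set_commands(switch_names):
    """Build one `ovn-nbctl set` argv per logical switch (nothing is executed)."""
    return [["ovn-nbctl", "set", "Logical_Switch", name, OPTION]
            for name in switch_names]


if __name__ == "__main__":
    # "EXAMPLE-EXT-NET" is a placeholder; use your own external network switches.
    for argv in nbctl_set_commands(["EXAMPLE-EXT-NET"]):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review which switches would be touched before running anything against the Northbound database.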
Hi,
Just as a follow-up: after setting the
other_config:broadcast-arps-to-all-routers=false parameter on all
external networks, the number of packets sent to ovn-controller went
down from around 120-130 p/s to 1-3 p/s.
Most of our ports in external networks are router ports, so this
parameter helps a lot.
Thanks
Piotr Misiak
On Mon, 17 Feb 2025, 16:37, Piotr Misiak <piotrmisiak1...@gmail.com>
wrote:
Regards,
Dumitru