Got my hands on this, back to debugging. The kernel seems to run stable:
# uname -r
6.14.0-061400-generic
Meanwhile, there are no unrecognized(27)-related logs.
tail -f /var/log/kolla/openvswitch/ovn-controller.log | grep -i "dhcp"
2025-03-26T09:23:08.086Z|38050|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:08.086Z|38052|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:11.084Z|38054|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:11.085Z|38056|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:26.606Z|38058|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:26.606Z|38060|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:27.704Z|38062|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:27.704Z|38064|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:28.383Z|38066|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:9c:f4:45 185.255.178.131
2025-03-26T09:23:28.383Z|38068|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xfb6fb11d| in-port=5| src-mac=fa:16:3e:9c:f4:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=0.0.0.0, dst-ip=255.255.255.255
2025-03-26T09:23:53.984Z|38070|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:50:22:c4 185.255.178.170
2025-03-26T09:23:53.984Z|38072|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=184| src-mac=fa:16:3e:50:22:c4, dst-mac=30:b6:4f:5f:db:a0| src-ip=185.255.178.170, dst-ip=185.255.178.1
2025-03-26T09:24:51.866Z|38074|pinctrl(ovn_pinctrl0)|INFO|DHCPACK fa:16:3e:18:ac:4a 89.169.15.224
2025-03-26T09:24:51.866Z|38076|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=0| OF_Cookie_ID=0xa020594| in-port=156| src-mac=fa:16:3e:18:ac:4a, dst-mac=30:b6:4f:5f:db:a0| src-ip=89.169.15.224, dst-ip=89.169.15.1
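
In the log above the same client MAC (fa:16:3e:9c:f4:45) gets several DHCPACKs within about 20 seconds, so a per-MAC count may help spot clients that keep retrying. A minimal sketch, assuming the log format shown above; the function name count_dhcpack is made up for illustration:

```shell
# Count DHCPACK log entries per client MAC address.
# Reads ovn-controller log text on stdin and prints "<count> <mac>",
# most frequent MAC first.
count_dhcpack() {
    grep -o 'DHCPACK [0-9a-f:]\{17\}' \
        | awk '{ n[$2]++ } END { for (m in n) print n[m], m }' \
        | sort -k1,1nr -k2,2
}

# Usage:
# count_dhcpack < /var/log/kolla/openvswitch/ovn-controller.log
```

A MAC with a high count over a short window would be a candidate for a closer look with a packet capture on the instance port.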

And yes, there are logs about resubmit actions, which are expected, as you said.
->reg15
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
OFPT_FLOW_MOD (OF1.5) (xid=0x235760): ADD table:8 priority=110,icmp6,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=2,icmp_code=0 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
OFPT_FLOW_MOD (OF1.5) (xid=0x235761): ADD table:8 priority=110,icmp,reg10=0x10000/0x10000,reg15=0x28a,metadata=0x2,dl_src=fa:16:3e:48:61:ad,icmp_type=3,icmp_code=4 cookie:0xd1588295 actions=push:NXM_NX_REG14[],push:NXM_NX_REG15[],pop:NXM_NX_REG14[],pop:NXM_NX_REG15[],resubmit(,9)
OFPT_FLOW_MOD (OF1.5) (xid=0x23577b): ADD table:80 priority=100,reg14=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x9->reg11,set_field:0xa->reg12,resubmit(,8)
OFPT_FLOW_MOD (OF1.5) (xid=0x23577c): ADD table:43 priority=100,reg15=0x28a,metadata=0x2 cookie:0xa17d67d7 actions=set_field:0x1->reg15,resubmit(,43

OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
OFPT_PACKET_OUT (OF1.5) (xid=0x235622): in_port=CONTROLLER actions=set_field:0x2->metadata,set_field:0x503->reg14,resubmit(CONTROLLER,8) data_len=42
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)
 continuation.actions=unroll_xlate(table=0, cookie=0),resubmit(,32)

top - 09:29:33 up 16:58,  3 users,  load average: 37.70, 37.39, 36.95
Threads: 114 total,  13 running, 101 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.5 us, 11.2 sy,  0.0 ni, 60.3 id,  0.5 wa,  0.0 hi, 0.6 si,  0.0 st
MiB Mem : 773901.8 total, 323685.3 free, 423865.0 used,  26351.4 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345358.7 avail Mem

#top -H -p $(pidof ovs-vswitchd)
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   6212 root      20   0 8388104 662016   7940 R  22.3   0.1  32:13.87 revalidator102
   6200 root      20   0 8388104 662016   7940 R  21.6   0.1  73:01.76 revalidator90
   6202 root      20   0 8388104 662016   7940 R  20.6   0.1  32:13.77 revalidator92
   6217 root      20   0 8388104 662016   7940 R  20.3   0.1  26:07.50 revalidator107
   6214 root      20   0 8388104 662016   7940 R  18.9   0.1  39:12.60 revalidator104
   6219 root      20   0 8388104 662016   7940 R  18.6   0.1  25:57.96 revalidator109
   6211 root      20   0 8388104 662016   7940 R  17.6   0.1  39:20.20 revalidator101
   6207 root      20   0 8388104 662016   7940 R  13.6   0.1  17:53.31 revalidator97
   6204 root      20   0 8388104 662016   7940 R  13.0   0.1  32:34.67 revalidator94
   6209 root      20   0 8388104 662016   7940 R  12.6   0.1  35:55.88 revalidator100
   6213 root      20   0 8388104 662016   7940 R  12.6   0.1  18:04.43 revalidator103
   6220 root      20   0 8388104 662016   7940 R  12.3   0.1   5:52.71 revalidator110
   6218 root      20   0 8388104 662016   7940 R  12.0   0.1   9:09.66 revalidator108
   6215 root      20   0 8388104 662016   7940 S   8.6   0.1  22:22.27 revalidator105
   6221 root      20   0 8388104 662016   7940 S   8.6   0.1   4:44.93 revalidator111
   6208 root      20   0 8388104 662016   7940 S   8.0   0.1  22:55.17 revalidator98
   6203 root      20   0 8388104 662016   7940 S   6.0   0.1  38:52.72 revalidator93
   6206 root      20   0 8388104 662016   7940 S   6.0   0.1  22:11.55 revalidator96
   6216 root      20   0 8388104 662016   7940 S   4.3   0.1  33:14.52 revalidator106

Sometimes the revalidator threads drop to 5-10% CPU, sometimes they go up to 30%. I guess this behaviour is caused by the resubmit actions? So far so good: the controller is doing fine, but there could still be some sort of freezes when sending DHCP replies.
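
To follow how the revalidator load fluctuates over time, the per-thread figures can be summed per snapshot. A rough sketch, assuming top's default column layout as in the output above (%CPU in field 9, thread name in the last field); the function name sum_revalidator_cpu is made up for illustration:

```shell
# Sum the %CPU column over all revalidator threads.
# Reads "top -H -b" output on stdin; assumes the default column layout
# (%CPU is field 9, thread name is the last field).
sum_revalidator_cpu() {
    awk '$NF ~ /^revalidator[0-9]+$/ { cpu += $9; n++ }
         END { printf "%d threads, %.1f%% CPU\n", n, cpu }'
}

# Usage (one batch-mode snapshot of the ovs-vswitchd threads):
# LC_ALL=C top -H -b -n 1 -p "$(pidof ovs-vswitchd)" | sum_revalidator_cpu
```

Running it in a loop with a sleep in between gives a simple time series to compare against the moments when DHCP replies appear to stall.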

top -H -p $(pidof ovn-controller)
top - 09:30:55 up 16:59,  3 users,  load average: 35.73, 36.70, 36.73
Threads:   5 total,   0 running,   5 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.7 us, 10.6 sy,  0.0 ni, 60.8 id,  0.3 wa,  0.0 hi, 0.6 si,  0.0 st
MiB Mem : 773901.8 total, 323758.4 free, 423785.6 used,  26357.8 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 345438.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   5045 root      20   0  396500  83836   4432 S   0.7   0.0  11:46.59 ovn-controller
   5412 root      20   0  396500  83836   4432 S   0.0   0.0   0:06.23 ovn_pinctrl0
   5413 root      20   0  396500  83836   4432 S   0.0   0.0   0:00.00 urcu1
   5414 root      20   0  396500  83836   4432 S   0.0   0.0   0:00.20 ovn_statctrl2
   6103 root      20   0  396500  83836   4432 S   0.0   0.0   0:04.10 stopwatch3


Trying to reproduce in a real-world environment. There are 300 instances running with about 300 Mbps of network traffic in total.
Are there more logs or debug output I can provide?

--
Regards,

Ilia Baikov
ilia.baikov@ib.systems

23.03.2025 11:15, Dumitru Ceara wrote:
On 3/22/25 1:59 AM, Ilia Baikov wrote:
Wow, didn't think that resolving this would require patching a kernel
module. Really impressive.
I've previously compiled and deployed a forked version of ovn-controller
with your commit revert. Is it a good idea to test once a stable kernel
revision containing this patch is available (likely in the coming days
in 6.14; at least 6.14-rc7 seems to contain this patch)?

Indeed, 6.14-rc7 contains Ilya's fix.  It's probably OK to test with it.

Regards,
Dumitru

20.03.2025 12:28, Dumitru Ceara wrote:
On 3/19/25 8:42 PM, Ilia Baikov wrote:
Hello,
Hi Ilia, Piotr,

Nice to hear that it is resolved for you. I got advice from a friend
about reducing ARP pps on Juniper devices (ARP is sent only when a
unicast packet arrives or on expiry), which reduced ARP pps by about 2 to
3 times. It helped a bit, but the issue still persists.

@Dumitru Ceara <dce...@redhat.com>
@Ales Musil <amu...@redhat.com>
Any chance this could be fixed in an upcoming release or backported to
previous ones? I could help to speed it up.

I think we're waiting for Ilya Maximets' kernel fix here to trickle down
to stable kernels:

https://lore.kernel.org/all/20250308004609.2881861-1-i.maxim...@ovn.org/

Once that happens I'm guessing we can revert 325c7b203d8b ("controller:
split mg action in table 39 and 40 to fit kernel netlink buffer size").

https://github.com/ovn-org/ovn/commit/325c7b2

We'll still have the 4K resubmit limit problem for BUM packets but
other_config:broadcast-arps-to-all-routers=false should alleviate some
of that (at least for ARPs/NDs).

Hi,

Just as a follow up, after setting the
other_config:broadcast-arps-to-all-routers=false parameter on all external
networks, number of packets sent to the ovn-controller went down from
around 120-130 p/s to 1-3/s.
Most of our ports in external networks are router ports, so this parameter
helps a lot.
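
For reference, this option is set per logical switch in the OVN northbound database. A dry-run sketch that only prints the commands; the switch name in the usage line is a placeholder (in Neutron-based deployments logical switches are typically named neutron-<network UUID>, but that is an assumption to verify with ovn-nbctl ls-list), and the function name is made up for illustration:

```shell
# Print (dry run) the ovn-nbctl command that disables ARP/ND broadcast
# to all routers for each logical switch name given as an argument.
# Remove the leading "echo" to actually apply the change.
set_no_arp_broadcast() {
    for ls in "$@"; do
        echo ovn-nbctl set Logical_Switch "$ls" \
            other_config:broadcast-arps-to-all-routers=false
    done
}

# Usage (placeholder switch name):
# set_no_arp_broadcast neutron-<network-uuid>
```

Keeping the echo in place makes it easy to review the exact commands before touching the northbound database.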

Thanks
Piotr Misiak


Mon, 17 Feb 2025, 16:37 Piotr Misiak <piotrmisiak1...@gmail.com> wrote:
Regards,
Dumitru

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
