On 4/26/24 20:12, Gavin McKee wrote:
> Thanks for coming back to me on this.
>
> Moving kernel versions around is not a straightforward option here -
> especially when you are using hardware offload. The OFED driver
> version is coupled to the kernel, so if we move from that we are out of
> support coverage.
>
> Doing an ovn-appctl -t ovn-controller recompute does not resolve the
> problem; again, only taking a big hammer like restarting openvswitch
> does.
>
> How would we proceed here? Are there any Open vSwitch kernel module
> patches we could try to get a resolution?
You can try the commit I linked below. That means you'll need to
re-build your kernel; there is no other way. In the past we had an
out-of-tree module, but it has been deprecated for a long time, contains
multiple issues, and is unlikely to work with new kernels, especially
heavily modified ones like RHEL kernels.

Note that the issue is not localized to OVS but affects TC as well,
since they now share the NAT implementation. So, even if just swapping
the openvswitch kernel module were possible, it wouldn't help much.

>
> One option we are looking at is regressing the entire stack back to
> Rocky 9.1.

This may be an option. The bug I mentioned in a previous email exists
in RHEL 9.3, so it exists in Rocky 9.3 as well; at least it should,
since they claim "bug-for-bug compatibility".

So, 3 options to fix this particular bug (I don't know if it is causing
your issue, but it is a severe bug that can potentially be a cause):

1. Re-build the kernel to include the fix.
2. Downgrade from 9.3 to an earlier release.
3. Wait for 9.4.

Best regards, Ilya Maximets.
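[Editor's sketch] To help triage which hosts are exposed before choosing among those options, a minimal version check is sketched below. It assumes (this is not stated in the thread) that RHEL/Rocky 9.4 kernels are the 5.14.0-427.* series while 9.3 kernels are 5.14.0-362.*; verify that against the actual 9.4 release before relying on it.

```shell
#!/bin/sh
# Hedged sketch: guess whether a RHEL/Rocky 9 kernel release string predates
# the 9.4 series that should carry the conntrack fix discussed in this thread.
# Assumption: 9.4 kernels are 5.14.0-427.*; 9.3 kernels are 5.14.0-362.*.

lacks_ct_fix() {
    case "$1" in
        5.14.0-*) ;;       # looks like a RHEL/Rocky 9 kernel
        *) return 1 ;;     # anything else: cannot tell, assume not affected
    esac
    # Extract the build number between "5.14.0-" and the following dot.
    build=$(printf '%s\n' "$1" | sed -E 's/^5\.14\.0-([0-9]+)\..*/\1/')
    [ "$build" -lt 427 ]
}

# Usage on a host:
#   lacks_ct_fix "$(uname -r)" && echo "kernel likely lacks the fix"
```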
>
> Gav
>
> On Wed, 24 Apr 2024 at 04:44, Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 4/23/24 17:39, Gavin McKee wrote:
>>> If you look at both traces (non-working and working), the thing that
>>> stands out to me is this:
>>>
>>> At line 10 in the working file the following entry exists:
>>> ct_state NEW tcp (SYN_SENT) orig [172.27.16.11.38793 >
>>> 172.27.31.189.9100] reply [172.27.31.189.9100 > 172.27.16.11.38793]
>>> zone 195
>>>
>>> This doesn't happen in the non-working file - I just see the following:
>>>
>>> 3904992932904126 [swapper/125] 0 [kr] queue_userspace_packet
>>>   #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 4
>>>   if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
>>>   172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
>>>   (6) flags [S] seq 2266263186 win 42340
>>>   upcall_enqueue (miss) (125/3904992932890052) q 3682874017 ret 0
>>> + 3904992932907247 [swapper/125] 0 [kr] ovs_dp_upcall
>>>   #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 5
>>>   if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
>>>   172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
>>>   (6) flags [S] seq 2266263186 win 42340
>>>   upcall_ret (125/3904992932890052) ret 0
>>>
>>> I am wondering if a failure to track the ct_state SYN is causing the
>>> returning ACK to drop?
>>>
>>> + 3904992936344421 [swapper/125] 0 [tp] skb:kfree_skb
>>>   #ddf9204d21c48ff1d0b676c330c00 (skb 18382861792850913792) n 3 drop
>>>   (TC_INGRESS)
>>>   if 33 (genev_sys_6081) rxif 33 172.27.31.189.9100 >
>>>   172.27.16.11.42303 ttl 64 tos 0x0 id 0 off 0 [DF] len 52 proto TCP (6)
>>>   flags [S.] seq 605271182 ack 2266263187 win 42340
>>>
>>> On Mon, 22 Apr 2024 at 18:54, Gavin McKee <gavmcke...@googlemail.com> wrote:
>>>>
>>>> Ok @Adrian Moreno @Flavio Leitner
>>>>
>>>> Two more detailed Retis traces attached.
>>>> One is not working - the
>>>> same session that I can't establish a TCP session to on port 9100:
>>>> 172.27.16.11.42303 > 172.27.31.189.9100
>>>>
>>>> Then I restart Open vSwitch and try again:
>>>> 172.27.16.11.38793 > 172.27.31.189.9100 (this works post restart)
>>>>
>>>> It looks to me in the non-working example that we -
>>>> SEND SYN -> exits the tunnel interface genev_sys_6081 via
>>>> enp148s0f0np0 - exactly as expected
>>>> RECV ACK -> tcp_gro_receive then -> net:netif_receive_skb where we hit
>>>> drop (TC_INGRESS)
>>>>
>>>> After a restart things seem to be very different.
>>>>
>>>> Any ideas where to look next?
>>
>> You mentioned that you're using the 5.14.0-362.8.1.el9_3.x86 kernel.
>> RHEL 9.3 contains a large refactor of OVS connection tracking,
>> but it doesn't contain at least one fix for this refactor:
>>
>> https://github.com/torvalds/linux/commit/e6345d2824a3f58aab82428d11645e0da861ac13
>>
>> This may cause all sorts of incorrect packet processing in the
>> kernel. I'd suggest trying the latest upstream v6.8.7 kernel,
>> which has all the known fixes, or the 9.4 beta kernel, which
>> should have the fix mentioned above. Downgrading to 9.2 may also
>> be an option, since 9.2 doesn't contain the refactoring, IIRC.
>>
>> I'd not recommend running with this bug present anyway.
>>
>> Updating your OVN 23.09.1 to 23.09.3 may also be worth trying.
>> The fact that an OVS restart fixes the issue may also indicate a
>> problem with incremental processing in ovn-controller.
>> Next time the issue happens, try to force a flow recompute with:
>>
>>   ovn-appctl -t ovn-controller recompute
>>
>> and see if that fixes the issue. If it does, it would be great
>> to have OpenFlow dumps from before and after the recompute for
>> comparison.
>>
>> Best regards, Ilya Maximets.
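[Editor's sketch] The dump-and-compare step suggested in the thread could look roughly like the following. The bridge name br-int, the output file names, and the sed field list are assumptions (counters and durations in ovs-ofctl output make a raw diff noisy), so adjust them to the actual deployment.

```shell
#!/bin/sh
# Hedged sketch: compare OpenFlow tables before and after forcing an
# ovn-controller recompute.  Assumes the integration bridge is "br-int".

normalize() {
    # Strip per-flow fields that change on every dump (counters, ages),
    # so that diff only reports structural flow-table changes.
    sed -E 's/(duration|n_packets|n_bytes|idle_age|hard_age)=[^,]+, ?//g' "$1"
}

# On the affected host (shown for context, not executed here):
#   ovs-ofctl dump-flows br-int > flows-before.txt
#   ovn-appctl -t ovn-controller recompute
#   ovs-ofctl dump-flows br-int > flows-after.txt
#   normalize flows-before.txt > before.norm
#   normalize flows-after.txt  > after.norm
#   diff -u before.norm after.norm
```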