Thanks for coming back to me on this. Moving kernel versions around is not a straightforward option here, especially when you are using hardware offload. The OFED driver version is coupled to the kernel, so if we move away from that we are out of support coverage.
Running ovn-appctl -t ovn-controller recompute does not resolve the problem; again, only taking a big hammer like restarting openvswitch does.

How would we proceed here? Are there any Open vSwitch kernel module patches we could try to get a resolution? One option we are looking at is regressing the entire stack back to Rocky 9.1.

Gav

On Wed, 24 Apr 2024 at 04:44, Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 4/23/24 17:39, Gavin McKee wrote:
> > If you look at both traces (non working and working) the thing that
> > stands out to me is this
> >
> > At line 10 in the working file the following entry exists
> > ct_state NEW tcp (SYN_SENT)
> > orig [172.27.16.11.38793 > 172.27.31.189.9100]
> > reply [172.27.31.189.9100 > 172.27.16.11.38793]
> > zone 195
> >
> > This doesn't happen in the non working file - I just see the following
> >
> > 3904992932904126 [swapper/125] 0 [kr] queue_userspace_packet
> > #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 4
> > if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 > 172.27.31.189.9100
> > ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP (6)
> > flags [S] seq 2266263186 win 42340
> > upcall_enqueue (miss) (125/3904992932890052) q 3682874017 ret 0
> > + 3904992932907247 [swapper/125] 0 [kr] ovs_dp_upcall
> > #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 5
> > if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 > 172.27.31.189.9100
> > ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP (6)
> > flags [S] seq 2266263186 win 42340
> > upcall_ret (125/3904992932890052) ret 0
> >
> > I am wondering if a failure to track the ct_state SYN is causing the
> > returning ACK to drop ?
> >
> > + 3904992936344421 [swapper/125] 0 [tp] skb:kfree_skb
> > #ddf9204d21c48ff1d0b676c330c00 (skb 18382861792850913792) n 3 drop
> > (TC_INGRESS)
> > if 33 (genev_sys_6081) rxif 33 172.27.31.189.9100 > 172.27.16.11.42303
> > ttl 64 tos 0x0 id 0 off 0 [DF] len 52 proto TCP (6)
> > flags [S.] seq 605271182 ack 2266263187 win 42340
> >
> > On Mon, 22 Apr 2024 at 18:54, Gavin McKee <gavmcke...@googlemail.com> wrote:
> >>
> >> Ok @Adrian Moreno @Flavio Leitner
> >>
> >> Two more detailed Retis traces attached. One is not working - the
> >> same session that I can't establish a TCP session to on port 9100
> >> 172.27.16.11.42303 > 172.27.31.189.9100
> >>
> >> Then I restart Open vSwitch and try again
> >> 172.27.16.11.38793 > 172.27.31.189.9100 (this works post restart)
> >>
> >> It looks to me in the non working example that we -
> >> SEND SYN -> exits the tunnel interface genev_sys_6081 via
> >> enp148s0f0np0 - exactly as expected
> >> RECV ACK -> tcp_gro_receive then -> net:netif_receive_skb where we hit
> >> drop (TC_INGRESS)
> >>
> >> After a restart things seem to be very different
> >>
> >> Any ideas where to look next ?
>
> You mentioned that you're using 5.14.0-362.8.1.el9_3.x86 kernel.
> RHEL 9.3 contains a large refactor for OVS connection tracking,
> but it doesn't contain at least one fix for this refactor:
>
>   https://github.com/torvalds/linux/commit/e6345d2824a3f58aab82428d11645e0da861ac13
>
> This may cause all sorts of incorrect packet processing in the
> kernel. I'd suggest trying the latest upstream v6.8.7 kernel
> that has all the known fixes or try the 9.4 beta kernel that
> should have the fix mentioned above. Downgrading to 9.2 may also
> be an option since 9.2 doesn't contain the refactoring, IIRC.
>
> I'd not recommend running with this bug present anyway.
>
> Updating your OVN 23.09.1 to 23.09.3 may also be worth trying.
> The fact that OVS restart fixes the issue may also indicate a
> problem with incremental processing in ovn-controller.
> Next time the issue happens try to force flow recompute with
>   ovn-appctl -t ovn-controller recompute
> And see if that fixes the issue. If it does, it would be great
> to have OpenFlow dumps before and after the recompute for
> comparison.
>
> Best regards, Ilya Maximets.
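
For anyone who hits this and finds that a recompute does help (it did not in this case), the before/after OpenFlow comparison suggested above could be done roughly as follows. This is only a minimal sketch, not an OVS or OVN tool: the function names are hypothetical, and it assumes the dumps are the default text output of ovs-ofctl dump-flows, where each flow line carries per-flow counters and timers (duration, n_packets, n_bytes, idle_age, hard_age) that change between dumps and would otherwise hide the real differences.

```python
import re
from typing import List, Tuple

# Counters and timers differ on every dump; strip them so that only the
# match/action part of each flow is compared.
# Assumption: default "ovs-ofctl dump-flows" text format.
VOLATILE = re.compile(
    r"\b(duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)


def normalize(dump: str) -> List[str]:
    """Return the flow lines of a dump with volatile fields removed."""
    flows = []
    for line in dump.splitlines():
        line = line.strip()
        # Skip blank lines and the "... reply (xid=0x..):" header line.
        if not line or line.endswith(":"):
            continue
        flows.append(VOLATILE.sub("", line))
    return flows


def diff_flows(before: str, after: str) -> Tuple[List[str], List[str]]:
    """Return (removed, added) flows between two dumps, each sorted."""
    old, new = set(normalize(before)), set(normalize(after))
    return sorted(old - new), sorted(new - old)
```

The two inputs would be the saved text of a dump taken before the recompute and one taken after it, for example from ovs-ofctl dump-flows br-int (br-int being the usual OVN integration bridge name, which is an assumption about this setup). Cookies are kept on purpose, since they help map a flow back to its OVN logical flow.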
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss