On 4/23/24 17:39, Gavin McKee wrote:
> If you look at both traces (non working and working) the thing that
> stands out to me is this
>
> At line 10 in the working file the following entry exists
> ct_state NEW tcp (SYN_SENT) orig [172.27.16.11.38793 >
> 172.27.31.189.9100] reply [172.27.31.189.9100 > 172.27.16.11.38793]
> zone 195
>
> This doesn't happen in the non working file - I just see the following
>
> 3904992932904126 [swapper/125] 0 [kr] queue_userspace_packet
> #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 4
> if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> (6) flags [S] seq 2266263186 win 42340
> upcall_enqueue (miss) (125/3904992932890052) q 3682874017 ret 0
> + 3904992932907247 [swapper/125] 0 [kr] ovs_dp_upcall
> #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 5
> if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> (6) flags [S] seq 2266263186 win 42340
> upcall_ret (125/3904992932890052) ret 0
>
> I am wondering if a failure to track the ct_state SYN is causing the
> returning ACK to drop ?
>
> + 3904992936344421 [swapper/125] 0 [tp] skb:kfree_skb
> #ddf9204d21c48ff1d0b676c330c00 (skb 18382861792850913792) n 3 drop
> (TC_INGRESS)
> if 33 (genev_sys_6081) rxif 33 172.27.31.189.9100 >
> 172.27.16.11.42303 ttl 64 tos 0x0 id 0 off 0 [DF] len 52 proto TCP (6)
> flags [S.] seq 605271182 ack 2266263187 win 42340
>
> On Mon, 22 Apr 2024 at 18:54, Gavin McKee <gavmcke...@googlemail.com> wrote:
>>
>> Ok @Adrian Moreno @Flavio Leitner
>>
>> Two more detailed Retis traces attached. One is not working - the
>> same session that I can't establish a TCP session to on port 9100
>> 172.27.16.11.42303 > 172.27.31.189.9100
>>
>> Then I restart Open vSwitch and try again
>> 172.27.16.11.38793 > 172.27.31.189.9100 (this works post restart)
>>
>> It looks to me in the non working example that we -
>> SEND SYN -> exits the tunnel interface genev_sys_6081 via
>> enp148s0f0np0 - exactly as expected
>> RECV ACK -> tcp_gro_receive then -> net:netif_receive_skb where we hit
>> drop (TC_INGRESS)
>>
>> After a restart things seem to be very different
>>
>> Any ideas where to look next ?

You mentioned that you're using 5.14.0-362.8.1.el9_3.x86 kernel.
RHEL 9.3 contains a large refactor for OVS connection tracking, but
it doesn't contain at least one fix for this refactor:

  https://github.com/torvalds/linux/commit/e6345d2824a3f58aab82428d11645e0da861ac13

This may cause all sorts of incorrect packet processing in the kernel.

I'd suggest trying the latest upstream v6.8.7 kernel that has all the
known fixes or try the 9.4 beta kernel that should have the fix
mentioned above. Downgrading to 9.2 may also be an option since 9.2
doesn't contain the refactoring, IIRC. I'd not recommend running with
this bug present anyway.

Updating your OVN 23.09.1 to 23.09.3 may also be worth trying.

The fact that OVS restart fixes the issue may also indicate a problem
with incremental processing in ovn-controller. Next time the issue
happens try to force flow recompute with

  ovn-appctl -t ovn-controller recompute

And see if that fixes the issue. If it does, it would be great to have
OpenFlow dumps before and after the recompute for comparison.

Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
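
For the before/after comparison, a raw diff of two OpenFlow dumps is
mostly noise because counters and timers change on every dump. Below is
a minimal sketch of one way to compare them, assuming both dumps were
saved as text with something like "ovs-ofctl dump-flows br-int" (br-int
being the usual OVN integration bridge; adjust if yours differs). The
function names and the list of stripped fields are my own choices for
illustration, not part of any OVS or OVN tool.

```python
import re

# Per-flow statistics that differ between any two dumps and would hide
# the real changes. This list is an assumption; extend it if needed.
VOLATILE = re.compile(
    r"\b(?:duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)


def normalize(line):
    """Strip counters and timers from one dump-flows line."""
    return VOLATILE.sub("", line).strip()


def load_flows(text):
    """Return the set of normalized flows found in a dump.

    Lines without a "table=" field (for example the reply header)
    are skipped.
    """
    flows = set()
    for line in text.splitlines():
        if "table=" not in line:
            continue
        flows.add(normalize(line))
    return flows


def diff_flows(before_text, after_text):
    """Return (removed, added) flows between two dumps, both sorted."""
    before = load_flows(before_text)
    after = load_flows(after_text)
    return sorted(before - after), sorted(after - before)


# Example usage (file names are hypothetical):
#   with open("before.txt") as b, open("after.txt") as a:
#       removed, added = diff_flows(b.read(), a.read())
#   print("\n".join("- " + f for f in removed))
#   print("\n".join("+ " + f for f in added))
```

Flows listed as removed existed only before the recompute, and flows
listed as added appeared only after it. If both lists are empty, the
recompute did not change the installed OpenFlow rules.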