Thanks for coming back to me on this.

Moving kernel versions around is not a straightforward option here,
especially when you are using hardware offload. The OFED driver
version is coupled to the kernel, so if we move away from that we are
out of support coverage.

Running ovn-appctl -t ovn-controller recompute does not resolve the
problem; again, only taking a big hammer like restarting openvswitch
does.

How would we proceed here? Are there any Open vSwitch kernel module
patches we could try to get a resolution?

One option we are looking at is rolling the entire stack back to Rocky 9.1.

Gav

On Wed, 24 Apr 2024 at 04:44, Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 4/23/24 17:39, Gavin McKee wrote:
> > If you look at both traces (non working and working) the thing that
> > stands out to me is this
> >
> > At line 10 in the working file the following entry exists
> >     ct_state NEW tcp (SYN_SENT) orig [172.27.16.11.38793 >
> > 172.27.31.189.9100] reply [172.27.31.189.9100 > 172.27.16.11.38793]
> > zone 195
> >
> > This doesn't happen in the non-working file - I just see the following
> >
> > 3904992932904126 [swapper/125] 0 [kr] queue_userspace_packet
> > #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 4
> >     if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> > 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> > (6) flags [S] seq 2266263186 win 42340
> >     upcall_enqueue (miss) (125/3904992932890052) q 3682874017 ret 0
> >   + 3904992932907247 [swapper/125] 0 [kr] ovs_dp_upcall
> > #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 5
> >     if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> > 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> > (6) flags [S] seq 2266263186 win 42340
> >     upcall_ret (125/3904992932890052) ret 0
> >
> > I am wondering if a failure to track the ct_state for the SYN is
> > causing the returning SYN-ACK to be dropped?
> >
> >   + 3904992936344421 [swapper/125] 0 [tp] skb:kfree_skb
> > #ddf9204d21c48ff1d0b676c330c00 (skb 18382861792850913792) n 3 drop
> > (TC_INGRESS)
> >     if 33 (genev_sys_6081) rxif 33 172.27.31.189.9100 >
> > 172.27.16.11.42303 ttl 64 tos 0x0 id 0 off 0 [DF] len 52 proto TCP (6)
> > flags [S.] seq 605271182 ack 2266263187 win 42340
> >
> > On Mon, 22 Apr 2024 at 18:54, Gavin McKee <gavmcke...@googlemail.com> wrote:
> >>
> >> Ok @Adrian Moreno @Flavio Leitner
> >>
> >> Two more detailed Retis traces attached.  One is the non-working
> >> case, the same flow where I can't establish a TCP session on port 9100
> >> 172.27.16.11.42303 > 172.27.31.189.9100
> >>
> >> Then I restart Open vSwitch and try again
> >> 172.27.16.11.38793 > 172.27.31.189.9100 (this works post restart)
> >>
> >> It looks to me in the non working example that we -
> >> SEND SYN -> exits the tunnel interface genev_sys_6081 via
> >> enp148s0f0np0 - exactly as expected
> >> RECV ACK -> tcp_gro_receive then -> net:netif_receive_skb where we hit
> >> drop (TC_INGRESS)
> >>
> >> After a restart things seem to be very different
> >>
> >> Any ideas where to look next?
>
> You mentioned that you're using 5.14.0-362.8.1.el9_3.x86 kernel.
> RHEL 9.3 contains a large refactor for OVS connection tracking,
> but it doesn't contain at least one fix for this refactor:
>   
> https://github.com/torvalds/linux/commit/e6345d2824a3f58aab82428d11645e0da861ac13
>
> This may cause all sorts of incorrect packet processing in the
> kernel.  I'd suggest trying the latest upstream v6.8.7 kernel
> that has all the known fixes or try the 9.4 beta kernel that
> should have the fix mentioned above.  Downgrading to 9.2 may also
> be an option since 9.2 doesn't contain the refactoring, IIRC.
>
> I'd not recommend running with this bug present anyway.
>
> Updating your OVN 23.09.1 to 23.09.3 may also be worth trying.
> The fact that OVS restart fixes the issue may also indicate a
> problem with incremental processing in ovn-controller.
> Next time the issue happens try to force flow recompute with
>   ovn-appctl -t ovn-controller recompute
> And see if that fixes the issue.  If it does, it would be great
> to have OpenFlow dumps before and after the recompute for
> comparison.
>
> Best regards, Ilya Maximets.
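
For the before/after OpenFlow dump comparison suggested above, a minimal
sketch of how the two dumps could be compared. It assumes the dumps were
taken with "ovs-ofctl dump-flows br-int" (br-int being the usual OVN
integration bridge) and saved to two files; the file names and the helper
names normalize and diff_flows are hypothetical. Counters and timers are
stripped first, since they differ between dumps even when the flows
themselves are unchanged.

```python
import re
import sys

# Fields that change between dumps even when the flow itself is identical.
VOLATILE = re.compile(
    r"\b(duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)


def normalize(line):
    """Drop counters and timers so only match and actions are compared."""
    return VOLATILE.sub("", line).strip()


def diff_flows(before_text, after_text):
    """Return (only_before, only_after) as sorted lists of normalized flows."""
    before = {normalize(l) for l in before_text.splitlines() if "actions=" in l}
    after = {normalize(l) for l in after_text.splitlines() if "actions=" in l}
    return sorted(before - after), sorted(after - before)


# Assumed usage: python3 flowdiff.py before.txt after.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        removed, added = diff_flows(f_before.read(), f_after.read())
    for flow in removed:
        print("- " + flow)
    for flow in added:
        print("+ " + flow)
```

Lines prefixed with "-" exist only in the dump taken before the recompute,
and lines prefixed with "+" only in the one taken after. An empty result
would suggest the recompute did not change the installed flows.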
