Thanks again for coming back on this, Ilya,

Another option I am looking at here is to replace the kernel path (Open
vSwitch kernel module) with OVS-DOCA, as we are using the CX6/7 card:
https://docs.nvidia.com/doca/archive/doca-v2.0.2/ovs-doca/index.html
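If we go that route, my reading of the DOCA docs is that enabling it
looks roughly like the following. Treat this as a sketch, not verified
config: the option names are taken from the doc above and may differ by
DOCA/OVS version.

```shell
# Sketch only: enable the DOCA offload layer in OVS, per the NVIDIA
# OVS-DOCA docs linked above (option names/values may vary by version).
ovs-vsctl set Open_vSwitch . other_config:doca-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# Per the documented limitations, only one insertion thread is supported.
ovs-vsctl set Open_vSwitch . other_config:n-offload-threads=1
systemctl restart openvswitch
```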

I'm trying to wrangle the documented Known Limitations:

- Only one insertion thread is supported (n-offload-threads=1).
- Only a single PF is currently supported.
- Geneve options are not supported.
- Only 250K connections are offloaded by CT offload.
- Only 8 CT zones are supported by CT offload.
- OVS restart on non-tunnel configuration is not supported. Remove all
ports before restarting.

The one that concerns me is the limit of 8 CT zones supported by CT
offload: with OVN I could potentially have many CT zones if we have
many customers colocated on the same compute node.
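To sanity-check how close we'd get to that 8-zone limit, I've been
counting distinct zone IDs in the conntrack table with a quick script.
The sample below is made up to match the `conntrack -L` output format
(addresses borrowed from the traces earlier in this thread); in
practice I'd feed it the real dump from the compute node.

```python
import re

def count_zones(conntrack_dump: str) -> int:
    """Count distinct conntrack zone IDs in `conntrack -L`-style output."""
    zones = set(re.findall(r'\bzone=(\d+)', conntrack_dump))
    return len(zones)

# Fabricated sample in `conntrack -L` format, for illustration only.
sample = """\
tcp 6 431999 ESTABLISHED src=172.27.16.11 dst=172.27.31.189 sport=38793 dport=9100 zone=195
tcp 6 120 SYN_SENT src=172.27.16.12 dst=172.27.31.189 sport=42303 dport=9100 zone=196
udp 17 30 src=10.0.0.1 dst=10.0.0.2 sport=53 dport=53 zone=195
"""
print(count_zones(sample))  # 2 distinct zones in this sample
```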

Seems any way I turn I'm getting kicked :D

Gav

On Fri, 26 Apr 2024 at 12:42, Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 4/26/24 20:12, Gavin McKee wrote:
> > Thanks for coming back to me on this.
> >
> > Moving kernel versions around is not a straightforward option here,
> > especially when you are using hardware offload.  The OFED driver
> > version is coupled to the kernel, so if we move away from that we are
> > out of support coverage.
> >
> > Doing an ovn-appctl -t ovn-controller recompute does not resolve the
> > problem; again, only taking a big hammer like restarting Open vSwitch
> > does.
> >
> > How would we proceed here?  Are there any Open vSwitch kernel module
> > patches we could try to get a resolution?
>
> You can try the commit I linked below.  That will mean that you'll
> need to re-build your kernel.  There is no other way.
>
> In the past we had an out-of-tree module, but it has been deprecated
> for a long time, contains multiple issues, and is unlikely to work with
> new kernels, especially heavily modified ones like RHEL kernels.
>
> Note that the issue is not localized to OVS, but affects TC as well
> as they now share the NAT implementation.  So, even if just swapping
> the openvswitch kernel module was possible, it wouldn't help much.
>
> >
> > One option we are looking at is regressing the entire stack back to
> > Rocky 9.1.
>
> This may be an option.  The bug I mentioned in a previous email exists
> in RHEL 9.3, so it exists in Rocky 9.3 as well, at least it should
> since they claim "bug-for-bug compatibility".  So, 3 options to fix this
> particular bug (I don't know if it is causing your issue, but it is
> a severe bug that can potentially be a cause):
>
> 1. Re-build the kernel to include the fix.
> 2. Downgrade from 9.3 to an earlier release.
> 3. Wait for 9.4.
>
> Best regards, Ilya Maximets.
>
> >
> > Gav
> >
> > On Wed, 24 Apr 2024 at 04:44, Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>
> >> On 4/23/24 17:39, Gavin McKee wrote:
> >>> If you look at both traces (non-working and working), the thing
> >>> that stands out to me is this:
> >>>
> >>> At line 10 in the working file, the following entry exists:
> >>>     ct_state NEW tcp (SYN_SENT) orig [172.27.16.11.38793 >
> >>> 172.27.31.189.9100] reply [172.27.31.189.9100 > 172.27.16.11.38793]
> >>> zone 195
> >>>
> >>> This doesn't happen in the non-working file; I just see the following:
> >>>
> >>> 3904992932904126 [swapper/125] 0 [kr] queue_userspace_packet
> >>> #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 4
> >>>     if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> >>> 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> >>> (6) flags [S] seq 2266263186 win 42340
> >>>     upcall_enqueue (miss) (125/3904992932890052) q 3682874017 ret 0
> >>>   + 3904992932907247 [swapper/125] 0 [kr] ovs_dp_upcall
> >>> #ddf92049d47aeff1d0b6625620000 (skb 18382861792850905088) n 5
> >>>     if 38 (enp148s0f0_1) rxif 38 172.27.16.11.42303 >
> >>> 172.27.31.189.9100 ttl 64 tos 0x0 id 36932 off 0 [DF] len 52 proto TCP
> >>> (6) flags [S] seq 2266263186 win 42340
> >>>     upcall_ret (125/3904992932890052) ret 0
> >>>
> >>> I am wondering if a failure to track the ct_state SYN is causing
> >>> the returning ACK to be dropped?
> >>>
> >>>   + 3904992936344421 [swapper/125] 0 [tp] skb:kfree_skb
> >>> #ddf9204d21c48ff1d0b676c330c00 (skb 18382861792850913792) n 3 drop
> >>> (TC_INGRESS)
> >>>     if 33 (genev_sys_6081) rxif 33 172.27.31.189.9100 >
> >>> 172.27.16.11.42303 ttl 64 tos 0x0 id 0 off 0 [DF] len 52 proto TCP (6)
> >>> flags [S.] seq 605271182 ack 2266263187 win 42340
> >>>
> >>> On Mon, 22 Apr 2024 at 18:54, Gavin McKee <gavmcke...@googlemail.com> 
> >>> wrote:
> >>>>
> >>>> Ok @Adrian Moreno @Flavio Leitner
> >>>>
> >>>> Two more detailed Retis traces attached.  One is not working: the
> >>>> same session where I can't establish a TCP connection on port 9100,
> >>>> 172.27.16.11.42303 > 172.27.31.189.9100
> >>>>
> >>>> Then I restart Open vSwitch and try again:
> >>>> 172.27.16.11.38793 > 172.27.31.189.9100 (this works post restart)
> >>>>
> >>>> It looks to me like, in the non-working example, we:
> >>>> SEND SYN -> exits the tunnel interface genev_sys_6081 via
> >>>> enp148s0f0np0, exactly as expected
> >>>> RECV ACK -> tcp_gro_receive, then net:netif_receive_skb, where we
> >>>> hit drop (TC_INGRESS)
> >>>>
> >>>> After a restart things seem to be very different
> >>>>
> >>>> Any ideas where to look next?
> >>
> >> You mentioned that you're using 5.14.0-362.8.1.el9_3.x86 kernel.
> >> RHEL 9.3 contains a large refactor for OVS connection tracking,
> >> but it doesn't contain at least one fix for this refactor:
> >>   
> >> https://github.com/torvalds/linux/commit/e6345d2824a3f58aab82428d11645e0da861ac13
> >>
> >> This may cause all sorts of incorrect packet processing in the
> >> kernel.  I'd suggest trying the latest upstream v6.8.7 kernel
> >> that has all the known fixes or try the 9.4 beta kernel that
> >> should have the fix mentioned above.  Downgrading to 9.2 may also
> >> be an option since 9.2 doesn't contain the refactoring, IIRC.
> >>
> >> I'd not recommend running with this bug present anyway.
> >>
> >> Updating your OVN 23.09.1 to 23.09.3 may also be worth trying.
> >> The fact that OVS restart fixes the issue may also indicate a
> >> problem with incremental processing in ovn-controller.
> >> Next time the issue happens try to force flow recompute with
> >>   ovn-appctl -t ovn-controller recompute
> >> And see if that fixes the issue.  If it does, it would be great
> >> to have OpenFlow dumps before and after the recompute for
> >> comparison.
> >>
> >> Best regards, Ilya Maximets.
>
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
