Update: I reverted the commit
https://github.com/openvswitch/ovs/commit/ad550ebc36323bac92df8bf31d7527cf5282b731
on ovs-2.13.11, but it does not work; the VM ports and the vhost-user
sockets are still disconnected.
In addition, I also reverted these commits:
https://github.com/openvswitch/ovs/commit/b837d1fdc46097b1583adf9dd920876d68dbae36
https://github.com/openvswitch/ovs/commit/6f86b1054966a5ebfca6049031d3fc769200af82
They did not work either. So I suspect the commit
"userspace: Add TCP Segmentation Offload support"
https://github.com/openvswitch/ovs/commit/29cf9c1b3b9c4574df4f579c74c4e6d9ebb6d279
should be considered for reverting as well. What do you think?

Regards,
LIU Yulong

On Wed, Oct 16, 2024 at 9:42 PM LIU Yulong <liuyulong...@gmail.com> wrote:
>
> Thank you, Ilya.
>
> I tried reverting the commit
> https://github.com/openvswitch/ovs/commit/514950d37dabebbdfa40ddf87596a7293de2d87c
> in ovs-2.13.11:
> https://github.com/openvswitch/ovs/commits/v2.13.11/lib/netdev-dpdk.c
> https://github.com/openvswitch/ovs/commit/ad550ebc36323bac92df8bf31d7527cf5282b731
> Currently, all the VM ports seem to be working after the upgrade. I
> will upgrade and downgrade a few more times to see whether it is
> stable, and of course we will also test live migration and other
> guest VM operations.
> I think this may be a solution we can accept for now, at least to
> keep the existing virtual machines as they are, regardless of whether
> their negotiated features are actually supported by the backend.
>
> The reason a newer version of OVS cannot be used is the kernel
> version and the compatibility of some drivers on our OS. Also, since
> OVS 2.14, the OVS build process is not well supported on our system.
> But we do have some newer systems that are already running the OVS
> 2.17 LTS version.
>
> Regards,
> LIU Yulong
>
> On Wed, Oct 16, 2024 at 8:27 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >
> > On 10/15/24 11:13, LIU Yulong wrote:
> > > Hi community and experts,
> > >
> > > We have recently attempted to upgrade OVS 2.12 + DPDK 18.11 to
> > > OVS 2.13.11 + DPDK 19.11.14.
> > > We then encountered a state where some virtual machine network
> > > cards were down, and users were not able to bring up the network
> > > cards inside the guest VMs.
> > > After investigating, we found that qemu reported errors (many,
> > > many times) indicating that virtio feature negotiation failed:
> > > 2024-10-15T06:25:16.986398Z qemu-kvm: failed to init vhost_net for queue 0
> > > vhost lacks feature mask 16384 for backend
> > >
> > > This means the virtio backend, i.e. vhost-user, does not support
> > > feature 16384 (bit 14 in the feature bits).
> > > The source code defines this bit as:
> > > #define VIRTIO_NET_F_HOST_UFO 14 /* Host can handle UFO in. */
> > >
> > > On the same host, a virtual machine whose HOST_UFO bit is set to 1
> > > cannot start its network card, while one whose bit is 0 can.
> > >
> > > We found some useful links:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405829.html
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1845488#c5
> > >
> > > The conclusion seems to be that such a hot upgrade is impossible.
> > > Unless the guest VM is restarted or the network card is hot
> > > unplugged and re-plugged, the user's network card will not work
> > > properly. This situation is unacceptable in a cloud environment,
> > > because we cannot require all user VMs to be restarted.
> > >
> > > Therefore, I'm asking here whether there is a possible workaround
> > > to achieve such an upgrade.
> >
> > Hi.  Unfortunately, I don't think there is a way forward that
> > doesn't involve cold migration / restart / port hot-replug.
> >
> > The issue is that at some point we accidentally exposed UFO and a
> > few other features for negotiation due to a combination of different
> > factors.  Ideally, those features would not be acked / negotiated,
> > because we did not advertise the prerequisite features.
> > However, AFAIU, none of the virtio/vhost-net implementations,
> > including DPDK, QEMU and the kernel, actually comply with the
> > virtio-net spec: they accept feature flags whose dependencies are
> > not satisfied.  So these features end up acked by QEMU and the guest
> > driver even though the guest is not allowed to use them.
> > Unfortunately for us, that means that if we do the right thing and
> > turn these features off on the OVS side, we will not be able to
> > connect to a QEMU that already exposed these features to the guest.
> >
> > As I mentioned, at some point we did expose UFO to the guest by
> > mistake.  It was then fixed by the following commit:
> >
> > https://github.com/openvswitch/ovs/commit/514950d37dabebbdfa40ddf87596a7293de2d87c
> >
> > You may see that this patch also makes the wrong assumption in the
> > TSO case that disabling checksum offload will result in TSO/UFO not
> > being enabled.  This was later fixed and worked around while
> > figuring out how to enable checksum offload by default, but we still
> > can't really work around unsupported ECN.  At least nobody seems to
> > use ECN, so that hasn't been a huge problem so far.
> >
> > Unfortunately again, the fact that commit 514950d37dab breaks live
> > migration and upgrades was discovered too late, and reverting it
> > wasn't an option, also because reverting it would mean that we would
> > start advertising incorrect features again, which is not good.
> >
> > The only way to make your VMs work without restarting / re-plugging
> > is to remove VIRTIO_NET_F_HOST_UFO from vhost_unsup_flags.  But once
> > you do that, you'll have to keep that broken workaround literally
> > forever, as all newly started VMs will have it negotiated and hence
> > will have the same problem.
> >
> > This will also become a big problem once you move to OVS 3.2+, where
> > checksum offload is enabled by default: your negotiated UFO will
> > then actually be allowed to be used by the guest, and that will
> > break OVS, because we do not support UFO on the OVS side and, unlike
> > ECN, we can't really ignore it.
> >
> > The best available solution, I think, is to plan the upgrade and
> > gradually cold-migrate (not live-migrate) VMs from nodes with the
> > old OVS to nodes with the upgraded one.  I'd also suggest migrating
> > to a supported version of OVS instead of 2.13; OVS 3.3 LTS might be
> > a good choice.
> >
> > FWIW, while an upgrade from pre-2.13 to post-2.13 is not possible
> > without a restart, upgrades from 2.13+ onward should not have such
> > issues.
> >
> > I had an idea that the issue could be solved by QEMU not acking
> > features that do not have their dependencies satisfied, and by
> > clearing features with unsatisfied dependencies from the acked
> > feature set during live migration.  Since the guest is not allowed
> > to use those anyway, it should not cause problems.  And if the guest
> > re-negotiates, it will receive an updated feature set without the
> > unsatisfied dependencies, and we can move on with our lives...  But
> > this requires a lot of consideration and discussion with QEMU /
> > virtio maintainers.  I'll start a thread on qemu-devel to check
> > whether there are issues with such a solution, or whether it is even
> > possible or acceptable.  Either way, such a change is unlikely to be
> > backported to older versions of QEMU.
> >
> > Best regards, Ilya Maximets.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss