Thank you, Ilya.

I tried reverting the commit
https://github.com/openvswitch/ovs/commit/514950d37dabebbdfa40ddf87596a7293de2d87c
in ovs-2.13.11:

https://github.com/openvswitch/ovs/commits/v2.13.11/lib/netdev-dpdk.c
https://github.com/openvswitch/ovs/commit/ad550ebc36323bac92df8bf31d7527cf5282b731

So far, all VM ports appear to be working after the upgrade. I will repeat
the upgrade and downgrade several more times to see whether it is stable, and
of course we will also test live migration and other guest VM operations.

I think this may be a solution we can accept for now, at least to keep the
existing virtual machines as they are, regardless of whether their negotiated
features are actually supported by the backend.
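For reference, the feature mask in the qemu error quoted below (16384) can be decoded with a few lines of Python. This is only a minimal sketch: the bit numbers are the ones from the virtio-net specification and the Linux virtio_net.h header, while the table and the helper name decode_feature_mask are hypothetical and exist only for this illustration.

```python
# Bit positions of the virtio-net offload features (virtio spec /
# include/uapi/linux/virtio_net.h). Only offload-related bits are listed.
VIRTIO_NET_FEATURE_BITS = {
    0: "VIRTIO_NET_F_CSUM",
    1: "VIRTIO_NET_F_GUEST_CSUM",
    7: "VIRTIO_NET_F_GUEST_TSO4",
    8: "VIRTIO_NET_F_GUEST_TSO6",
    9: "VIRTIO_NET_F_GUEST_ECN",
    10: "VIRTIO_NET_F_GUEST_UFO",
    11: "VIRTIO_NET_F_HOST_TSO4",
    12: "VIRTIO_NET_F_HOST_TSO6",
    13: "VIRTIO_NET_F_HOST_ECN",
    14: "VIRTIO_NET_F_HOST_UFO",
}


def decode_feature_mask(mask):
    """Return the names of the feature bits set in mask (hypothetical helper)."""
    names = []
    bit = 0
    # Walk the mask from bit 0 upwards until no higher bits remain.
    while mask >> bit:
        if mask & (1 << bit):
            names.append(VIRTIO_NET_FEATURE_BITS.get(bit, f"unknown bit {bit}"))
        bit += 1
    return names


# 16384 == 1 << 14, the mask reported by qemu-kvm.
print(decode_feature_mask(16384))
```

The mask 16384 is 1 << 14, so it maps to bit 14 (counting from zero), which is VIRTIO_NET_F_HOST_UFO.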
We cannot use a higher version of OVS because of the kernel version and the
compatibility of some drivers on our OS. Also, since OVS 2.14, the OVS build
method is not well supported on our system. But we also have some new systems
that are already using the LTS version, OVS 2.17.

Regards,
LIU Yulong

On Wed, Oct 16, 2024 at 8:27 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 10/15/24 11:13, LIU Yulong wrote:
> > Hi community and experts,
> >
> > We have recently attempted to upgrade OVS 2.12+DPDK 18.11 to OVS
> > 2.13.11+DPDK 19.11.14. We then ran into a state where some virtual
> > machine network cards were down, and users were not able to start the
> > network cards inside the guest VM.
> > After investigating, we found that qemu reported errors (many, many
> > times), which means virtIO feature negotiation failed:
> > 2024-10-15T06:25:16.986398Z qemu-kvm: failed to init vhost_net for queue 0
> > vhost lacks feature mask 16384 for backend
> >
> > Which means the backend of virtIO, aka vhostuser, does not support
> > 16384 (bit 14 of the feature bits).
> > Source code definition of the bit:
> > #define VIRTIO_NET_F_HOST_UFO 14 /* Host can handle UFO in. */
> >
> > On the same host, if the HOST_UFO bit of a virtual machine is set
> > to 1, its network card cannot start. If the bit is 0, the card can be
> > started.
> >
> > We found a useful series of links:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2023-June/405829.html
> > https://bugzilla.redhat.com/show_bug.cgi?id=1845488#c5
> >
> > The conclusion seems to be that such a hot upgrade is impossible to
> > achieve. If the guest VM is not restarted, or the network card is not
> > hot-unplugged and plugged again, the user's network card will not be
> > able to work properly. This situation is unacceptable for a cloud
> > environment because we cannot require all user VMs to be restarted.
> >
> > Therefore, I'm asking here if there is a possible workaround to
> > achieve such an upgrade?
>
> Hi, unfortunately, I don't think there is a way forward that doesn't involve
> cold migration / restart / port hot-replug.
>
> The issue is that at some point we accidentally exposed UFO and a few other
> features for negotiation due to a combination of different factors. Ideally,
> those features would not be acked / negotiated, because we did not advertise
> prerequisite features. However, AFAIU, none of the virtio/vhost-net
> implementation parts, including DPDK, QEMU and the kernel, actually comply
> with the virtio-net spec: they all accept feature flags for which
> dependencies are not satisfied. So, these features end up acked by QEMU and
> the guest driver even if they are not allowed to use them. Unfortunately for
> us that means that if we do the right thing and turn these features off on
> the OVS side, we will not be able to connect to a QEMU that did already
> expose these features to the guest.
>
> As I mentioned, at some point we did expose UFO to the guest by mistake.
> Then it was fixed by the following commit:
>
> https://github.com/openvswitch/ovs/commit/514950d37dabebbdfa40ddf87596a7293de2d87c
> You may see that this patch also makes the wrong assumption for the TSO case
> that disabling checksum offload will end up with TSO/UFO not being enabled.
> Later it was fixed + worked around while trying to figure out enabling
> checksum offload by default, but we still can't really work around
> unsupported ECN. At least, nobody seems to use ECN, so that wasn't a huge
> problem so far.
>
> Unfortunately again, the fact that commit 514950d37dab breaks live migration
> and upgrades was discovered too late and reverting this commit wasn't an
> option. Also because reverting it would mean that we would start advertising
> incorrect features again, which is not good.
>
> The only way to make your VMs work without restarting / re-plugging is to
> remove VIRTIO_NET_F_HOST_UFO from the vhost_unsup_flags. But once you do
> that, you'll have to keep that broken workaround literally forever, as all
> the newly started VMs will have it negotiated and hence will have the same
> problem.
>
> This will also become a big problem once you go to OVS 3.2+ where checksum
> offload is enabled by default, so your negotiated UFO will now be allowed to
> be used by the guest and that will break OVS, because we do not support UFO
> on the OVS side and, unlike ECN, we can't really ignore it.
>
> The best available solution, I think, is to plan the upgrade and gradually
> cold-migrate (not live) VMs from nodes with the old OVS to nodes with the
> upgraded one. I'd also suggest migrating to some supported version of OVS
> instead of 2.13. OVS 3.3 LTS might be a good choice.
>
> FWIW, while an upgrade from pre-2.13 to post-2.13 is not possible without a
> restart, upgrades from 2.13+ forward should not have such issues.
>
> I had an idea that the issue could be solved by QEMU not acking features that
> do not have satisfied dependencies and clearing features with unsatisfied
> dependencies from the acked feature set during live migration. Since the
> guest is not allowed to use those anyway, it should not cause problems. And
> if the guest re-negotiates, it will receive an updated feature set without
> those unsatisfied dependencies and we can move on with our lives... But this
> requires a lot of consideration and discussion with QEMU / virtio
> maintainers. I'll start a thread on qemu-devel to check if there are issues
> with such a solution or if it is even possible or acceptable. Either way,
> such a change is unlikely to be backported to older versions of QEMU.
>
> Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
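
To make the idea of "clearing features with unsatisfied dependencies" from the reply above concrete, here is a rough Python sketch. It is not QEMU, DPDK or OVS code. The dependency table follows the offload feature requirements listed in the virtio-net specification (for example, HOST_UFO requires CSUM, and HOST_ECN requires HOST_TSO4 or HOST_TSO6), and the names REQUIRES_ANY and clear_unsatisfied are hypothetical, chosen only for this illustration.

```python
# virtio-net feature bit numbers (virtio spec / linux/virtio_net.h).
CSUM, GUEST_CSUM = 0, 1
GUEST_TSO4, GUEST_TSO6, GUEST_ECN, GUEST_UFO = 7, 8, 9, 10
HOST_TSO4, HOST_TSO6, HOST_ECN, HOST_UFO = 11, 12, 13, 14

# Feature bit -> prerequisite bits; at least one of them must be set.
REQUIRES_ANY = {
    GUEST_TSO4: (GUEST_CSUM,),
    GUEST_TSO6: (GUEST_CSUM,),
    GUEST_UFO: (GUEST_CSUM,),
    GUEST_ECN: (GUEST_TSO4, GUEST_TSO6),
    HOST_TSO4: (CSUM,),
    HOST_TSO6: (CSUM,),
    HOST_UFO: (CSUM,),
    HOST_ECN: (HOST_TSO4, HOST_TSO6),
}


def clear_unsatisfied(features):
    """Drop every feature bit whose prerequisites are not in the set.

    Repeats until nothing changes, because clearing one bit (e.g. HOST_TSO4)
    can leave another bit (e.g. HOST_ECN) without its prerequisite.
    """
    changed = True
    while changed:
        changed = False
        for bit, deps in REQUIRES_ANY.items():
            has_bit = features & (1 << bit)
            has_dep = any(features & (1 << d) for d in deps)
            if has_bit and not has_dep:
                features &= ~(1 << bit)
                changed = True
    return features


# HOST_UFO acked without CSUM, as in the failing VMs: the bit is dropped.
print(clear_unsatisfied(1 << HOST_UFO))
```

With HOST_UFO acked but CSUM absent, the function returns 0, which matches the feature set the fixed OVS backend is willing to offer. Whether such filtering is acceptable on the QEMU side is exactly the open question raised in the reply.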