On Mon, Jul 30, 2018 at 04:10:04PM +0800, Tiwei Bie wrote:
> On Fri, Jul 27, 2018 at 02:03:00PM -0600, Alex Williamson wrote:
> > On Fri, 27 Jul 2018 09:58:05 +0800
> > Tiwei Bie <tiwei....@intel.com> wrote:
> >
> > > On Thu, Jul 26, 2018 at 02:45:39PM -0600, Alex Williamson wrote:
> > > > On Mon, 23 Jul 2018 12:59:56 +0800
> > > > Tiwei Bie <tiwei....@intel.com> wrote:
> > > >
> > > [...]
> > > > >
> > > > > +static int vhost_user_slave_handle_vfio_group(struct vhost_dev *dev,
> > > > > +                                               int *fd)
> > > > > +{
> > > > > +    struct vhost_user *u = dev->opaque;
> > > > > +    VhostUserState *user = u->user;
> > > > > +    VirtIODevice *vdev = dev->vdev;
> > > > > +    int groupfd = fd[0];
> > > > > +    VFIOGroup *group;
> > > > > +
> > > > > +    if (!virtio_has_feature(dev->protocol_features,
> > > > > +                            VHOST_USER_PROTOCOL_F_VFIO_GROUP) ||
> > > > > +        vdev == NULL) {
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    if (user->vfio_group) {
> > > > > +        vfio_put_group(user->vfio_group);
> > > > > +        user->vfio_group = NULL;
> > > > > +    }
> > > > > +
> > > There should be a check like the following here (I missed it in this
> > > patch, sorry..):
> > >
> > > if (groupfd < 0) {
> > >         return 0;
> > > }
> > >
> > > > > +    group = vfio_get_group_from_fd(groupfd, vdev->dma_as, NULL);
> > > > > +    if (group == NULL) {
> > > > > +        return -1;
> > > > > +    }
> > > > > +
> > > > > +    if (group->fd != groupfd) {
> > > > > +        close(groupfd);
> > > > > +    }
> > > > > +
> > > > > +    user->vfio_group = group;
> > > > > +    fd[0] = -1;
> > > > > +
> > > > > +    return 0;
> > > > > +}
> > > >
> > > > This all looks very sketchy, we're reaching into vfio internal state
> > > > and arbitrarily releasing data structures, reusing existing ones, and
> > > > maybe creating new ones. We know that a vfio group only allows a
> > > > single open, right?
> > >
> > > Yeah, exactly!
> > >
> > > When the vhost-user backend process has opened a VFIO group fd,
> > > QEMU won't be able to open this group file via open() any more.
> > > So the vhost-user backend process will share the VFIO group fd to
> > > QEMU via the UNIX socket. And that's why we're introducing a
> > > new API:
> > >
> > > vfio_get_group_from_fd(groupfd, ...);
> > >
> > > for vfio/common.c to get (set up) a VFIOGroup structure. (Because
> > > in this case, vfio/common.c can't open this group file via open(),
> > > and what we have is a VFIO group fd shared by another process via
> > > the UNIX socket). And ...
> > > >
> > > > So if you have a groupfd, vfio will never have
> > > > that same group opened already.
> > >
> > > ... if the same vhost-user backend process shares the same VFIO
> > > group fd more than once via different vhost-user slave channels
> > > to different vhost-user frontends in the same QEMU process, the
> > > same VFIOGroup could exist already.
> > >
> > > This case could happen when e.g. there are two vhost accelerator
> > > devices in the same VFIO group, and they are managed by the same
> > > vhost-user backend process, and used to accelerate two different
> > > virtio devices in QEMU. In this case, the same VFIO group fd could
> > > be sent twice.
> > >
> > > If we need to support this case, more changes in vfio/common.c
> > > will be needed, e.g. 1) add a refcnt in VFIOGroup to support
> > > getting and putting a VFIOGroup structure multiple times,
> > > 2) add a lock to make vfio_get_group*() and vfio_put_group()
> > > thread-safe.
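
For illustration only, a minimal sketch of the refcnt + lock idea described
above. The VFIOGroupRef structure, the group_get()/group_put() names and the
mutex are assumptions made up for this sketch, not the existing hw/vfio code,
which currently tracks users through the group's device_list:

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "qemu/thread.h"

/* Hypothetical refcounted wrapper, not the real VFIOGroup. */
typedef struct VFIOGroupRef {
    int fd;                              /* group fd received over the socket */
    int refcnt;                          /* number of users holding this group */
    QLIST_ENTRY(VFIOGroupRef) next;
} VFIOGroupRef;

static QemuMutex group_lock;             /* assumed initialized at startup */
static QLIST_HEAD(, VFIOGroupRef) group_list =
    QLIST_HEAD_INITIALIZER(group_list);

static VFIOGroupRef *group_get(int groupfd)
{
    VFIOGroupRef *group;

    qemu_mutex_lock(&group_lock);
    QLIST_FOREACH(group, &group_list, next) {
        if (group->fd == groupfd) {
            group->refcnt++;             /* same group shared again: take a ref */
            qemu_mutex_unlock(&group_lock);
            return group;
        }
    }
    group = g_new0(VFIOGroupRef, 1);     /* first user: set the group up */
    group->fd = groupfd;
    group->refcnt = 1;
    QLIST_INSERT_HEAD(&group_list, group, next);
    qemu_mutex_unlock(&group_lock);
    return group;
}

static void group_put(VFIOGroupRef *group)
{
    qemu_mutex_lock(&group_lock);
    if (--group->refcnt == 0) {          /* last user: tear the group down */
        QLIST_REMOVE(group, next);
        close(group->fd);
        g_free(group);
    }
    qemu_mutex_unlock(&group_lock);
}

With something like this, a second vhost-user frontend sharing the same group
fd would just take another reference instead of interfering with the
device_list-based teardown.
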
> >
> > This is the sort of case where it seems like we're just grabbing
> > internal interfaces and using them from other modules. VFIOGroups have
> > a device_list and currently handle multiple puts by testing whether
> > that device list is empty (ie. we already have a refcnt effectively).
> > Now we have a group user that has no devices, which is not something
> > that it seems like we've designed or audited the code for handling.
> > IOW, while the header file lives in a common location, it's not really
> > intended to be an API within QEMU and it makes me a bit uncomfortable
> > to use it as such.
> >
> > > > Is that the goal? It's not obvious in
> > > > the code. I don't really understand what vhost goes on to do with this
> > > > group,
> > >
> > > The goal is that we have a VFIO group opened by an external
> > > vhost-user backend process; this process will manage the VFIO
> > > device, but wants QEMU to manage the VFIO group, including:
> > >
> > > 1 - program the DMA mappings for this VFIO group based on
> > >     the front-end device's dma_as in QEMU.
> > >
> > > 2 - add this VFIO group to KVM (KVM_DEV_VFIO_GROUP_ADD).
> > >
> > > Vhost-user in QEMU as the vhost-user frontend just wants
> > > hw/vfio to do the above things after receiving a VFIO group fd
> > > from the vhost-user backend process.
> > >
> > > Vhost-user will just ask hw/vfio to get (set up) a VFIOGroup
> > > by calling vfio_get_group_from_fd() or put this VFIOGroup by
> > > calling vfio_put_group() when it's not needed anymore, and
> > > won't do anything else.
> > >
> > > Vhost-user will only deal with the VFIOGroups allocated by
> > > itself, and won't influence any other VFIOGroups used by the
> > > VFIO passthru devices (-device vfio-pci), because the same
> > > VFIO group file can only be opened by one process via open().
> >
> > But there's nothing here that guarantees the reverse. vhost cannot
> > open the group of an existing vfio-pci device, but vfio-pci can re-use
> > the group that vhost has already opened. This is again where it seems
> > like we're trying to cherry pick from an internal API and leaving some
> > loose ends dangling.
> >
> > > > but I'm pretty sure the existing state machine in vfio isn't
> > > > designed for it. Thanks,
> > >
> > > Yeah, more changes (e.g. refcnt, ..) in hw/vfio are needed to
> > > make it work robustly. Hmm.. Is the above idea (e.g. how vhost-user
> > > is going to use VFIOGroup and how we are going to extend hw/vfio)
> > > acceptable to you? Thanks!
> >
> > I certainly would not want to duplicate managing the group, container,
> > and MemoryListener, but I think we need a more formal interface with at
> > least some notion of reference counting beyond the device list or
> > explicit knowledge of an external user.
>
> Got it! I'll try to address this. Thanks!
>
> > I'm also curious if it really
> > makes sense that the vhost backend is opening the group rather than
> > letting QEMU do it. It seems that vhost wants to deal with the device
> > and we can never have symmetry in the scenario above, where a group
> > might contain multiple devices and order matters because of where the
> > group is opened. If we still had a notion of a VFIODevice for an
> > external user, then the device_list refcnt would just work. Thanks,
>
> The vhost-user backend will open the VFIO fds, but QEMU
> won't, because QEMU just has a vhost-user UNIX socket
> path that it should connect to or listen on. So QEMU
> doesn't know which group and device it should open.
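
For reference, point 2 above (adding the group to KVM) boils down to setting
a KVM_DEV_VFIO_GROUP_ADD attribute on the kvm-vfio pseudo device. A
standalone sketch follows; kvm_vfio_dev_fd is assumed to be an
already-created KVM_DEV_TYPE_VFIO device fd, and QEMU actually wraps this
ioctl in its own KVM device helpers rather than calling it directly:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Tell KVM that a VFIO group is in use by this VM. */
static int kvm_vfio_group_add(int kvm_vfio_dev_fd, int groupfd)
{
    struct kvm_device_attr attr = {
        .group = KVM_DEV_VFIO_GROUP,
        .attr  = KVM_DEV_VFIO_GROUP_ADD,
        .addr  = (uint64_t)(unsigned long)&groupfd,  /* pointer to the group fd */
    };

    return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}
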
> The vhost accelerator devices are probed and managed
> in the vhost-user backend process, and the vhost-user
> backend process will bind the vhost-user instances
> and vhost accelerator devices based on some sort of
> user input.
>
> Best regards,
> Tiwei Bie
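
To make the fd passing concrete: the backend hands the already-open group fd
to QEMU as SCM_RIGHTS ancillary data on the vhost-user socket. A bare-bones
sketch of the sending side; the send_group_fd() name and the one-byte payload
are placeholders, since the real protocol carries the fd alongside a proper
VhostUserMsg:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass an open fd (e.g. a VFIO group fd) over a connected UNIX socket. */
static int send_group_fd(int sockfd, int groupfd)
{
    char payload = 0;                    /* placeholder one-byte payload */
    struct iovec iov = { .iov_base = &payload, .iov_len = sizeof(payload) };
    char control[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = control,
        .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cmsg;

    memset(control, 0, sizeof(control));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;       /* the ancillary data carries the fd */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &groupfd, sizeof(int));

    return sendmsg(sockfd, &msg, 0) < 0 ? -1 : 0;
}

On the receiving side QEMU ends up with its own copy of the descriptor, so
the group fd stays valid in QEMU even if the backend process later goes away.
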
Is the reason it's done like this because the backend is sharing the
device with multiple QEMUs? I generally wonder how restarts of the
backend are handled with this approach: closing the VFIO device tends
to reset the whole device.

-- 
MST