On Thu, Feb 04, 2016 at 03:15:39PM -0500, Alex Williamson wrote: > > > ----- Original Message ----- > > On Thu, Feb 04, 2016 at 10:46:52AM -0700, Alex Williamson wrote: > > > On Thu, 4 Feb 2016 13:21:57 +0200 > > > "Michael S. Tsirkin" <m...@redhat.com> wrote: > > > > > > > On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote: > > > > > > > > > > On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote: > > > > > >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote: > > > > > >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote: > > > > > >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote: > > > > > >>>>From: Chen Fan <chen.fan.f...@cn.fujitsu.com> > > > > > >>>> > > > > > >>>>For now, for vfio pci passthough devices when qemu receives > > > > > >>>>an error from host aer report, currentlly just terminate the > > > > > >>>>guest, but usually user want to know what error occurred but > > > > > >>>>stopping the guest, so this patches add aer capability support > > > > > >>>>for vfio device, and pass the error to guest, and have guest > > > > > >>>>driver to recover from the error. > > > > > >>>I would like to see a version of this patchset that doesn't > > > > > >>>depend on pci core changes. > > > > > >>>I think that if you make this simplifying assumption: > > > > > >>> > > > > > >>>- all devices on same bus in guest are on same bus in host > > > > > >>> > > > > > >>>then you can handle both reset and hotplug simply in function 0 > > > > > >>>since it will belong to vfio. > > > > > >>> > > > > > >>>So we can have a version without pci core changes that simply > > > > > >>>assumes this, and things will just work. > > > > > >>> > > > > > >>> > > > > > >>>Now, if we wanted to enforce this limitation, I think the > > > > > >>>cleanest way would be to add a callback in struct PCIDevice: > > > > > >>> > > > > > >>> bool is_valid_function(PCIDevice *newfunction) > > > > > >>> > > > > > >>>and call it as each function is added. > > > > > >>>This way aer function can validate that each function > > > > > >>>added shares the same bus. > > > > > >>>And this way issues will be detected directly and not when > > > > > >>>function 0 is added. > > > > > >>> > > > > > >>>I would prefer this validation code to be a patch on top so we > > > > > >>>can merge the functionality directly and avoid blocking it while > > > > > >>>we figure out the best api to validate things. > > > > > >>> > > > > > >>>I don't see why making guest topology match host would > > > > > >>>ever be a problem, but if it's required to support > > > > > >>>configurations where these differ, I'd like to see > > > > > >>>an attempt to address that be split out, after aer > > > > > >>>is supported. > > > > > >>Hi Michael, > > > > > >> > > > > > >>Just think about this more, I think we also should check the vfio > > > > > >>devices whether on the same bus at the time of function 0 is > > > > > >>added. because we don't know the affected devices by a bus reset > > > > > >>have already all been assigned to VM. > > > > > >This is something vfio in kernel should check. > > > > > >You can't rely on qemu being well behaved, so don't > > > > > >even try to catch cases which would break host in userspace. > > > > > > > > > > > >qemu should only worry about not breaking guest. > > > > > > > > > > > > > > > > > >>for example, the multi-function's hotplug. > > > > > >>devices on same bus in host are added to VM one by one. when we > > > > > >>test one device, we haven't yet added the other devices. > > > > > >>so I think > > > > > >>the patch should like below. then we could add a > > > > > >>vfio_is_valid_function in vfio > > > > > >>to test each device whether the affected devices on the same bus. > > > > > >> > > > > > >>Thanks, > > > > > >>Chen > > > > > >> > > > > > >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > > > > >>index d940f79..7163b56 100644 > > > > > >>--- a/hw/pci/pci.c > > > > > >>+++ b/hw/pci/pci.c > > > > > >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus, > > > > > >>int bus_num, uint8_t devfn) > > > > > >> return bus->devices[devfn]; > > > > > >> } > > > > > >> > > > > > >>+static int pci_bus_check_devices(PCIBus *bus) > > > > > >>+{ > > > > > >>+ PCIDeviceClass *pc; > > > > > >>+ int i, ret = 0; > > > > > >>+ > > > > > >>+ for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) { > > > > > >>+ if (!bus->devices[i]) { > > > > > >>+ continue; > > > > > >>+ } > > > > > >>+ > > > > > >>+ pc = PCI_DEVICE_GET_CLASS(bus->devices[i]); > > > > > >>+ if (!pc->is_valid_func) { > > > > > >>+ continue; > > > > > >>+ } > > > > > >>+ > > > > > >>+ ret = pc->is_valid_func(bus->devices[i], bus); > > > > > >>+ if (!ret) { > > > > > >>+ return -1; > > > > > >>+ } > > > > > >>+ } > > > > > >>+ return 0; > > > > > >>+} > > > > > >>+ > > > > > >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus) > > > > > >>+{ > > > > > >>+ if (pdev->bus == bus) { > > > > > >>+ return true; > > > > > >>+ } > > > > > >>+ > > > > > >>+ return false; > > > > > >>+} > > > > > >>+ > > > > > >I don't really understand what is this one doing. > > > > > >Why do we need a default function? > > > > > if the vfio driver in kernel can handle the bus reset for any one > > > > > device in qemu without the affected devices assigned. I think > > > > > we don't need this default one. > > > > > BTW, IIRC at present the devices on the same bus in host can > > > > > be assigned to different VM, so if we want to support this kind of > > > > > bus reset for an independent device when enable aer, aren't we > > > > > limiting the case that others devices on the same bus must be > > > > > assigned to current VM? > > > > > > > > > > Thanks, > > > > > Chen > > > > > > > > I don't believe this works at the moment, and > > > > I'd expect kernel to prevent this, > > > > so we should not rely on userspace code for this. > > > > Alex, could you comment please? > > > > > > DMA isolation and bus isolation are separate things. So long as > > > devices on the same bus are DMA isolated then they can be assigned to > > > separate VMs or split between host and VM. However, certain features > > > like bus reset are not available to the user unless they "own" all of > > > the DMA isolated sets affected by a bus reset. The kernel doesn't care > > > how the user is using them, but they must prove they own them through > > > vfio group file descriptors. > > > > OK this makes sense. > > So I think the solution is for userspace to make sure bus reset > > is available before exposing aer to guest. > > For example, attempt a bus reset. > > The API includes a mechanism for retrieving the affected devices without > making it necessary to actually perform a bus reset. > > > > I thought in previous discussions we decided that unused devices made > > > the problem set more complicated for userspace so we simplified by > > > requiring them to be assigned. For instance imagine a two function > > > device, with DMA isolation between functions, where we only want one > > > function assigned to the VM. QEMU would need to learn to take ownership > > > of the other function without exposing it to the VM simply for the > > > purpose of being able to perform a bus reset. > > > > Hmm this might be a security problem. > > How so? We don't simply simply trust that the user previously validated > the ability to do a bus reset and allow it any time they ask, each reset > needs to pass the file descriptors related to the affected devices.
Not a problem. A risk. Simply passing in each extra function to guest increases an attack surface. We should not expose devices to guest if it's not strictly required. > > Ideally we should not touch devices we don't *need* to touch. > > But given kernel requires this at the moment (IIUC) > > QEMU could open the device > > but not expose it to guest until actually asked. > > Well, a bus reset counts as needing to touch the device. I don't see > how it could not. Requiring QEMU to add a new mode of holding a device > only for ownership purposes sounds like an expansion of the scope of > these changes, which is exactly what we don't need at this point. All this can be done gradually, I have no problem with that. > > > Another simplification > > > was to expose them in the same slot, so we don't need to worry about VM > > > configurations where one device could be hot-unplugged and the > > > ownership released, breaking QEMU's ability to do a bus reset on the > > > remaining device. > > > > Same would apply here. > > Again, expanding scope versus configuration requirement, I take the > latter. Thanks, > > Alex