On 06/06/2014 09:36 AM, Alexander Graf wrote:
>
> On 06.06.14 01:17, Alexey Kardashevskiy wrote:
>> On 06/06/2014 02:51 AM, Alexander Graf wrote:
>>> On 05.06.14 16:33, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>>>>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>>>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>>>>> allocating TCE tables in host kernel memory and handling H_PUT_TCE
>>>>>>>>>>> requests targeted at a specific LIOBN (logical bus number) right in
>>>>>>>>>>> the host without switching to QEMU. At the moment this is used for
>>>>>>>>>>> emulated devices only and the handler only puts the TCE into the
>>>>>>>>>>> table. If the in-kernel H_PUT_TCE handler finds a LIOBN and the
>>>>>>>>>>> corresponding table, it will put the TCE into the table and
>>>>>>>>>>> complete hypercall execution. User space will not be notified.
>>>>>>>>>>>
>>>>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>>>>> class, so KVM_CAP_SPAPR_TCE is going to be used as well. That means
>>>>>>>>>>> that TCE tables for VFIO are going to be allocated in the host as
>>>>>>>>>>> well. However, VFIO operates with real IOMMU tables, and simply
>>>>>>>>>>> copying a TCE into the real hardware TCE table will not work, as
>>>>>>>>>>> guest physical to host physical address translation is required.
>>>>>>>>>>>
>>>>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had
>>>>>>>>>>> better not register VFIO's TCE tables in the host.
>>>>>>>>>>>
>>>>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device which
>>>>>>>>>>> tells sPAPRTCETable not to try allocating the TCE table in the host
>>>>>>>>>>> kernel; instead, the table will be created in QEMU.
>>>>>>>>>>>
>>>>>>>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let
>>>>>>>>>>> users choose whether to use acceleration or not. At the moment it
>>>>>>>>>>> is enabled for VIO and emulated PCI. Upcoming VFIO support will set
>>>>>>>>>>> it to false.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> This is a workaround, but it lets me have one IOMMU device for VIO,
>>>>>>>>>>> emulated PCI and VFIO, which is a good thing.
>>>>>>>>>>>
>>>>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>>>>> capability, but this needs a kernel update.
>>>>>>>>>> Never mind, I'll make it a capability. I'll post the capability
>>>>>>>>>> reservation patch separately.
>>>>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true
>>>>>>>>> for vfio and false for emulated devices. Then the spapr_iommu file
>>>>>>>>> can check on the capability (and default to false for now, since it
>>>>>>>>> doesn't exist yet).
>>>>>>>> Is that ok if the flag does not have anything to do with VFIO per
>>>>>>>> se? :)
>>>>>>> The flag means "use in-kernel acceleration if the vfio coupling
>>>>>>> capability is available", no?
>>>>>> It is a flag of sPAPRTCETable, which is not supposed to know about VFIO
>>>>>> at all; it is just an IOMMU. But if you are ok with it, I have no reason
>>>>>> to be unhappy either :)
>>>>>>
>>>>>>>>> That way you don't have to reserve a CAP today.
>>>>>>>> Why exactly can't we do that today?
>>>>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>>>>> Maybe we realize during patch review that we need completely different
>>>>>>> CAPs.
>>>>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to become
>>>>>> available in the kernel.
>>>>> So all you need are 64bit TCEs with bus_offset?
>>>> No. I need 64bit IOBAs, a.k.a. PCI bus addresses. The default DMA window
>>>> is just 1 or 2GB and it is mapped at 0 on the PCI bus.
>>>>
>>>> TCEs are 64 bit already.
>>> Ok, so the guest has to tell the PCI device to write to a specific window.
>>> That's a shame :).
>> No. The guest tells the device some address, that's it. The guest allocates
>> those addresses from some window which the host, the guest and the PHB know
>> about, but not the device. What is a shame here?
>
> It would be nicer if the guest had full control over the virtual address
> range of a PCI device.
>
>>>>> What about the missing in-kernel modification of the shadow TCEs on
>>>>> H_PUT_TCE? I thought that's what this is really about.
>>>> This I do not understand :(
>>> How does real mode H_PUT_TCE emulation know that it needs to notify user
>>> space to establish the map?
>> If it wants to pass control to user space, it returns H_TOO_HARD. This
>> happens, for example, if the LIOBN was not registered in KVM.
>
> So how does KVM_CAP_SPAPR_TCE_64 help here? With KVM_CAP_SPAPR_TCE_64 we
> can still not map VFIO devices' TCE tables because we're missing all the
> magic to link the virtual TCE table to a physical TCE table.
It does not help here indeed, I did not say it would ;) I just wanted to do
the preparations first, and that means I need to reserve capability numbers
(which is normally a very tough process). Since one capability is
straightforward to implement, I included it in this set.


--
Alexey
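
For illustration, below is a minimal, self-contained C sketch of the
table-placement decision discussed above: a per-table vfio_accel flag plus a
capability probe that simply reports "unsupported" until a VFIO-coupling
capability exists in the kernel, so VFIO-backed tables stay in QEMU while
emulated devices keep the in-kernel fast path. All names, types and messages
here are illustrative stand-ins, not the actual QEMU/KVM code or API.

/* Simplified sketch of the suggested "vfio_accel" flag; not QEMU code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the bits of sPAPRTCETable state that matter here. */
typedef struct {
    uint32_t liobn;      /* logical bus number the table answers to    */
    uint32_t nb_table;   /* number of TCE entries                      */
    bool     vfio_accel; /* table may back a VFIO device               */
    uint64_t *table;     /* userspace copy when not kernel-allocated   */
    bool     in_kernel;  /* true if the kernel holds the real table    */
} TceTable;

/* Until the kernel can couple H_PUT_TCE with VFIO, report the coupling
 * capability as absent.  A real implementation would query KVM (e.g. via
 * kvm_check_extension()) against the not-yet-reserved capability number. */
static bool kvm_has_vfio_tce_coupling(void)
{
    return false;
}

/* Placeholder for the existing KVM_CAP_SPAPR_TCE-backed allocation
 * (conceptually what KVM_CREATE_SPAPR_TCE provides). */
static bool kvm_alloc_tce_table(TceTable *t)
{
    printf("LIOBN 0x%x: TCE table allocated in the kernel\n", t->liobn);
    return true;
}

static TceTable *tce_table_new(uint32_t liobn, uint32_t nb_table,
                               bool vfio_accel)
{
    TceTable *t = calloc(1, sizeof(*t));

    t->liobn = liobn;
    t->nb_table = nb_table;
    t->vfio_accel = vfio_accel;

    /* In-kernel acceleration is only safe for emulated devices today:
     * a VFIO-backed table must stay in QEMU until the kernel can turn
     * guest-physical TCEs into real IOMMU entries. */
    if (!vfio_accel || kvm_has_vfio_tce_coupling()) {
        t->in_kernel = kvm_alloc_tce_table(t);
    }
    if (!t->in_kernel) {
        t->table = calloc(nb_table, sizeof(uint64_t));
        printf("LIOBN 0x%x: TCE table kept in userspace\n", liobn);
    }
    return t;
}

int main(void)
{
    tce_table_new(0x80000001, 1024, false); /* emulated PCI: kernel table */
    tce_table_new(0x80000002, 1024, true);  /* VFIO: userspace fallback   */
    return 0;
}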
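
The H_TOO_HARD fallback mentioned above can be sketched the same way: the
in-kernel real-mode H_PUT_TCE handler only services LIOBNs that were
registered with it; any other LIOBN (for example a VFIO-backed table kept in
QEMU) is bounced back so that user space completes the hypercall. This is a
standalone, simplified sketch; the constants and data structures are
stand-ins and do not match the kernel's actual definitions.

/* Simplified sketch of the real-mode H_PUT_TCE dispatch; not kernel code. */
#include <stdint.h>
#include <stdio.h>

#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define H_TOO_HARD   (-1)  /* stand-in: "let user space finish this hcall" */

#define MAX_TABLES   4

struct kvm_tce_table {
    uint32_t liobn;
    uint32_t nb_entries;
    uint64_t tce[256];
};

static struct kvm_tce_table tables[MAX_TABLES];
static int nr_tables;

/* Register a table for in-kernel handling (conceptually what
 * KVM_CREATE_SPAPR_TCE does for emulated devices). */
static void register_liobn(uint32_t liobn, uint32_t nb_entries)
{
    tables[nr_tables].liobn = liobn;
    tables[nr_tables].nb_entries = nb_entries;
    nr_tables++;
}

/* The fast path: put the TCE if we know the LIOBN, otherwise admit defeat
 * and let the exit path hand the hypercall to QEMU. */
static long h_put_tce(uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    for (int i = 0; i < nr_tables; i++) {
        if (tables[i].liobn == liobn) {
            uint64_t idx = ioba >> 12;        /* assuming 4K IOMMU pages */
            if (idx >= tables[i].nb_entries) {
                return H_PARAMETER;           /* IOBA out of the window  */
            }
            tables[i].tce[idx] = tce;
            return H_SUCCESS;
        }
    }
    return H_TOO_HARD;                        /* unknown LIOBN: punt     */
}

int main(void)
{
    register_liobn(0x80000001, 256);          /* emulated, kernel-handled */

    printf("emulated LIOBN: %ld\n", h_put_tce(0x80000001, 0x1000, 0x1234));
    printf("VFIO LIOBN:     %ld\n", h_put_tce(0x80000002, 0x1000, 0x1234));
    return 0;
}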