On 10/06/16 16:50, David Gibson wrote:
> On Thu, Jun 09, 2016 at 04:47:59PM +1000, Alexey Kardashevskiy wrote:
>> On 23/03/16 14:03, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 11:34:55AM +1100, Alexey Kardashevskiy wrote:
>>>> Uff, lost cc: list. Added back. Some comments below.
>>>>
>>>>
>>>> On 03/21/2016 04:19 PM, David Gibson wrote:
>>>>> On Fri, Mar 18, 2016 at 11:12:26PM +1100, Alexey Kardashevskiy wrote:
>>>>>> On March 15, 2016 17:29:26 David Gibson <da...@gibson.dropbear.id.au> 
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 10:09:50AM +1100, Alexey Kardashevskiy wrote:
>>>>>>>> On 03/10/2016 04:21 PM, David Gibson wrote:
>>>>>>>>> On Wed, Mar 09, 2016 at 08:20:12PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 03/09/2016 04:45 PM, David Gibson wrote:
>>>>>>>>>>> On Mon, Mar 07, 2016 at 02:41:17PM +1100, Alexey Kardashevskiy 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>>>>>>>>>>>> via hypercalls which take a logical bus id (LIOBN) as a target 
>>>>>>>>>>>> IOMMU
>>>>>>>>>>>> identifier. LIOBNs are made up, advertised to guest systems and
>>>>>>>>>>>> linked to IOMMU groups by the user space.
>>>>>>>>>>>> In order to enable acceleration for IOMMU operations in KVM, we 
>>>>>>>>>>>> need
>>>>>>>>>>>> to tell KVM the information about the LIOBN-to-group mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>>>>>>>>>>>> is added which accepts:
>>>>>>>>>>>> - a VFIO group fd and IO base address to find the actual hardware
>>>>>>>>>>>> TCE table;
>>>>>>>>>>>> - a LIOBN to assign to the found table.
>>>>>>>>>>>>
>>>>>>>>>>>> Before notifying KVM about new link, this check the group for being
>>>>>>>>>>>> registered with KVM device in order to release them at unexpected 
>>>>>>>>>>>> KVM
>>>>>>>>>>>> finish.
>>>>>>>>>>>>
>>>>>>>>>>>> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the 
>>>>>>>>>>>> user
>>>>>>>>>>>> space.
>>>>>>>>>>>>
>>>>>>>>>>>> While we are here, this also fixes VFIO KVM device compiling to 
>>>>>>>>>>>> let it
>>>>>>>>>>>> link to a KVM module.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  Documentation/virtual/kvm/devices/vfio.txt |  21 +++++-
>>>>>>>>>>>>  arch/powerpc/kvm/Kconfig                   |   1 +
>>>>>>>>>>>>  arch/powerpc/kvm/Makefile                  |   5 +-
>>>>>>>>>>>>  arch/powerpc/kvm/powerpc.c                 |   1 +
>>>>>>>>>>>>  include/uapi/linux/kvm.h                   |   9 +++
>>>>>>>>>>>>  virt/kvm/vfio.c                            | 106
>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>  6 files changed, 140 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>> b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> index ef51740..c0d3eb7 100644
>>>>>>>>>>>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>>>>>>>>>>>> @@ -16,7 +16,24 @@ Groups:
>>>>>>>>>>>>
>>>>>>>>>>>>  KVM_DEV_VFIO_GROUP attributes:
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device 
>>>>>>>>>>>> tracking
>>>>>>>>>>>> +  kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +  for the VFIO group.
>>>>>>>>>>>
>>>>>>>>>>> AFAICT these changes are accurate for VFIO as it is already, in 
>>>>>>>>>>> which
>>>>>>>>>>> case it might be clearer to put them in a separate patch.
>>>>>>>>>>>
>>>>>>>>>>>>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device
>>>>>>>> tracking
>>>>>>>>>>>> +  kvm_device_attr.addr points to an int32_t file descriptor
>>>>>>>>>>>> +  for the VFIO group.
>>>>>>>>>>>>
>>>>>>>>>>>> -For each, kvm_device_attr.addr points to an int32_t file 
>>>>>>>>>>>> descriptor
>>>>>>>>>>>> -for the VFIO group.
>>>>>>>>>>>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO 
>>>>>>>>>>>> group
>>>>>>>>>>>> +  kvm_device_attr.addr points to a struct:
>>>>>>>>>>>> +          struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +                  __u32   argsz;
>>>>>>>>>>>> +                  __s32   fd;
>>>>>>>>>>>> +                  __u32   liobn;
>>>>>>>>>>>> +                  __u8    pad[4];
>>>>>>>>>>>> +                  __u64   start_addr;
>>>>>>>>>>>> +          };
>>>>>>>>>>>> +          where
>>>>>>>>>>>> +          @argsz is the size of kvm_vfio_spapr_tce_liobn;
>>>>>>>>>>>> +          @fd is a file descriptor for a VFIO group;
>>>>>>>>>>>> +          @liobn is a logical bus id to be associated with the 
>>>>>>>>>>>> group;
>>>>>>>>>>>> +          @start_addr is a DMA window offset on the IO (PCI) bus
>>>>>>>>>>>
>>>>>>>>>>> For the cause of DDW and multiple windows, I'm assuming you can call
>>>>>>>>>>> this multiple times with different LIOBNs and the same IOMMU group?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes. It is called twice per each group (when DDW is activated) - for 
>>>>>>>>>> 32bit
>>>>>>>>>> and 64bit windows, this is why @start_addr is there.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> index 1059846..dfa3488 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Kconfig
>>>>>>>>>>>> @@ -65,6 +65,7 @@ config KVM_BOOK3S_64
>>>>>>>>>>>>    select KVM
>>>>>>>>>>>>    select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
>>>>>>>>>>>>    select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
>>>>>>>>>>>> +  select KVM_VFIO if VFIO
>>>>>>>>>>>>    ---help---
>>>>>>>>>>>>      Support running unmodified book3s_64 and book3s_32 guest 
>>>>>>>>>>>> kernels
>>>>>>>>>>>>      in virtual machines on book3s_64 host processors.
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> index 7f7b6d8..71f577c 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/Makefile
>>>>>>>>>>>> @@ -8,7 +8,7 @@ ccflags-y := -Ivirt/kvm -Iarch/powerpc/kvm
>>>>>>>>>>>>  KVM := ../../../virt/kvm
>>>>>>>>>>>>
>>>>>>>>>>>>  common-objs-y = $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>>>>>>>>>>>> -          $(KVM)/eventfd.o $(KVM)/vfio.o
>>>>>>>>>>>> +          $(KVM)/eventfd.o
>>>>>>>>>>>
>>>>>>>>>>> Please don't disable the VFIO device for the non-book3s case.  I 
>>>>>>>>>>> added
>>>>>>>>>>> it (even though it didn't do anything until now) so that libvirt
>>>>>>>>>>> wouldn't choke when it finds it's not available.  Obviously the new
>>>>>>>>>>> ioctl needs to be only for the right IOMMU setup, but the device
>>>>>>>>>>> itself should be available always.
>>>>>>>>>>
>>>>>>>>>> Ah. Ok, I'll fix this. I just wanted to be able to compile kvm as a 
>>>>>>>>>> module.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>  CFLAGS_e500_mmu.o := -I.
>>>>>>>>>>>>  CFLAGS_e500_mmu_host.o := -I.
>>>>>>>>>>>> @@ -87,6 +87,9 @@ endif
>>>>>>>>>>>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>>>>>>>>>>>    book3s_xics.o
>>>>>>>>>>>>
>>>>>>>>>>>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>>>>>>>>>>>> +  $(KVM)/vfio.o \
>>>>>>>>>>>> +
>>>>>>>>>>>>  kvm-book3s_64-module-objs += \
>>>>>>>>>>>>    $(KVM)/kvm_main.o \
>>>>>>>>>>>>    $(KVM)/eventfd.o \
>>>>>>>>>>>> diff --git a/arch/powerpc/kvm/powerpc.c 
>>>>>>>>>>>> b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> index 19aa59b..63f188d 100644
>>>>>>>>>>>> --- a/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> +++ b/arch/powerpc/kvm/powerpc.c
>>>>>>>>>>>> @@ -521,6 +521,7 @@ int kvm_vm_ioctl_check_extension(struct kvm
>>>>>>>> *kvm, long ext)
>>>>>>>>>>>>  #ifdef CONFIG_PPC_BOOK3S_64
>>>>>>>>>>>>    case KVM_CAP_SPAPR_TCE:
>>>>>>>>>>>>    case KVM_CAP_SPAPR_TCE_64:
>>>>>>>>>>>> +  case KVM_CAP_SPAPR_TCE_VFIO:
>>>>>>>>>>>>    case KVM_CAP_PPC_ALLOC_HTAB:
>>>>>>>>>>>>    case KVM_CAP_PPC_RTAS:
>>>>>>>>>>>>    case KVM_CAP_PPC_FIXUP_HCALL:
>>>>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>>>>> index 080ffbf..f1abbea 100644
>>>>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>>>>> @@ -1056,6 +1056,7 @@ struct kvm_device_attr {
>>>>>>>>>>>>  #define  KVM_DEV_VFIO_GROUP                       1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_ADD                  1
>>>>>>>>>>>>  #define   KVM_DEV_VFIO_GROUP_DEL                  2
>>>>>>>>>>>> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN  3
>>>>>>>>>>>>
>>>>>>>>>>>>  enum kvm_device_type {
>>>>>>>>>>>>    KVM_DEV_TYPE_FSL_MPIC_20        = 1,
>>>>>>>>>>>> @@ -1075,6 +1076,14 @@ enum kvm_device_type {
>>>>>>>>>>>>    KVM_DEV_TYPE_MAX,
>>>>>>>>>>>>  };
>>>>>>>>>>>>
>>>>>>>>>>>> +struct kvm_vfio_spapr_tce_liobn {
>>>>>>>>>>>> +  __u32   argsz;
>>>>>>>>>>>> +  __s32   fd;
>>>>>>>>>>>> +  __u32   liobn;
>>>>>>>>>>>> +  __u8    pad[4];
>>>>>>>>>>>> +  __u64   start_addr;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +
>>>>>>>>>>>>  /*
>>>>>>>>>>>>   * ioctls for VM fds
>>>>>>>>>>>>   */
>>>>>>>>>>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>>>>>>>>>>>> index 1dd087d..87c771e 100644
>>>>>>>>>>>> --- a/virt/kvm/vfio.c
>>>>>>>>>>>> +++ b/virt/kvm/vfio.c
>>>>>>>>>>>> @@ -20,6 +20,10 @@
>>>>>>>>>>>>  #include <linux/vfio.h>
>>>>>>>>>>>>  #include "vfio.h"
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +#include <asm/kvm_ppc.h>
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  struct kvm_vfio_group {
>>>>>>>>>>>>    struct list_head node;
>>>>>>>>>>>>    struct vfio_group *vfio_group;
>>>>>>>>>>>> @@ -60,6 +64,22 @@ static void
>>>>>>>> kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
>>>>>>>>>>>>    symbol_put(vfio_group_put_external_user);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +static int kvm_vfio_external_user_iommu_id(struct vfio_group 
>>>>>>>>>>>> *vfio_group)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +  int (*fn)(struct vfio_group *);
>>>>>>>>>>>> +  int ret = -1;
>>>>>>>>>>>
>>>>>>>>>>> Should this be -ESOMETHING?
>>>>>>>>>>>
>>>>>>>>>>>> +  fn = symbol_get(vfio_external_user_iommu_id);
>>>>>>>>>>>> +  if (!fn)
>>>>>>>>>>>> +          return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +  ret = fn(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +  symbol_put(vfio_external_user_iommu_id);
>>>>>>>>>>>> +
>>>>>>>>>>>> +  return ret;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  static bool kvm_vfio_group_is_coherent(struct vfio_group 
>>>>>>>>>>>> *vfio_group)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    long (*fn)(struct vfio_group *, unsigned long);
>>>>>>>>>>>> @@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct
>>>>>>>> kvm_device *dev)
>>>>>>>>>>>>    mutex_unlock(&kv->lock);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
>>>>>>>>>>>> +          struct vfio_group *vfio_group)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't this go in the same patch that introduced the attach
>>>>>>>>>>> function?
>>>>>>>>>>
>>>>>>>>>> Having less patches which touch different maintainers areas is 
>>>>>>>>>> better. I
>>>>>>>>>> cannot avoid touching both PPC KVM and VFIO in this patch but I can 
>>>>>>>>>> in
>>>>>>>>>> "[PATCH kernel 6/9] KVM: PPC: Associate IOMMU group with guest view 
>>>>>>>>>> of TCE
>>>>>>>>>> table".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> +  int group_id;
>>>>>>>>>>>> +  struct iommu_group *grp;
>>>>>>>>>>>> +
>>>>>>>>>>>> +  group_id = kvm_vfio_external_user_iommu_id(vfio_group);
>>>>>>>>>>>> +  grp = iommu_group_get_by_id(group_id);
>>>>>>>>>>>> +  if (grp) {
>>>>>>>>>>>> +          kvm_spapr_tce_detach_iommu_group(kvm, grp);
>>>>>>>>>>>> +          iommu_group_put(grp);
>>>>>>>>>>>> +  }
>>>>>>>>>>>> +}
>>>>>>>>>>>> +#endif
>>>>>>>>>>>> +
>>>>>>>>>>>>  static int kvm_vfio_set_group(struct kvm_device *dev, long attr, 
>>>>>>>>>>>> u64 arg)
>>>>>>>>>>>>  {
>>>>>>>>>>>>    struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> @@ -186,6 +222,10 @@ static int kvm_vfio_set_group(struct 
>>>>>>>>>>>> kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>                            continue;
>>>>>>>>>>>>
>>>>>>>>>>>>                    list_del(&kvg->node);
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>
>>>>>>>>>>> Better to make a no-op version of the call than have to #ifdef at 
>>>>>>>>>>> the
>>>>>>>>>>> callsite.
>>>>>>>>>>
>>>>>>>>>> It is questionable. A x86 reader may decide that
>>>>>>>>>> KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN is implemented for x86 and get
>>>>>>>>>> confused.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +                  kvm_vfio_spapr_detach_iommu_group(dev->kvm,
>>>>>>>>>>>> +                                  kvg->vfio_group);
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>                    
>>>>>>>>>>>> kvm_vfio_group_put_external_user(kvg->vfio_group);
>>>>>>>>>>>>                    kfree(kvg);
>>>>>>>>>>>>                    ret = 0;
>>>>>>>>>>>> @@ -201,6 +241,69 @@ static int kvm_vfio_set_group(struct 
>>>>>>>>>>>> kvm_device
>>>>>>>> *dev, long attr, u64 arg)
>>>>>>>>>>>>            kvm_vfio_update_coherency(dev);
>>>>>>>>>>>>
>>>>>>>>>>>>            return ret;
>>>>>>>>>>>> +
>>>>>>>>>>>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>>>>>>>>>>> +  case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: {
>>>>>>>>>>>> +          struct kvm_vfio_spapr_tce_liobn param;
>>>>>>>>>>>> +          unsigned long minsz;
>>>>>>>>>>>> +          struct kvm_vfio *kv = dev->private;
>>>>>>>>>>>> +          struct vfio_group *vfio_group;
>>>>>>>>>>>> +          struct kvm_vfio_group *kvg;
>>>>>>>>>>>> +          struct fd f;
>>>>>>>>>>>> +
>>>>>>>>>>>> +          minsz = offsetofend(struct kvm_vfio_spapr_tce_liobn,
>>>>>>>>>>>> +                          start_addr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +          if (copy_from_user(&param, (void __user *)arg, minsz))
>>>>>>>>>>>> +                  return -EFAULT;
>>>>>>>>>>>> +
>>>>>>>>>>>> +          if (param.argsz < minsz)
>>>>>>>>>>>> +                  return -EINVAL;
>>>>>>>>>>>> +
>>>>>>>>>>>> +          f = fdget(param.fd);
>>>>>>>>>>>> +          if (!f.file)
>>>>>>>>>>>> +                  return -EBADF;
>>>>>>>>>>>> +
>>>>>>>>>>>> +          vfio_group = kvm_vfio_group_get_external_user(f.file);
>>>>>>>>>>>> +          fdput(f);
>>>>>>>>>>>> +
>>>>>>>>>>>> +          if (IS_ERR(vfio_group))
>>>>>>>>>>>> +                  return PTR_ERR(vfio_group);
>>>>>>>>>>>> +
>>>>>>>>>>>> +          ret = -ENOENT;
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't there be some runtime test for the type of the IOMMU?  
>>>>>>>>>>> It's
>>>>>>>>>>> possible a kernel could be built for a platform supporting multiple
>>>>>>>>>>> IOMMU types.
>>>>>>>>>>
>>>>>>>>>> Well, may make sense but I do not know to test that. The IOMMU type 
>>>>>>>>>> is a
>>>>>>>>>> VFIO container property, not a group property and here (KVM) we only 
>>>>>>>>>> have
>>>>>>>>>> groups.
>>>>>>>>>
>>>>>>>>> Which, as mentioned previously, is broken.
>>>>>>>>
>>>>>>>> Which I am failing to follow you on this.
>>>>>>>>
>>>>>>>> What I am trying to achieve here is pretty much referencing a group so 
>>>>>>>> it
>>>>>>>> cannot be reused. Plus LIOBNs.
>>>>>>>
>>>>>>> "Plus LIOBNs" is not a trivial change.  You are establishing a linkage
>>>>>> >from LIOBNs to groups.  But that doesn't make sense; if mapping in one
>>>>>>> (guest) LIOBN affects a group it must affect all groups in the
>>>>>>> container.  i.e. LIOBN->container is the natural mapping, *not* LIOBN
>>>>>>> to group.
>>>>>>
>>>>>> I can see your point but i don't see how to proceed now, I'm totally 
>>>>>> stuck.
>>>>>> Pass container fd and then implement new api to lock containers somehow 
>>>>>> and
>>>>>
>>>>> I'm not really understanding what the question is about locking 
>>>>> containers.
>>>>>
>>>>>> enumerate groups when updating TCE table (including real mode)?
>>>>>
>>>>> Why do you need to enumerate groups?  The groups within the container
>>>>> share a TCE table, so can't you just update that once?
>>>>
>>>> Well, they share a TCE table but they do not share TCE Kill (TCE cache
>>>> invalidate) register address, it is still per PE but this does not matter
>>>> here (pnv_pci_link_table_and_group() does that), just mentioned to complete
>>>> the picture.
>>>
>>> True, you'll need to enumerate the groups for invalidates.  But you
>>> need that already, right.
>>>
>>>>>> Plus new API when we remove a group from a container as the result of 
>>>>>> guest
>>>>>> PCI hot unplug?
>>>>>
>>>>> I assume you mean a kernel internal API, since it shouldn't need
>>>>> anything else visible to userspace.  Won't this happen naturally?
>>>>> When the group is removed from the container, it will get its own TCE
>>>>> table instead of the previously shared one.
>>>>>
>>>>>>>> Passing a container fd does not make much
>>>>>>>> sense here as the VFIO device would walk through groups, reference 
>>>>>>>> them and
>>>>>>>> that is it, there is no locking on VFIO containters and so far there 
>>>>>>>> was no
>>>>>>>> need to teach KVM about containers.
>>>>>>>>
>>>>>>>> What do I miss now?
>>>>>>>
>>>>>>> Referencing the groups is essentially just a useful side effect.  The
>>>>>>> important functionality is informing VFIO of the guest LIOBNs; and
>>>>>>> LIOBNs map to containers, not groups.
>>>>>>
>>>>>> No. One liobn maps to one KVM-allocated TCE table, not a container. There
>>>>>> can be one or many or none containers per liobn.
>>>>>
>>>>> Ah, true.
>>>>
>>>> So I need to add new kernel API for KVM to get table(s) from VFIO
>>>> container(s). And invent some locking mechanism to prevent table(s) (or
>>>> associated container(s)) from going away, like
>>>> vfio_group_get_external_user/vfio_group_put_external_user but for
>>>> containers. Correct?
>>>
>>> Well, the container is attached to an fd, so if you get a reference on
>>> the file* that should do it.
>>
>> I am still trying to think of how to implement this suggestion.
>>
>> I need a way to tell KVM about IOMMU groups. vfio-pci driver is not right
>> interface as it knows nothing about KVM. There is VFIO-KVM device but it
>> does not have idea about containers.
>>
>> So I have to:
>>
>> Wenever a container is created or removed, notify the VFIO-KVM device by
>> passing there a container fd. ok.
> 
> Actually, I don't think the vfio-kvm device is really useful here.  It
> was designed as a hack for a particular problem on x86.  It certainly
> could be extended to cover the information we need here, but I don't
> think it's a particularly natural way of doing so.
> 
> The problem is that conveying the information from the vfio-kvm device
> to the real mode H_PUT_TCE handler, which is what really needs it,
> isn't particularly simpler than conveying that information from
> anywhere else.
> 
>> Then VFIO-KVM device needs to tell KVM about what iommu_table belongs to
>> what LIOBN so the realmode handlers could do the job. The real mode TCE
>> handlers get LIOBN, find a guest view table and update it. Now I want to
>> update the hardware table which is iommu_table attached to LIOBN.
>>
>> I did pass an IOMMU group fd to VFIO-KVM device. You suggested a container 
>> fd.
>>
>> Now VFIO-KVM device needs to extract iommu_table's from the container.
>> These iommu_table pointers are stored in "struct tce_container" which is
>> local to drivers/vfio/vfio_iommu_spapr_tce.c and not exported anyhow. So I
>> cannot export and use that.
>>
>> The other way to go would be adding API to VFIO to enumerate IOMMU groups
>> in a container and use iommu_table pointers stored in iommu_table_group of
>> each group (in fact the very first group will be enough as multiple groups
>> in a container share the table). Adding vfio_container_get_groups() when
>> only first one is needed is quite tricky in terms of maintainers approvals.
>>
>> So what would be the right course of action? Thanks.
> 
> So, from the user side, you need to be able to bind a VFIO backend to
> a particular guest IOMMU.  This suggests a new ioctl() used in
> conjunction with KVM_CREATE_SPAPR_TCE.  Let's call it
> KVM_SPAPR_TCE_BIND_VFIO.  You'd use KVM_CREATE_SPAPR_TCE to make the
> kernel aware of a LIOBN in the first place, then use
> KVM_SPAPR_TCE_BIND_VFIO to associate a VFIO container with that LIOBN.
> So it would be a VM ioctl, taking a LIOBN and a container fd.  You'd
> need a capability to go with it, and some way to unbind as well.


This is what I had in the first place some years ago. And after 5-6 reviews
I was told that there is a VFIO KVM and I should use it.


> To implement that, the ioctl() would need to use a new vfio (kernel
> internal) interface - which can be specific to only the spapr TCE
> type.  That would take the container fd, and return the list of
> iommu_tables in some form or other (or various error conditions,
> obviously).
> 
> So, when qemu creates the PHB, it uses KVM_CREATE_SPAPR_TCE to inform
> the kernel of the LIOBN.  When the VFIO device is attached to the PHB,
> it uses KVM_SPAPR_TCE_BIND_VFIO to connect the VFIO container to the
> LIOBN.  The ioctl() implementation uses the new special interface into
> the spapr_tce vfio backend to get the list of iommu tables, which it
> stores in some private format.

Getting just a list of IOMMU groups would do too. Pushing such API is a
problem, this is how I ended up with the current design.


> The H_PUT_TCE implementation uses that
> stored list of iommu tables to translate H_PUT_TCEs from the guest
> into actions on the host IOMMU tables.
> 
> And, yes, the special interface to the spapr TCE vfio back end is kind
> of a hack.  That's what you get when you need to link to separate
> kernel subsystems for performance reasons.

One can argue if it is a hack, how is this hack better that the existing
approach? :)

Alex, could you please comment on David's suggestion? Thanks!


-- 
Alexey

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Reply via email to