On Tue, Mar 19, 2019 at 04:47:20PM +0100, Cédric Le Goater wrote:
> On 3/19/19 5:54 AM, David Gibson wrote:
> > On Mon, Mar 18, 2019 at 03:12:10PM +0100, Cédric Le Goater wrote:
> >> On 3/18/19 4:23 AM, David Gibson wrote:
> >>> On Fri, Mar 15, 2019 at 01:05:58PM +0100, Cédric Le Goater wrote:
> >>>> These controls will be used by the H_INT_SET_QUEUE_CONFIG and
> >>>> H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
> >>>> Event Queue in the XIVE IC. They will also be used to restore the
> >>>> configuration of the XIVE EQs and to capture the internal run-time
> >>>> state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
> >>>> the EQ toggle bit and EQ index which are updated by the XIVE IC when
> >>>> event notifications are enqueued in the EQ.
> >>>>
> >>>> The value of the guest physical address of the event queue is saved in
> >>>> the XIVE internal xive_q structure for later use. That is when
> >>>> migration needs to mark the EQ pages dirty to capture a consistent
> >>>> memory state of the VM.
> >>>>
> >>>> To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
> >>>> OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
> >>>> but restoring the EQ state will.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
> >>>> ---
> >>>>
> >>>>  Changes since v2 :
> >>>>  
> >>>>  - fixed comments on the KVM device attribute definitions
> >>>>  - fixed check on supported EQ size to restrict to 64K pages
> >>>>  - checked kvm_eq.flags that need to be zero
> >>>>  - removed the OPAL call when EQ qtoggle bit and index are zero. 
> >>>>
> >>>>  arch/powerpc/include/asm/xive.h            |   2 +
> >>>>  arch/powerpc/include/uapi/asm/kvm.h        |  21 ++
> >>>>  arch/powerpc/kvm/book3s_xive.h             |   2 +
> >>>>  arch/powerpc/kvm/book3s_xive.c             |  15 +-
> >>>>  arch/powerpc/kvm/book3s_xive_native.c      | 232 +++++++++++++++++++++
> >>>>  Documentation/virtual/kvm/devices/xive.txt |  31 +++
> >>>>  6 files changed, 297 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/include/asm/xive.h 
> >>>> b/arch/powerpc/include/asm/xive.h
> >>>> index b579a943407b..46891f321606 100644
> >>>> --- a/arch/powerpc/include/asm/xive.h
> >>>> +++ b/arch/powerpc/include/asm/xive.h
> >>>> @@ -73,6 +73,8 @@ struct xive_q {
> >>>>          u32                     esc_irq;
> >>>>          atomic_t                count;
> >>>>          atomic_t                pending_count;
> >>>> +        u64                     guest_qpage;
> >>>> +        u32                     guest_qsize;
> >>>>  };
> >>>>  
> >>>>  /* Global enable flags for the XIVE support */
> >>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> >>>> b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> index 12bb01baf0ae..1cd728c87d7c 100644
> >>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> @@ -679,6 +679,7 @@ struct kvm_ppc_cpu_char {
> >>>>  #define KVM_DEV_XIVE_GRP_CTRL           1
> >>>>  #define KVM_DEV_XIVE_GRP_SOURCE         2       /* 64-bit source 
> >>>> identifier */
> >>>>  #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG  3       /* 64-bit source 
> >>>> identifier */
> >>>> +#define KVM_DEV_XIVE_GRP_EQ_CONFIG      4       /* 64-bit EQ identifier 
> >>>> */
> >>>>  
> >>>>  /* Layout of 64-bit XIVE source attribute values */
> >>>>  #define KVM_XIVE_LEVEL_SENSITIVE        (1ULL << 0)
> >>>> @@ -694,4 +695,24 @@ struct kvm_ppc_cpu_char {
> >>>>  #define KVM_XIVE_SOURCE_EISN_SHIFT      33
> >>>>  #define KVM_XIVE_SOURCE_EISN_MASK       0xfffffffe00000000ULL
> >>>>  
> >>>> +/* Layout of 64-bit EQ identifier */
> >>>> +#define KVM_XIVE_EQ_PRIORITY_SHIFT      0
> >>>> +#define KVM_XIVE_EQ_PRIORITY_MASK       0x7
> >>>> +#define KVM_XIVE_EQ_SERVER_SHIFT        3
> >>>> +#define KVM_XIVE_EQ_SERVER_MASK         0xfffffff8ULL
> >>>> +
> >>>> +/* Layout of EQ configuration values (64 bytes) */
> >>>> +struct kvm_ppc_xive_eq {
> >>>> +        __u32 flags;
> >>>> +        __u32 qsize;
> >>>> +        __u64 qpage;
> >>>> +        __u32 qtoggle;
> >>>> +        __u32 qindex;
> >>>> +        __u8  pad[40];
> >>>> +};
> >>>> +
> >>>> +#define KVM_XIVE_EQ_FLAG_ENABLED        0x00000001
> >>>> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY  0x00000002
> >>>> +#define KVM_XIVE_EQ_FLAG_ESCALATE       0x00000004
> >>>> +
> >>>>  #endif /* __LINUX_KVM_POWERPC_H */
> >>>> diff --git a/arch/powerpc/kvm/book3s_xive.h 
> >>>> b/arch/powerpc/kvm/book3s_xive.h
> >>>> index ae26fe653d98..622f594d93e1 100644
> >>>> --- a/arch/powerpc/kvm/book3s_xive.h
> >>>> +++ b/arch/powerpc/kvm/book3s_xive.h
> >>>> @@ -272,6 +272,8 @@ struct kvmppc_xive_src_block 
> >>>> *kvmppc_xive_create_src_block(
> >>>>          struct kvmppc_xive *xive, int irq);
> >>>>  void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
> >>>>  int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
> >>>> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
> >>>> +                                  bool single_escalation);
> >>>>  
> >>>>  #endif /* CONFIG_KVM_XICS */
> >>>>  #endif /* _KVM_PPC_BOOK3S_XICS_H */
> >>>> diff --git a/arch/powerpc/kvm/book3s_xive.c 
> >>>> b/arch/powerpc/kvm/book3s_xive.c
> >>>> index e09f3addffe5..c1b7aa7dbc28 100644
> >>>> --- a/arch/powerpc/kvm/book3s_xive.c
> >>>> +++ b/arch/powerpc/kvm/book3s_xive.c
> >>>> @@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
> >>>>          return IRQ_HANDLED;
> >>>>  }
> >>>>  
> >>>> -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
> >>>> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
> >>>> +                                  bool single_escalation)
> >>>>  {
> >>>>          struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >>>>          struct xive_q *q = &xc->queues[prio];
> >>>> @@ -185,7 +186,7 @@ static int xive_attach_escalation(struct kvm_vcpu 
> >>>> *vcpu, u8 prio)
> >>>>                  return -EIO;
> >>>>          }
> >>>>  
> >>>> -        if (xc->xive->single_escalation)
> >>>> +        if (single_escalation)
> >>>>                  name = kasprintf(GFP_KERNEL, "kvm-%d-%d",
> >>>>                                   vcpu->kvm->arch.lpid, xc->server_num);
> >>>>          else
> >>>> @@ -217,7 +218,7 @@ static int xive_attach_escalation(struct kvm_vcpu 
> >>>> *vcpu, u8 prio)
> >>>>           * interrupt, thus leaving it effectively masked after
> >>>>           * it fires once.
> >>>>           */
> >>>> -        if (xc->xive->single_escalation) {
> >>>> +        if (single_escalation) {
> >>>>                  struct irq_data *d = 
> >>>> irq_get_irq_data(xc->esc_virq[prio]);
> >>>>                  struct xive_irq_data *xd = 
> >>>> irq_data_get_irq_handler_data(d);
> >>>>  
> >>>> @@ -291,7 +292,8 @@ static int xive_check_provisioning(struct kvm *kvm, 
> >>>> u8 prio)
> >>>>                          continue;
> >>>>                  rc = xive_provision_queue(vcpu, prio);
> >>>>                  if (rc == 0 && !xive->single_escalation)
> >>>> -                        xive_attach_escalation(vcpu, prio);
> >>>> +                        kvmppc_xive_attach_escalation(vcpu, prio,
> >>>> +                                                      
> >>>> xive->single_escalation);
> >>>>                  if (rc)
> >>>>                          return rc;
> >>>>          }
> >>>> @@ -1214,7 +1216,8 @@ int kvmppc_xive_connect_vcpu(struct kvm_device 
> >>>> *dev,
> >>>>                  if (xive->qmap & (1 << i)) {
> >>>>                          r = xive_provision_queue(vcpu, i);
> >>>>                          if (r == 0 && !xive->single_escalation)
> >>>> -                                xive_attach_escalation(vcpu, i);
> >>>> +                                kvmppc_xive_attach_escalation(
> >>>> +                                        vcpu, i, 
> >>>> xive->single_escalation);
> >>>>                          if (r)
> >>>>                                  goto bail;
> >>>>                  } else {
> >>>> @@ -1229,7 +1232,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device 
> >>>> *dev,
> >>>>          }
> >>>>  
> >>>>          /* If not done above, attach priority 0 escalation */
> >>>> -        r = xive_attach_escalation(vcpu, 0);
> >>>> +        r = kvmppc_xive_attach_escalation(vcpu, 0, 
> >>>> xive->single_escalation);
> >>>>          if (r)
> >>>>                  goto bail;
> >>>>  
> >>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
> >>>> b/arch/powerpc/kvm/book3s_xive_native.c
> >>>> index b841d339f674..42e824658a30 100644
> >>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >>>> @@ -340,6 +340,226 @@ static int 
> >>>> kvmppc_xive_native_set_source_config(struct kvmppc_xive *xive,
> >>>>                                                         priority, 
> >>>> masked, eisn);
> >>>>  }
> >>>>  
> >>>> +static int xive_native_validate_queue_size(u32 qsize)
> >>>> +{
> >>>> +        /*
> >>>> +         * We only support 64K pages for the moment. This is also
> >>>> +         * advertised in the DT property "ibm,xive-eq-sizes"
> >>>
> >>> IIUC, that won't work properly if you had a guest using 4kiB pages.
> >>
> >>> That's fine, but do we have somewhere that checks for that case and
> >>> throws a suitable error?
> >>
> >> Not in the device. 
> >>
> >> So, we should check the current page_size of the guest ? Is there a way 
> >> to do that simply from KVM ?
> > 
> > Not really.  But I think I know where to make the necessary test, see
> > comment below..
> > 
> >>>> +         */
> >>>> +        switch (qsize) {
> >>>> +        case 0: /* EQ reset */
> >>>> +        case 16:
> >>>> +                return 0;
> >>>> +        case 12:
> >>>> +        case 21:
> >>>> +        case 24:
> >>>> +        default:
> >>>> +                return -EINVAL;
> >>>> +        }
> >>>> +}
> >>>> +
> >>>> +static int kvmppc_xive_native_set_queue_config(struct kvmppc_xive *xive,
> >>>> +                                               long eq_idx, u64 addr)
> >>>> +{
> >>>> +        struct kvm *kvm = xive->kvm;
> >>>> +        struct kvm_vcpu *vcpu;
> >>>> +        struct kvmppc_xive_vcpu *xc;
> >>>> +        void __user *ubufp = (u64 __user *) addr;
> >>>
> >>> Nit: that should be (void __user *) on the right, shouldn't it?
> >>
> >> yes.
> >>
> >>>
> >>>> +        u32 server;
> >>>> +        u8 priority;
> >>>> +        struct kvm_ppc_xive_eq kvm_eq;
> >>>> +        int rc;
> >>>> +        __be32 *qaddr = 0;
> >>>> +        struct page *page;
> >>>> +        struct xive_q *q;
> >>>> +
> >>>> +        /*
> >>>> +         * Demangle priority/server tuple from the EQ identifier
> >>>> +         */
> >>>> +        priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
> >>>> +                KVM_XIVE_EQ_PRIORITY_SHIFT;
> >>>> +        server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
> >>>> +                KVM_XIVE_EQ_SERVER_SHIFT;
> >>>> +
> >>>> +        if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq)))
> >>>> +                return -EFAULT;
> >>>> +
> >>>> +        vcpu = kvmppc_xive_find_server(kvm, server);
> >>>> +        if (!vcpu) {
> >>>> +                pr_err("Can't find server %d\n", server);
> >>>> +                return -ENOENT;
> >>>> +        }
> >>>> +        xc = vcpu->arch.xive_vcpu;
> >>>> +
> >>>> +        if (priority != xive_prio_from_guest(priority)) {
> >>>> +                pr_err("Trying to restore invalid queue %d for VCPU 
> >>>> %d\n",
> >>>> +                       priority, server);
> >>>> +                return -EINVAL;
> >>>> +        }
> >>>> +        q = &xc->queues[priority];
> >>>> +
> >>>> +        pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d 
> >>>> idx:%d\n",
> >>>> +                 __func__, server, priority, kvm_eq.flags,
> >>>> +                 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, 
> >>>> kvm_eq.qindex);
> >>>> +
> >>>> +        /*
> >>>> +         * We can not tune the EQ configuration from user space. All
> >>>> +         * is done in OPAL.
> >>>> +         */
> >>>> +        if (kvm_eq.flags != 0) {
> >>>> +                pr_err("invalid flags %d\n", kvm_eq.flags);
> >>>> +                return -EINVAL;
> >>>> +        }
> >>>> +
> >>>> +        rc = xive_native_validate_queue_size(kvm_eq.qsize);
> >>>> +        if (rc) {
> >>>> +                pr_err("invalid queue size %d\n", kvm_eq.qsize);
> >>>> +                return rc;
> >>>> +        }
> >>>> +
> >>>> +        /* reset queue and disable queueing */
> >>>> +        if (!kvm_eq.qsize) {
> >>>> +                q->guest_qpage = 0;
> >>>> +                q->guest_qsize = 0;
> >>>> +
> >>>> +                rc = xive_native_configure_queue(xc->vp_id, q, priority,
> >>>> +                                                 NULL, 0, true);
> >>>> +                if (rc) {
> >>>> +                        pr_err("Failed to reset queue %d for VCPU %d: 
> >>>> %d\n",
> >>>> +                               priority, xc->server_num, rc);
> >>>> +                        return rc;
> >>>> +                }
> >>>> +
> >>>> +                if (q->qpage) {
> >>>> +                        put_page(virt_to_page(q->qpage));
> >>>> +                        q->qpage = NULL;
> >>>> +                }
> >>>> +
> >>>> +                return 0;
> >>>> +        }
> >>>> +
> >>>> +
> >>>> +        page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage));
> >>>> +        if (is_error_page(page)) {
> >>>> +                pr_warn("Couldn't get guest page for %llx!\n", 
> >>>> kvm_eq.qpage);
> >>>> +                return -EINVAL;
> >>>> +        }
> >>>
> >>> Yeah.. for the case of a 4kiB page host (these days weird, but not
> >>> actually prohibited, AFAIK) you need to check that the qsize selected
> >>> actually fits within the page.
> >>
> >> Ah yes. sure.
> > 
> > I think the pagesize test belongs here.  Rather than thinking about
> > the pagesize of the guest overall, you can check that this specific
> > page (possibly compound) is large enough to take the requested queue
> > size.
> 
> OK. It think kvm_host_page_size() is what we need. It returns the page
> size of the underlying VMA of the memblock holding the gfn. So I am going 
> to add :

Yes, that sounds good.

> +     page_size = kvm_host_page_size(kvm, gfn);
> +     if (1ull << kvm_eq.qshift > page_size) {
> +             pr_warn("Incompatible host page size %lx!\n", page_size);
> +             return -EINVAL;
> +     }
> +
> 
> Also I am renaming 'qsize' in 'qshift' and renaming 'qpage' to 'qaddr'.
> 
> > That should be enough to protect the host - it ensures that userspace
> > owns a suitable contiguous chunk of memory for the XIVE to write the
> > queue into.
> > 
> > It's possible there are weirder edge cases with a large page that's
> > not fully mapped into the guest - if necessary we can add tests for
> > that on the qemu side.
> > 
> > Oh.. it occurs to me that we might need to pin the queue page to make
> > sure it doesn't get swapped out or page-migrated while the XIVE holds
> > a pointer to it
> 
> That is what gfn_to_page() ends up doing by calling hva_to_pfn().

Ah, ok.

> >>>> +        /*
> >>>> +         * Return some information on the EQ configuration in
> >>>> +         * OPAL. This is purely informative for now as we can't really
> >>>> +         * tune the EQ configuration from user space.
> >>>> +         */
> >>>> +        kvm_eq.flags = 0;
> >>>> +        if (qflags & OPAL_XIVE_EQ_ENABLED)
> >>>> +                kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ENABLED;
> >>>> +        if (qflags & OPAL_XIVE_EQ_ALWAYS_NOTIFY)
> >>>> +                kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
> >>>> +        if (qflags & OPAL_XIVE_EQ_ESCALATE)
> >>>> +                kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ESCALATE;
> >>>
> >>> If there's not really anything it can do about it, does it make sense
> >>> to even expose this info to userspace?
> >>
> >> Hmm, good question. 
> >>
> >>  - KVM_XIVE_EQ_FLAG_ENABLED                
> >>    may be uselessly obvious.
> > 
> > What's it controlled by?
> 
> OPAL only. It's equivalent to the VALID bit in the XIVE END structure. 
> We can drop this one. 

Ok.

> >>  - KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY  
> >>    means we do not use the END ESBs to coalesce the events at the END
> >>    level. This flag is reflected by the XIVE_EQ_ALWAYS_NOTIFY option
> >>    in the sPAPR specs. We don't support the END ESBs but we might one
> >>    day.
> > 
> > Since the guest isn't currently permitted to set this, it should never
> > be set here either, no?
> 
> The OS does not support the END ESBs so this flag is always set in the 
> Linux XIVE driver. END ESBs are supported in the emulated device but not 
> in KVM/OPAL.

Ok, it's reasonable to keep this then.

> >>  - KVM_XIVE_EQ_FLAG_ESCALATE
> >>    means the EQ is an escalation. QEMU doesn't really care for now 
> >>    but it's an important information I think.
> > 
> > Likewise this one, yes?
> 
> That is an hypervisor information on the nature of the EQ, so it is 
> for QEMU and not the guest. I am not sure it is important today as
> we don't support escalation in QEMU. May be drop ? 

Yes, I think drop it.  We can add something later if we have a
concrete use case for it.

> >> I tried not to add too many of the END flags, only the relevant ones which
> >> could have an impact in the future modeling.  
> >>
> >> I think KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY is important. I was setting it from
> >> QEMU in the hcall but as OPAL does the same blindly I removed it in
> >> v3.
> > 
> > So, I might have misinterpreted this a bit the first time around.  Am
> > I correct in thinking that these bits all correspond to defined
> > options in the PAPR hcall
> 
> Yes for KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY which is the only flag defined 
> in sPAPR. 
> 
> > - but that for now we don't allow guests to
> > set them (because we haven't implemented support so far).
> 
> It's a bit more complex.
> 
> KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY is a "required" flag as the OS does not 
> support END ESBs. END ESBs are not supported in KVM/OPAL but they are 
> in the QEMU device. 

So, the (current) guest needs this behaviour, but I'm guessing PAPR
doesn't require it.  Is this behaviour configurable in the PAPR interface?

> I think it would be good for consistency to set KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY
> when calling kvmppc_xive_native_get_queue_config() from QEMU, as I did 
> in v2. But as OPAL forces the same behavior without any flag, it is not 
> really needed at the KVM level ...
> 
> Please tell me.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature

Reply via email to