Re: [PATCH v3 02/12] KVM: define common __KVM_GUESTDBG_USE_SW/HW_BP values

2015-05-08 Thread Paolo Bonzini


On 08/05/2015 11:23, Christoffer Dall wrote:
> On Wed, May 06, 2015 at 05:23:17PM +0100, Alex Bennée wrote:
>> Currently x86, powerpc and soon arm64 use the same two architecture
>> specific bits for guest debug support for software and hardware
>> breakpoints. This makes the shared values explicit while leaving the
>> gate open for another architecture to use some other value if they
>> really really want to.
>>
>> Signed-off-by: Alex Bennée 
>> Reviewed-by: Andrew Jones 
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
>> b/arch/powerpc/include/uapi/asm/kvm.h
>> index ab4d473..1731569 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -310,8 +310,8 @@ struct kvm_guest_debug_arch {
>>   * and upper 16 bits are architecture specific. Architecture specific 
>> defines
>>   * that ioctl is for setting hardware breakpoint or software breakpoint.
>>   */
>> -#define KVM_GUESTDBG_USE_SW_BP  0x0001
>> -#define KVM_GUESTDBG_USE_HW_BP  0x0002
>> +#define KVM_GUESTDBG_USE_SW_BP  __KVM_GUESTDBG_USE_SW_BP
>> +#define KVM_GUESTDBG_USE_HW_BP  __KVM_GUESTDBG_USE_HW_BP
>>  
>>  /* definition of registers in kvm_run */
>>  struct kvm_sync_regs {
>> diff --git a/arch/x86/include/uapi/asm/kvm.h 
>> b/arch/x86/include/uapi/asm/kvm.h
>> index d7dcef5..1438202 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -250,8 +250,8 @@ struct kvm_debug_exit_arch {
>>  __u64 dr7;
>>  };
>>  
>> -#define KVM_GUESTDBG_USE_SW_BP  0x0001
>> -#define KVM_GUESTDBG_USE_HW_BP  0x0002
>> +#define KVM_GUESTDBG_USE_SW_BP  __KVM_GUESTDBG_USE_SW_BP
>> +#define KVM_GUESTDBG_USE_HW_BP  __KVM_GUESTDBG_USE_HW_BP
>>  #define KVM_GUESTDBG_INJECT_DB  0x0004
>>  #define KVM_GUESTDBG_INJECT_BP  0x0008
>>  
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 70ac641..3b6252e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -570,8 +570,16 @@ struct kvm_s390_irq_state {
>>  
>>  /* for KVM_SET_GUEST_DEBUG */
>>  
>> -#define KVM_GUESTDBG_ENABLE 0x0001
>> -#define KVM_GUESTDBG_SINGLESTEP 0x0002
>> +#define KVM_GUESTDBG_ENABLE (1 << 0)
>> +#define KVM_GUESTDBG_SINGLESTEP (1 << 1)
>> +
>> +/*
>> + * Architecture specific stuff uses the top 16 bits of the field,
> 
> s/stuff//
> 
>> + * however there is some shared commonality for the common cases
>> + */
>> +#define __KVM_GUESTDBG_USE_SW_BP   (1 << 16)
>> +#define __KVM_GUESTDBG_USE_HW_BP   (1 << 17)
>> +
>>  
>>  struct kvm_guest_debug {
>>  __u32 control;
> 
> We sort of left this discussion hanging with me expressing slight
> concern about the usefulness of these defines.
> 
> Paolo, what are your thoughts?

I would just lift these two KVM_GUESTDBG_* defines to
include/uapi/linux/kvm.h and say that architecture specific stuff uses
the top 14 bits of the field. :)

Paolo
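A minimal sketch of what that lifted definition could look like (illustrative only, not the patch that was eventually merged):

    /* include/uapi/linux/kvm.h -- sketch of the suggestion above: the two
     * common flags become ordinary defines in the common header, and only
     * bits 18..31 remain architecture specific. */
    #define KVM_GUESTDBG_ENABLE        (1 << 0)
    #define KVM_GUESTDBG_SINGLESTEP    (1 << 1)

    #define KVM_GUESTDBG_USE_SW_BP     (1 << 16)
    #define KVM_GUESTDBG_USE_HW_BP     (1 << 17)
    /* bits 18..31 are architecture specific */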

Re: [PATCH] powerpc/kvm/cma: Fix panic introduces by signed shift operation

2014-09-03 Thread Paolo Bonzini
Il 02/09/2014 18:13, Laurent Dufour ha scritto:
> fc95ca7284bc54953165cba76c3228bd2cdb9591 introduces a memset in
> kvmppc_alloc_hpt since the general CMA doesn't clear the memory it
> allocates.
> 
> However, the size argument passed to memset is computed from a signed value
> and its sign bit is extended by the cast the compiler is doing. This leads
> to an extremely large size value when dealing with an order value >= 31, and
> almost all the memory following the allocated space is cleared. As a
> consequence, the system panics and may even fail to spawn the kdump
> kernel.
> 
> This fix makes use of an unsigned value for the memset's size argument to
> avoid sign extension. Along with this fix, another shift operation which may
> also lead to a sign-extended value is fixed.
> 
> Cc: Alexey Kardashevskiy 
> Cc: Paul Mackerras 
> Cc: Alexander Graf 
> Cc: Aneesh Kumar K.V 
> Cc: Joonsoo Kim 
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 72c20bb16d26..79294c4c5015 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -62,10 +62,10 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
>   }
>  
>   kvm->arch.hpt_cma_alloc = 0;
> - page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
> + page = kvm_alloc_hpt(1ul << (order - PAGE_SHIFT));
>   if (page) {
>   hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
> - memset((void *)hpt, 0, (1 << order));
> + memset((void *)hpt, 0, (1ul << order));
>   kvm->arch.hpt_cma_alloc = 1;
>   }
>  
> 

Thanks, applied to kvm/master.

Paolo
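To illustrate the sign-extension problem described in the patch above, here is a small stand-alone example (illustrative only, not part of the patch):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            int order = 31;

            /* In the buggy code, (1 << order) is evaluated as a signed int.
             * For order == 31 that is INT_MIN in practice (the shift itself
             * overflows, which is undefined in ISO C, hence the fix).
             * Converting the negative int to memset's unsigned size_t
             * argument sign-extends it into an enormous length on 64-bit. */
            size_t bad  = (size_t)INT_MIN;   /* 0xffffffff80000000 */
            size_t good = 1ul << order;      /* 0x0000000080000000, as intended */

            printf("bad  = %#zx\n", bad);
            printf("good = %#zx\n", good);
            return 0;
    }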

Re: [PATCH] KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode

2014-09-29 Thread Paolo Bonzini
Il 29/09/2014 10:28, Alexander Graf ha scritto:
> 
> 
> On 29.09.14 10:02, Aneesh Kumar K.V wrote:
>> We use cma reserved area for creating guest hash page table.
>> Don't do the reservation in non-hypervisor mode. This avoids unnecessary
>> CMA reservation when booting with limited memory configs like
>> fadump and kdump.
>>
>> Signed-off-by: Aneesh Kumar K.V 
> 
> Thanks, applied to kvm-ppc-queue.

Would you like this in 3.18?

Paolo


Re: [PATCH] KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode

2014-09-29 Thread Paolo Bonzini
Il 29/09/2014 13:57, Alexander Graf ha scritto:
> 
> 
> On 29.09.14 13:48, Paolo Bonzini wrote:
>> Il 29/09/2014 10:28, Alexander Graf ha scritto:
>>>
>>>
>>> On 29.09.14 10:02, Aneesh Kumar K.V wrote:
>>>> We use cma reserved area for creating guest hash page table.
>>>> Don't do the reservation in non-hypervisor mode. This avoids unnecessary
>>>> CMA reservation when booting with limited memory configs like
>>>> fadump and kdump.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V 
>>>
>>> Thanks, applied to kvm-ppc-queue.
>>
>> Would you like this in 3.18?
> 
> Yes, can you please directly apply it with my SoB (or Reviewed-by if you
> prefer)?

Ok, will do.

Paolo


Re: lockdep warning with 2d65a9f48fcdf7866aab6457bc707ca233e0c791

2014-10-28 Thread Paolo Bonzini


On 10/15/2014 07:28 PM, Aneesh Kumar K.V wrote:
> 
> =
> [ INFO: possible recursive locking detected ]
> 3.17.0+ #31 Not tainted
> -
> qemu-system-ppc/9112 is trying to acquire lock:
>  (&(&vcpu->arch.tbacct_lock)->rlock){..}, at: [] 
> .vcore_stolen_time+0x44/0xb0 [kvm_hv]
> 
> but task is already holding lock:
>  (&(&vcpu->arch.tbacct_lock)->rlock){..}, at: [] 
> .kvmppc_remove_runnable.part.2+0x34/0xd0 [kvm_hv]

This must come from here:

        while (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE &&
               (vc->vcore_state == VCORE_RUNNING ||
                vc->vcore_state == VCORE_EXITING)) {
                spin_unlock(&vc->lock);
                kvmppc_wait_for_exec(vcpu, TASK_UNINTERRUPTIBLE);
                spin_lock(&vc->lock);
        }

        if (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE) {
                kvmppc_remove_runnable(vc, vcpu);
                vcpu->stat.signal_exits++;
                kvm_run->exit_reason = KVM_EXIT_INTR;
                vcpu->arch.ret = -EINTR;
        }


if vc->vcore_state is VCORE_SLEEPING (I think it cannot be VCORE_STARTING)?

Paolo

Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

2013-05-27 Thread Paolo Bonzini
Il 21/05/2013 05:06, Alexey Kardashevskiy ha scritto:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.

Do you mean vio?  virtio (without getting into whether that's good or
bad) will bypass the iommu.

Paolo

> These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
> 
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
> 
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
> 
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> Signed-off-by: Paul Mackerras 
> 
> ---
> Changelog:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
>  Documentation/virtual/kvm/api.txt   |   14 ++
>  arch/powerpc/include/asm/kvm_host.h |2 +
>  arch/powerpc/include/asm/kvm_ppc.h  |   16 +-
>  arch/powerpc/kvm/book3s_64_vio.c|  118 ++
>  arch/powerpc/kvm/book3s_64_vio_hv.c |  266 
> +++
>  arch/powerpc/kvm/book3s_hv.c|   39 +
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |6 +
>  arch/powerpc/kvm/book3s_pr_papr.c   |   37 -
>  arch/powerpc/kvm/powerpc.c  |3 +
>  include/uapi/linux/kvm.h|1 +
>  10 files changed, 470 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 5f91eda..3c7c7ea 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
>  args[1] is the XICS CPU number (server ID) for this vcpu
>  
>  This capability connects the vcpu to an in-kernel XICS device.
> +
> +6.8 KVM_CAP_PPC_MULTITCE
> +
> +Architectures: ppc
> +Parameters: none
> +Returns: 0 on success; -1 on error
> +
> +This capability enables the guest to put/remove multiple TCE entries
> +per hypercall which significantly accelerates DMA operations for PPC KVM
> +guests.
> +
> +When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
> +expected to occur rather than H_PUT_TCE which supports only one TCE entry
> +per call.
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>   spinlock_t tbacct_lock;
>   u64 busy_stolen;
>   u64 busy_preempt;
> +
> + unsigned long *tce_tmp;   /* TCE cache for TCE_PUT_INDIRECT hcall */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..e852921b 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu 
> *vcpu);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> -  unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> + struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> + unsigned long ioba, unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, 

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-27 Thread Paolo Bonzini
Il 25/05/2013 04:45, David Gibson ha scritto:
>> >+   case KVM_CREATE_SPAPR_TCE_IOMMU: {
>> >+   struct kvm_create_spapr_tce_iommu create_tce_iommu;
>> >+   struct kvm *kvm = filp->private_data;
>> >+
>> >+   r = -EFAULT;
>> >+   if (copy_from_user(&create_tce_iommu, argp,
>> >+   sizeof(create_tce_iommu)))
>> >+   goto out;
>> >+   r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>> >&create_tce_iommu);
>> >+   goto out;
>> >+   }

Would it make sense to make this the only interface for creating TCEs?
That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
no hardware IOMMU usage), and have a single ioctl for both cases?
There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
kvm_vm_ioctl_create_spapr_tce_iommu.

KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
could just use a new capability and drop the old ioctl.  I'm not sure
whether you're already considering the ABI to be stable for kvmppc.

Paolo
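A rough sketch of what such a combined interface could look like; the struct layout, field names and ioctl number below are made up for illustration and are not part of any real UAPI:

    /* Hypothetical combined request: window_size as in the existing
     * KVM_CREATE_SPAPR_TCE, plus an IOMMU group id, with -1 meaning
     * "no hardware IOMMU backing". */
    struct kvm_create_spapr_tce_any {
            __u64 liobn;
            __u32 window_size;
            __s32 iommu_group;      /* -1: emulated only, no hardware IOMMU */
    };

    /* The ioctl number 0xaf is a placeholder, not a real assignment. */
    #define KVM_CREATE_SPAPR_TCE_ANY \
            _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_any)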


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-27 Thread Paolo Bonzini
Il 27/05/2013 16:26, Alexey Kardashevskiy ha scritto:
> On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
>> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>>> + case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>>> + struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>>> + struct kvm *kvm = filp->private_data;
>>>>> +
>>>>> + r = -EFAULT;
>>>>> + if (copy_from_user(&create_tce_iommu, argp,
>>>>> + sizeof(create_tce_iommu)))
>>>>> + goto out;
>>>>> + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>>> &create_tce_iommu);
>>>>> + goto out;
>>>>> + }
>>
>> Would it make sense to make this the only interface for creating TCEs?
>> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
>> no hardware IOMMU usage), and have a single ioctl for both cases?
>> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
>> kvm_vm_ioctl_create_spapr_tce_iommu.
> 
> Just a few bits. Is there really much sense in making one function from those
> two? I tried; it looked a bit messy.

Cannot really tell without the userspace bits.  But ioctl proliferation
is what the device and one_reg APIs were supposed to avoid...

>> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
>> could just use a new capability and drop the old ioctl.
> 
> The old capability+ioctl already exist for quite a while and few QEMU
> versions supporting it were released so we do not want just drop it. So
> then what is the benefit of having a new interface with support of both types?
> 
>>  I'm not sure
>> whether you're already considering the ABI to be stable for kvmppc.
> 
> Is any bit of KVM using it? Cannot see from Documentation/ABI.

I mean the userspace ABI (ioctls).

Paolo



Re: [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-06-17 Thread Paolo Bonzini
Il 05/06/2013 08:11, Alexey Kardashevskiy ha scritto:
> +/*
> + * The KVM guest can be backed with 16MB pages (qemu switch
> + * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).

Nitpick: we try to avoid references to QEMU, so perhaps

s/qemu switch/for example, with QEMU you can use the command-line option/

Paolo


Re: [PATCH v2 29/45] kvm/vmx: Use get/put_online_cpus_atomic() to prevent CPU offline

2013-06-26 Thread Paolo Bonzini
Il 25/06/2013 22:30, Srivatsa S. Bhat ha scritto:
> - cpu = get_cpu();
> + cpu = get_online_cpus_atomic();
>   vmx_vcpu_load(&vmx->vcpu, cpu);
>   vmx->vcpu.cpu = cpu;
>   err = vmx_vcpu_setup(vmx);
>   vmx_vcpu_put(&vmx->vcpu);
> - put_cpu();
> + put_online_cpus_atomic();

The new API has a weird name.  Why are you adding new functions instead
of just modifying get/put_cpu?

Paolo


Re: [PATCH v2 28/45] KVM: Use get/put_online_cpus_atomic() to prevent CPU offline

2013-06-26 Thread Paolo Bonzini
Il 25/06/2013 22:30, Srivatsa S. Bhat ha scritto:
> Once stop_machine() is gone from the CPU offline path, we won't be able
> to depend on disabling preemption to prevent CPUs from going offline
> from under us.
> 
> Use the get/put_online_cpus_atomic() APIs to prevent CPUs from going
> offline, while invoking from atomic context.
> 
> Cc: Gleb Natapov 
> Cc: Paolo Bonzini 
> Cc: k...@vger.kernel.org
> Signed-off-by: Srivatsa S. Bhat 
> ---
> 
>  virt/kvm/kvm_main.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 302681c..5bbfa30 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -174,7 +174,7 @@ static bool make_all_cpus_request(struct kvm *kvm, 
> unsigned int req)
>  
>   zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>  
> - me = get_cpu();
> + me = get_online_cpus_atomic();
>   kvm_for_each_vcpu(i, vcpu, kvm) {
>   kvm_make_request(req, vcpu);
>   cpu = vcpu->cpu;
> @@ -192,7 +192,7 @@ static bool make_all_cpus_request(struct kvm *kvm, 
> unsigned int req)
>   smp_call_function_many(cpus, ack_flush, NULL, 1);
>   else
>   called = false;
> - put_cpu();
> + put_online_cpus_atomic();
>   free_cpumask_var(cpus);
>   return called;
>  }
> @@ -1707,11 +1707,11 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
>   ++vcpu->stat.halt_wakeup;
>   }
>  
> - me = get_cpu();
> + me = get_online_cpus_atomic();
>   if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
>   if (kvm_arch_vcpu_should_kick(vcpu))
>   smp_send_reschedule(cpu);
> - put_cpu();
> + put_online_cpus_atomic();
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
>  #endif /* !CONFIG_S390 */
> 

Acked-by: Paolo Bonzini 



Re: [PATCH v2 29/45] kvm/vmx: Use get/put_online_cpus_atomic() to prevent CPU offline

2013-06-26 Thread Paolo Bonzini
Il 26/06/2013 10:06, Srivatsa S. Bhat ha scritto:
> On 06/26/2013 01:16 PM, Paolo Bonzini wrote:
>> Il 25/06/2013 22:30, Srivatsa S. Bhat ha scritto:
>>> -   cpu = get_cpu();
>>> +   cpu = get_online_cpus_atomic();
>>> vmx_vcpu_load(&vmx->vcpu, cpu);
>>> vmx->vcpu.cpu = cpu;
>>> err = vmx_vcpu_setup(vmx);
>>> vmx_vcpu_put(&vmx->vcpu);
>>> -   put_cpu();
>>> +   put_online_cpus_atomic();
>>
>> The new API has a weird name.  Why are you adding new functions instead
>> of just modifying get/put_cpu?
>>
> 
> Because the purpose of those two functions are distinctly different
> from each other.
> 
> get/put_cpu() is used to disable preemption on the local CPU. (Which
> also disables offlining the local CPU during that critical section).

Ok, then I understood correctly... and I acked the other KVM patch.

However, keeping the code on the local CPU is exactly the point of this
particular use of get_cpu()/put_cpu().  Why does it need to synchronize
with offlining of other CPUs?

Paolo

> What this patchset deals with is synchronizing with offline of *any*
> CPU. Typically, we use get_online_cpus()/put_online_cpus() for that
> purpose. But they can't be used in atomic context, because they take
> mutex locks and hence can sleep.
> 
> So the code that executes in atomic context and which wants to prevent
> *any* CPU from going offline, used to disable preemption around its
> critical section. Disabling preemption prevents stop_machine(), and
> CPU offline (of *any* CPU) was done via stop_machine(). So disabling
> preemption disabled any CPU from going offline, as a *side-effect*.
> 
> And this patchset prepares the ground for getting rid of stop_machine()
> in the CPU offline path. Which means, disabling preemption only prevents
> the *local* CPU from going offline. So if code in atomic context wants
> to prevent any CPU from going offline, we need a new set of APIs, like
> get/put_online_cpus(), but which can be invoked from atomic context.
> That's why I named it as get/put_online_cpus_atomic().
> 
> One of the key points here is that we want to preserve get/put_cpu()
> as it is, since its purpose is different - disable preemption and
> offline of the local CPU. There is no reason to change that API, its
> useful as it is.




Re: [PATCH v2 29/45] kvm/vmx: Use get/put_online_cpus_atomic() to prevent CPU offline

2013-06-26 Thread Paolo Bonzini
Il 26/06/2013 10:41, Srivatsa S. Bhat ha scritto:
> On 06/26/2013 01:53 PM, Paolo Bonzini wrote:
>> Il 26/06/2013 10:06, Srivatsa S. Bhat ha scritto:
>>> On 06/26/2013 01:16 PM, Paolo Bonzini wrote:
>>>> Il 25/06/2013 22:30, Srivatsa S. Bhat ha scritto:
>>>>> - cpu = get_cpu();
>>>>> + cpu = get_online_cpus_atomic();
>>>>>   vmx_vcpu_load(&vmx->vcpu, cpu);
>>>>>   vmx->vcpu.cpu = cpu;
>>>>>   err = vmx_vcpu_setup(vmx);
>>>>>   vmx_vcpu_put(&vmx->vcpu);
>>>>> - put_cpu();
>>>>> + put_online_cpus_atomic();
>>>>
>>>> The new API has a weird name.  Why are you adding new functions instead
>>>> of just modifying get/put_cpu?
>>>>
>>>
>>> Because the purpose of those two functions are distinctly different
>>> from each other.
>>>
>>> get/put_cpu() is used to disable preemption on the local CPU. (Which
>>> also disables offlining the local CPU during that critical section).
>>
>> Ok, then I understood correctly... and I acked the other KVM patch.
>>
> 
> Thank you!
>  
>> However, keeping the code on the local CPU is exactly the point of this
>> particular use of get_cpu()/put_cpu().  Why does it need to synchronize
>> with offlining of other CPUs?
> 
> Now that I looked at it again, I think you are right, get/put_cpu() is
> good enough here.
> 
> But let me explain why I initially thought we needed full synchronization
> with CPU offline. In short, I wanted to synchronize the calls to
> __loaded_vmcs_clear(). We have the scenario shown below:
> 
> CPU offline:
>   CPU_DYING:
>     hardware_disable();
>       -> vmclear_local_loaded_vmcss();
>         -> __loaded_vmcs_clear(v);
> 
> And vmx_vcpu_load() (among others) can do:
>   vmx_vcpu_load();
>     -> loaded_vmcs_clear();
>       -> __loaded_vmcs_clear();
> 
> 
> So I wanted to avoid this race-condition and hence wrapped the code with
> get/put_online_cpus_atomic().
> 
> But the point I missed earlier is that loaded_vmcs_clear() calls
> __loaded_vmcs_clear() using smp_call_function_single(), which itself
> synchronizes properly with CPU hotplug. So there is no need to add full
> hotplug synchronization in the vmx code, as you noted above.

Makes sense, and I see now that it's patch 9 in this series.

In general, I would rather add an extra get_online_cpus_atomic pair
where it it actually needed (i.e. closer to where cpu_online is actually
used), and leave get_cpu/put_cpu as is in the caller... which is exactly
what happens in this case, since "where it is actually needed" is "in
smp_call_function_single()".

> So, please ignore this patch, and sorry for the noise!

No problem, thanks for the heads-up.

Paolo



Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-22 Thread Paolo Bonzini


On 21/07/2015 15:55, Vlastimil Babka wrote:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2d73807..a8723a8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
>   struct page *pages;
>   struct vmcs *vmcs;
>  
> - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
> + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
>   if (!pages)
>   return NULL;
>   vmcs = page_address(pages);

Even though there's a pretty strong preference for the "right" node,
things can work if the node is the wrong one.  The order is always zero
in practice, so the allocation should succeed.

Paolo

Re: [PATCH] kvm:powerpc:Fix error handling in the function mpic_set_default_irq_routing

2015-08-07 Thread Paolo Bonzini


On 06/08/2015 19:13, Nicholas Krause wrote:
> diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
> index 6249cdc..5a18859 100644
> --- a/arch/powerpc/kvm/mpic.c
> +++ b/arch/powerpc/kvm/mpic.c
> @@ -1641,13 +1641,16 @@ static void mpic_destroy(struct kvm_device *dev)
>  static int mpic_set_default_irq_routing(struct openpic *opp)
>  {
>   struct kvm_irq_routing_entry *routing;
> + int ret;
>  
>   /* Create a nop default map, so that dereferencing it still works */
>   routing = kzalloc((sizeof(*routing)), GFP_KERNEL);
>   if (!routing)
>   return -ENOMEM;
>  
> - kvm_set_irq_routing(opp->kvm, routing, 0, 0);
> + ret = kvm_set_irq_routing(opp->kvm, routing, 0, 0);
> + if (ret)
> + return ret;
>  
>   kfree(routing);
>   return 0;

The patch leaks the "routing" variable if you hit the error path.

Paolo

Re: [PATCH] kvm:powerpc:Fix error handling in the function mpic_set_default_irq_routing

2015-08-07 Thread Paolo Bonzini


On 07/08/2015 15:47, Nicholas Krause wrote:
> - kvm_set_irq_routing(opp->kvm, routing, 0, 0);
> + ret = kvm_set_irq_routing(opp->kvm, routing, 0, 0);
> + if (ret) {
> + kfree(routing);
> + return ret;
> + }
>  
>   kfree(routing);
>   return 0;

You could just return ret here.  The new "if" is not necessary.

Paolo
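Putting both review comments together, the function could end up looking roughly like this (a sketch, not the committed fix):

    static int mpic_set_default_irq_routing(struct openpic *opp)
    {
            struct kvm_irq_routing_entry *routing;
            int ret;

            /* Create a nop default map, so that dereferencing it still works */
            routing = kzalloc(sizeof(*routing), GFP_KERNEL);
            if (!routing)
                    return -ENOMEM;

            ret = kvm_set_irq_routing(opp->kvm, routing, 0, 0);

            /* routing must be freed whether the call succeeded or failed */
            kfree(routing);
            return ret;
    }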

Re: [PATCH 2/2] ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag

2013-08-26 Thread Paolo Bonzini
Il 25/08/2013 17:04, Alexander Graf ha scritto:
> 
> On 24.08.2013, at 21:14, Yann Droneaud wrote:
> 
>> KVM uses anon_inode_get() to allocate file descriptors as part
>> of some of its ioctls. But those ioctls are lacking a flag argument
>> allowing userspace to choose options for the newly opened file descriptor.
>>
>> In such case it's advised to use O_CLOEXEC by default so that
>> userspace is allowed to choose, without race, if the file descriptor
>> is going to be inherited across exec().
>>
>> This patch sets the O_CLOEXEC flag on all file descriptors created
>> with anon_inode_getfd() to not leak file descriptors across exec().
>>
>> Signed-off-by: Yann Droneaud 
>> Link: http://lkml.kernel.org/r/cover.1377372576.git.ydrone...@opteya.com
> 
> Reviewed-by: Alexander Graf 
> 
> Would it make sense to simply inherit the O_CLOEXEC flag from the
> parent kvm fd instead? That would give user space the power to keep
> fds across exec() if it wants to.

Does it make sense to use non-O_CLOEXEC file descriptors with KVM at
all?  Besides fork() not being supported by KVM, as described in
Documentation/virtual/kvm/api.txt, the VMAs of the parent process go
away as soon as you exec().  I'm not sure how you can use the inherited
file descriptor in a sensible way after exec().

Paolo


Re: [PATCH 2/2] ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag

2013-08-26 Thread Paolo Bonzini
Il 26/08/2013 10:23, Yann Droneaud ha scritto:
> 
> Sounds a lot like InfiniBand subsystem behavior: IB file descriptors
> are of no use accross exec() since memory mappings tied to those fds
> won't be available in the new process:
> 
> https://lkml.org/lkml/2013/7/8/380
> http://mid.gmane.org/f58540dc64fec1ac0e496dfcd3cc1...@meuh.org

Yes, it is very similar.

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-09-26 Thread Paolo Bonzini
Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> Some powernv systems include a hwrng. Guests can access it via the
> H_RANDOM hcall.
> 
> We add a real mode implementation of H_RANDOM when a hwrng is found.
> Userspace can detect the presence of the hwrng by quering the
> KVM_CAP_PPC_HWRNG capability.
> 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/archrandom.h   |  11 ++-
>  arch/powerpc/include/asm/kvm_ppc.h  |   2 +
>  arch/powerpc/kvm/book3s_hv_builtin.c|  15 
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 119 
> 
>  arch/powerpc/kvm/powerpc.c  |   3 +
>  arch/powerpc/platforms/powernv/rng.c|  25 +++
>  include/uapi/linux/kvm.h|   1 +
>  7 files changed, 174 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/archrandom.h 
> b/arch/powerpc/include/asm/archrandom.h
> index d853d16..86296d5 100644
> --- a/arch/powerpc/include/asm/archrandom.h
> +++ b/arch/powerpc/include/asm/archrandom.h
> @@ -25,8 +25,15 @@ static inline int arch_get_random_int(unsigned int *v)
>   return rc;
>  }
>  
> -int powernv_get_random_long(unsigned long *v);
> -
>  #endif /* CONFIG_ARCH_RANDOM */
>  
> +#ifdef CONFIG_PPC_POWERNV
> +int powernv_hwrng_present(void);
> +int powernv_get_random_long(unsigned long *v);
> +int powernv_get_random_real_mode(unsigned long *v);
> +#else
> +static inline int powernv_hwrng_present(void) { return 0; }
> +static inline int powernv_get_random_real_mode(unsigned long *v) { return 0; 
> }
> +#endif
> +
>  #endif /* _ASM_POWERPC_ARCHRANDOM_H */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index b15554a..51966fd 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -177,6 +177,8 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, 
> u32 *server,
>  extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
>  extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
>  
> +extern int kvmppc_hwrng_present(void);
> +
>  /*
>   * Cuts out inst bits with ordering according to spec.
>   * That means the leftmost bit is zero. All given bits are included.
> diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
> b/arch/powerpc/kvm/book3s_hv_builtin.c
> index 8cd0dae..de12592 100644
> --- a/arch/powerpc/kvm/book3s_hv_builtin.c
> +++ b/arch/powerpc/kvm/book3s_hv_builtin.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "book3s_hv_cma.h"
>  /*
> @@ -181,3 +182,17 @@ void __init kvm_cma_reserve(void)
>   kvm_cma_declare_contiguous(selected_size, align_size);
>   }
>  }
> +
> +int kvmppc_hwrng_present(void)
> +{
> + return powernv_hwrng_present();
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_hwrng_present);
> +
> +long kvmppc_h_random(struct kvm_vcpu *vcpu)
> +{
> + if (powernv_get_random_real_mode(&vcpu->arch.gpr[4]))
> + return H_SUCCESS;
> +
> + return H_HARDWARE;
> +}
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 294b7af..35ce59e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1502,6 +1502,125 @@ hcall_real_table:
>   .long   0   /* 0x11c */
>   .long   0   /* 0x120 */
>   .long   .kvmppc_h_bulk_remove - hcall_real_table
> + .long   0   /* 0x128 */
> + .long   0   /* 0x12c */
> + .long   0   /* 0x130 */
> + .long   0   /* 0x134 */
> + .long   0   /* 0x138 */
> + .long   0   /* 0x13c */
> + .long   0   /* 0x140 */
> + .long   0   /* 0x144 */
> + .long   0   /* 0x148 */
> + .long   0   /* 0x14c */
> + .long   0   /* 0x150 */
> + .long   0   /* 0x154 */
> + .long   0   /* 0x158 */
> + .long   0   /* 0x15c */
> + .long   0   /* 0x160 */
> + .long   0   /* 0x164 */
> + .long   0   /* 0x168 */
> + .long   0   /* 0x16c */
> + .long   0   /* 0x170 */
> + .long   0   /* 0x174 */
> + .long   0   /* 0x178 */
> + .long   0   /* 0x17c */
> + .long   0   /* 0x180 */
> + .long   0   /* 0x184 */
> + .long   0   /* 0x188 */
> + .long   0   /* 0x18c */
> + .long   0   /* 0x190 */
> + .long   0   /* 0x194 */
> + .long   0   /* 0x198 */
> + .long   0   /* 0x19c */
> + .long   0   /* 0x1a0 */
> + .long   0   /* 0x1a4 */
> + .long   0   /* 0x1a8 */
> + .long   0   /* 0x1ac */
> + .long   0   /* 0x1b0 */
> + .long   0 

Re: [RFC PATCH 09/11] kvm: simplify processor compat check

2013-09-27 Thread Paolo Bonzini
Il 27/09/2013 15:13, Aneesh Kumar K.V ha scritto:
> Alexander Graf  writes:
> 
>> On 27.09.2013, at 12:03, Aneesh Kumar K.V wrote:
>>
>>> From: "Aneesh Kumar K.V" 
>>
>> Missing patch description.
>>
>>> Signed-off-by: Aneesh Kumar K.V 
>>
>> I fail to see how this really simplifies things, but at the end of the
>> day it's Gleb's and Paolo's call.
> 
> will do. It avoids calling
> 
>   for_each_online_cpu(cpu) {
>   smp_call_function_single() 
> 
> on multiple architecture.

I agree with Alex.

The current code is not especially awesome; having
kvm_arch_check_processor_compat take an int* disguised as a void* is a
bit ugly indeed.

However, the API makes sense and tells you that it is being passed as a
callback (to smp_call_function_single in this case).

You are making the API more complicated to use on the arch layer
(because arch maintainers now have to think "do I need to check this on
all online CPUs?") and making the "leaf" POWER code less legible because
it still has the weird void()(void *) calling convention.

If anything, you could change kvm_arch_check_processor_compat to return
an int and accept no argument, and introduce a wrapper that kvm_init
passes to smp_call_function_single.
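A rough sketch of that alternative (function and variable names are illustrative):

    /* Arch hook: no opaque argument, plain int return. */
    int kvm_arch_check_processor_compat(void);

    /* Wrapper with the void (*)(void *) signature that
     * smp_call_function_single() expects; kvm_init() would pass this
     * instead of the arch hook itself. */
    static void check_processor_compat(void *rtn)
    {
            *(int *)rtn = kvm_arch_check_processor_compat();
    }

    /* In kvm_init(): */
    for_each_online_cpu(cpu) {
            smp_call_function_single(cpu, check_processor_compat, &r, 1);
            if (r < 0)
                    goto out_free_1;
    }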

Paolo

> We also want to make the smp call function a callback of opaque. Hence
> this should be made arch specific. 
> 
> int kvm_arch_check_processor_compat(void *opaque)
> {
>   int r,cpu;
>   struct kvmppc_ops *kvm_ops = (struct kvmppc_ops *)opaque;
>   for_each_online_cpu(cpu) {
>   smp_call_function_single(cpu,
>kvm_ops->check_processor_compat,
>&r, 1);
>   if (r < 0)
>   break;
>   }
>   return r;
> }
> 
> against
> 
> - for_each_online_cpu(cpu) {
> - smp_call_function_single(cpu,
> - kvm_arch_check_processor_compat,
> - &r, 1);
> - if (r < 0)
> - goto out_free_1;
> - }
> +
> + r = kvm_arch_check_processor_compat(opaque);
> + if (r < 0)
> + goto out_free_1;
> 
> 
> 
>>
>> Which brings me to the next issue: You forgot to CC kvm@vger on your
>> patch set. Gleb and Paolo don't read kvm-ppc@vger. And they shouldn't
>> have to. Every kvm patch that you want review on or that should get
>> applied needs to be sent to kvm@vger. If you want to tag it as PPC
>> specific patch, do so by CC'ing kvm-ppc@vger.
> 
> Will do in the next update
> 
> -aneesh
> 



Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Paolo Bonzini
Il 01/10/2013 10:34, Michael Ellerman ha scritto:
>> If you really want to have the hypercall, implementing it in QEMU means
>> that you can support it on all systems, in fact even when running
>> without KVM.  
> 
> Sure, I can add a fallback to /dev/hwrng for full emulation.
> 
>> The QEMU command line would be something like "-object
>> rng-random,filename=/dev/random,id=rng0 -device spapr-rng,rng=rng0".
> 
> We can't use /dev/random like that. The PAPR specification, which is
> what we're implementing, implies that H_RANDOM provides data from a
> hardware source.

Then use /dev/hwrng.

I don't have POWER machines, but I still want to be able to test as much
as possible using emulation.

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Paolo Bonzini
Il 01/10/2013 11:38, Benjamin Herrenschmidt ha scritto:
> So for the sake of that dogma you are going to make us do something that
> is about 100 times slower ? (and possibly involves more lines of code)

If it's 100 times slower there is something else that's wrong.  It's
most likely not 100 times slower, and this makes me wonder if you or
Michael actually timed the code at all.

> It's not just speed ... H_RANDOM is going to be called by the guest
> kernel. A round trip to qemu is going to introduce a kernel jitter
> (complete stop of operations of the kernel on that virtual processor) of
> a full exit + round trip to qemu + back to the kernel to get to some
> source of random number ...  this is going to be in the dozens of ns at
> least.

I guess you mean dozens of *micro*seconds, which is somewhat exaggerated
but not too much.  On x86 some reasonable timings are:

  100 cycles    bare metal rdrand
  2000 cycles   guest->hypervisor->guest
  15000 cycles  guest->userspace->guest

(100 cycles = 40 ns = 200 MB/sec; 2000 cycles = ~1 microseconds; 15000
cycles = ~7.5 microseconds).  Even on 5 year old hardware, a userspace
roundtrip is around a dozen microseconds.
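(For reference, those conversions assume a clock of roughly 2.5 GHz: 100 cycles / 2.5 GHz = 40 ns, and one 8-byte rdrand result every 40 ns is about 200 MB/sec.)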


Anyhow, I would like to know more about this hwrng and hypercall.

Does the hwrng return random numbers (like rdrand) or real entropy (like
rdseed that Intel will add in Broadwell)?  What about the hypercall?
For example virtio-rng is specified to return actual entropy, it doesn't
matter if it is from hardware or software.

In either case, the patches have problems.

1) If the hwrng returns random numbers, the whitening you're doing is
totally insufficient and patch 2 is forging entropy that doesn't exist.

2) If the hwrng returns entropy, a read from the hwrng is going to even
more expensive than an x86 rdrand (perhaps ~2000 cycles).  Hence, doing
the emulation in the kernel is even less necessary.  Also, if the hwrng
returns entropy patch 1 is unnecessary: you do not need to waste
precious entropy bits by passing them to arch_get_random_long; just run
rngd in the host as that will put the entropy to much better use.

3) If the hypercall returns random numbers, then it is a pretty
braindead interface since returning 8 bytes at a time limits the
throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
 But more important: in this case drivers/char/hw_random/pseries-rng.c
is completely broken and insecure, just like patch 2 in case (1) above.
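(At roughly one microsecond per guest/hypervisor round trip, 8 bytes per call works out to around 8 MB/sec.)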

4) If the hypercall returns entropy (same as virtio-rng), the same
considerations on speed apply.  If you can only produce entropy at say 1
MB/s (so reading 8 bytes take 8 microseconds---which is actually very
fast), it doesn't matter that much to spend 7 microseconds on a
userspace roundtrip.  It's going to be only half the speed of bare
metal, not 100 times slower.


Also, you will need _anyway_ extra code that is not present here to
either disable the rng based on userspace command-line, or to emulate
the rng from userspace.  It is absolutely _not_ acceptable to have a
hypercall disappear across migration.  You're repeatedly ignoring these
issues, but rest assured that they will come back and bite you
spectacularly.

Based on all this, I would simply ignore the part of the spec where they
say "the hypercall should return numbers from a hardware source".  All
that matters in virtualization is to have a good source of _entropy_.
Then you can run rngd without randomness checks, which will more than
recover the cost of userspace roundtrips.

In any case, deciding where to get that entropy from is definitely
outside the scope of KVM, and in fact QEMU already has a configurable
mechanism for that.

Paolo


Re: [RFC PATCH 00/11] Allow PR and HV KVM to coexist in one kernel

2013-10-01 Thread Paolo Bonzini
Il 01/10/2013 13:36, Alexander Graf ha scritto:

>>> Yes, so we should verify in the machine models that we're runnable with
>>> the currently selected type at least, to give the user a sensible error
>>> message.
>> Something like the below
> 
> I like that one a lot. Andreas, Paolo, what do you think?

Yes, it's fine.

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Paolo Bonzini
Il 01/10/2013 23:44, Benjamin Herrenschmidt ha scritto:
> On Tue, 2013-10-01 at 13:19 +0200, Paolo Bonzini wrote:
>> Il 01/10/2013 11:38, Benjamin Herrenschmidt ha scritto:
>>> So for the sake of that dogma you are going to make us do something that
>>> is about 100 times slower ? (and possibly involves more lines of code)
>>
>> If it's 100 times slower there is something else that's wrong.  It's
>> most likely not 100 times slower, and this makes me wonder if you or
>> Michael actually timed the code at all.
> 
> So no we haven't measured. But it is going to be VERY VERY VERY much
> slower. Our exit latencies are bad with our current MMU *and* any exit
> is going to cause all secondary threads on the core to have to exit as
> well (remember P7 is 4 threads, P8 is 8)

Ok, this is indeed the main difference between Power and x86.

>>   100 cyclesbare metal rdrand
>>   2000 cycles   guest->hypervisor->guest
>>   15000 cycles  guest->userspace->guest
>>
>> (100 cycles = 40 ns = 200 MB/sec; 2000 cycles = ~1 microseconds; 15000
>> cycles = ~7.5 microseconds).  Even on 5 year old hardware, a userspace
>> roundtrip is around a dozen microseconds.
> 
> So in your case going to qemu to "emulate" rdrand would indeed be 150
> times slower, I don't see in what universe that would be considered a
> good idea.

rdrand is not privileged on x86, guests can use it.  But my point is
that going to the kernel is already 20 times slower.  Getting entropy
(not just a pseudo-random number seeded by the HWRNG) with rdrand is
~1000 times slower according to Intel's recommendations, so the
roundtrip to userspace is entirely invisible in that case.

The numbers for PPC seem to be a bit different though (it's faster to
read entropy, and slower to do a userspace exit).

> It's a random number obtained from sampling a set of oscillators. It's
> slightly biased but we have very simple code (I believe shared with the
> host kernel implementation) for whitening it as is required by PAPR.

Good.  Actually, passing the dieharder tests does not mean much (an
AES-encrypted counter should also pass them with flying colors), but
if it's specified by the architecture gods it's likely to have received
some scrutiny.

>> 2) If the hwrng returns entropy, a read from the hwrng is going to even
>> more expensive than an x86 rdrand (perhaps ~2000 cycles).
> 
> Depends how often you read, the HW I think is sampling asynchronously so
> you only block on the MMIO if you already consumed the previous sample
> but I'll let Paulus provide more details here.

Given Paul's description, there's indeed very little extra cost compared
to a "nop" hypercall.  That's nice.

Still, considering that QEMU code has to be there anyway for
compatibility, kernel emulation is not particularly necessary IMHO.  I
would of course like to see actual performance numbers, but besides that
are you ever going to see this in the profile except if you run "dd
if=/dev/hwrng of=/dev/null"?

Can you instrument pHyp to find out how many times per second this
hypercall is called by a "normal" Linux or AIX guest?

>> 3) If the hypercall returns random numbers, then it is a pretty
>> braindead interface since returning 8 bytes at a time limits the
>> throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
>>  But more important: in this case drivers/char/hw_random/pseries-rng.c
>> is completely broken and insecure, just like patch 2 in case (1) above.
> 
> How so ?

Paul confirmed that it returns real entropy so this is moot.

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Paolo Bonzini
Il 02/10/2013 07:09, Paul Mackerras ha scritto:
> On Tue, Oct 01, 2013 at 01:19:06PM +0200, Paolo Bonzini wrote:
> 
>> Anyhow, I would like to know more about this hwrng and hypercall.
>>
>> Does the hwrng return random numbers (like rdrand) or real entropy (like
>> rdseed that Intel will add in Broadwell)?  What about the hypercall?
> 
> Well, firstly, your terminology is inaccurate.  Real entropy will give
> you random numbers.  I think when you say "random numbers" you
> actually mean "pseudo-random numbers".

Yes---I meant pseudo-random numbers where the generator is periodically
seeded by a random number.

> Secondly, the RNG produces real entropy.

Good to know, thanks.

> Not sure why they are particularly "precious"; we get 64 bits per
> microsecond whether we use them or not.  What are you suggesting
> arch_get_random_long() should do instead?

If you are running rngd, there is no need to have arch_get_random_long()
at all.

>> 3) If the hypercall returns random numbers, then it is a pretty
>> braindead interface since returning 8 bytes at a time limits the
>> throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
>>  But more important: in this case drivers/char/hw_random/pseries-rng.c
>> is completely broken and insecure, just like patch 2 in case (1) above.
> 
> Assuming that by "random numbers" you actually mean "pseudo-random
> numbers", then this doesn't apply.

Indeed.

>> 4) If the hypercall returns entropy (same as virtio-rng), the same
>> considerations on speed apply.  If you can only produce entropy at say 1
>> MB/s (so reading 8 bytes take 8 microseconds---which is actually very
>> fast), it doesn't matter that much to spend 7 microseconds on a
>> userspace roundtrip.  It's going to be only half the speed of bare
>> metal, not 100 times slower.
> 
> 8 bytes takes at most 1 microsecond, so the round-trip to userspace is
> definitely noticeable.

Thanks.  Any chance you can give some numbers of a kernel hypercall and
a userspace hypercall on Power, so we have actual data?  For example a
hypercall that returns H_PARAMETER as soon as possible.

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Paolo Bonzini
Il 02/10/2013 16:08, Alexander Graf ha scritto:
> > The hwrng is accessible by host userspace via /dev/mem.
> 
> A guest should live on the same permission level as a user space
> application. If you run QEMU as UID 1000 without access to /dev/mem, why
> should the guest suddenly be able to directly access a memory location
> (MMIO) it couldn't access directly through a normal user space interface.
> 
> It's basically a layering violation.

With Michael's earlier patch in this series, the hwrng is accessible by
host userspace via /dev/hwrng, no?

Paolo


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Paolo Bonzini
Il 02/10/2013 16:36, Alexander Graf ha scritto:
>> > 
>> > With Michael's earlier patch in this series, the hwrng is accessible by
>> > host userspace via /dev/hwrng, no?
> Yes, but there's no token from user space that gets passed into the
> kernel to check whether access is ok or not. So while QEMU may not have
> permission to open /dev/hwrng, it could spawn a guest that opens it,
> drains all entropy out of it and thus stalls other processes which try to
> fetch entropy, no?
> 
> Maybe I haven't fully grasped the interface yet though :).

Yes, that's right.  I don't think it's a huge problem, but it's another
point in favor of just doing the hypercall in userspace.

Paolo


Re: [RFC PATCH 3/3] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-03 Thread Paolo Bonzini

Il 03/06/2014 09:02, Michal Nazarewicz ha scritto:

On Tue, Jun 03 2014, Joonsoo Kim wrote:

Now, we have general CMA reserved area management framework,
so use it for future maintainability. There is no functional change.

Signed-off-by: Joonsoo Kim 


Acked-by: Michal Nazarewicz 



Acked-by: Paolo Bonzini 

Aneesh, can you test this series?

Paolo

Re: [PATCH 1/2] Revert "KVM: PPC: Book3S HV: Add new state for transactional memory"

2014-03-10 Thread Paolo Bonzini

Il 10/03/2014 11:50, Paul Mackerras ha scritto:

We can either do this revert, or apply a patch removing the extra
hunk, but one or the other should go in for 3.14 since it's quite
broken as it is (that is, HV-mode KVM on powerpc is broken).

Paolo, do you have a preference about revert vs. fix?  Are you happy
to take what Aneesh sent (in which case please add my acked-by and
perhaps edit the commentary to say how the problem arose), or do you
want a freshly-prepared patch, and if so against which branch?


I prefer a fix.

Paolo

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-20 Thread Paolo Bonzini


On 20/02/2015 14:45, Alexander Graf wrote:
> 
> 
> On 18.02.15 10:32, Bogdan Purcareata wrote:
>> This patchset enables running KVM SMP guests with external interrupts on an
>> underlying RT-enabled Linux. Previous to this patch, a guest with in-kernel 
>> MPIC
>> emulation could easily panic the kernel due to preemption when delivering 
>> IPIs
>> and external interrupts, because of the openpic spinlock becoming a sleeping
>> mutex on PREEMPT_RT_FULL Linux.
>>
>> 0001: converts the openpic spinlock to a raw spinlock, in order to circumvent
>> this behavior. While this change is targeted for a RT enabled Linux, it has 
>> no
>> effect on upstream kvm-ppc, so send it upstream for better future 
>> maintenance.
>>
>> 0002: introduces a limit on the maximum VCPUs a guest can have, in order to
>> prevent potential DoS attack due to large system latencies. This patch is
>> targeted to RT (due to CONFIG_PREEMPT_RT_FULL), but it can also be applied on
>> upstream Linux, with no effect. Not sure if it's best to send it upstream and
>> have a hanging CONFIG_PREEMPT_RT_FULL check there, with no effect, or send it
>> against linux-stable-rt. Please apply as you consider appropriate.
> 
> Thomas, what is the usual approach for patches like this? Do you take
> them into your rt tree or should they get integrated to upstream?

Patch 1 is definitely suitable for upstream, that's the reason why we
have raw_spin_lock vs. raw_spin_unlock.

Paolo

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-20 Thread Paolo Bonzini


On 20/02/2015 15:54, Sebastian Andrzej Siewior wrote:
> Usually you see "scheduling while atomic" on -RT and convert them to
> raw locks if it is appropriate.
> 
> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in order
> not to cause a DoS and large latencies in the host. I haven't seen an
> answer to my why question. Because if the conversion leads to
> large latencies in the host then it does not look right.
> 
> Each host PIC has a rawlock and does mostly just mask/unmask and the
> raw lock makes sure the value written is not mixed up due to
> preemption.
> This hardly increase latencies because the "locked" path is very short.
> If this conversion leads to higher latencies then the locked path is
> too long and hardly suitable to become a rawlock.

Yes, but large latencies just mean the code has to be rewritten (x86,
for example, no longer does event injection in atomic regions).
Until it is, using raw_spin_lock is correct.

Paolo
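For readers unfamiliar with the PREEMPT_RT distinction being discussed: on PREEMPT_RT_FULL a spinlock_t becomes a sleeping mutex, while raw_spinlock_t keeps the traditional spin-with-preemption-disabled behaviour. A minimal sketch of the kind of conversion patch 1 performs (illustrative only, not the actual openpic code):

    static DEFINE_RAW_SPINLOCK(openpic_lock);   /* was: DEFINE_SPINLOCK() */

    static void mask_irq_source(void)
    {
            unsigned long flags;

            /* was: spin_lock_irqsave()/spin_unlock_irqrestore() */
            raw_spin_lock_irqsave(&openpic_lock, flags);
            /* very short critical section: mask/unmask one interrupt source */
            raw_spin_unlock_irqrestore(&openpic_lock, flags);
    }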

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-20 Thread Paolo Bonzini


On 20/02/2015 16:06, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 03:57 PM, Paolo Bonzini wrote:

>> Yes, but large latencies just mean the code has to be rewritten (x86,
>> for example, no longer does event injection in atomic regions).
>> Until it is, using raw_spin_lock is correct.
> 
> It does not sound like it. It sounds more like disabling interrupts to
> get things run faster and then limit it on a different corner to not
> blow up everything.

"This patchset enables running KVM SMP guests with external interrupts
on an underlying RT-enabled Linux. Previous to this patch, a guest with
in-kernel MPIC emulation could easily panic the kernel due to preemption
when delivering IPIs and external interrupts, because of the openpic
spinlock becoming a sleeping mutex on PREEMPT_RT_FULL Linux".

> Max latencies was decreased "Max latency (us)  7062" and that
> is why this is done? For 8 us and possible DoS in case there are too
> many cpus?

My understanding is that:

1) netperf can get you a BUG in KVM, and raw_spinlock fixes that

2) cyclictest did not trigger the BUG, and you can also get reduced
latency from using raw_spinlock.

I think we agree that (2) is not a factor in accepting the patch.

Paolo

>> Paolo
>>
> 
> Sebastian

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-26 Thread Paolo Bonzini


On 24/02/2015 00:27, Scott Wood wrote:
> This isn't a host PIC driver.  It's guest PIC emulation, some of which
> is indeed not suitable for a rawlock (in particular, openpic_update_irq
> which loops on the number of vcpus, with a loop body that calls
> IRQ_check() which loops over all pending IRQs).

The question is what behavior is wanted of code that isn't quite
RT-ready.  What is preferred, bugs or bad latency?

If the answer is bad latency (which can be avoided simply by not running
KVM on a RT kernel in production), patch 1 can be applied.  If the
answer is bugs, patch 1 is not upstream material.

I myself prefer to have bad latency; if something takes a spinlock in
atomic context, that spinlock should be raw.  If it hurts (latency),
don't do it (i.e. don't use the affected code).

Paolo

> The vcpu limits are a
> temporary bandaid to avoid the worst latencies, but I'm still skeptical
> about this being upstream material.


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-27 Thread Paolo Bonzini


On 27/02/2015 02:05, Scott Wood wrote:
> Obviously leaving it in a buggy state is not what we want -- but I lean
> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
> in-kernel MPIC emulation (which is itself just an optimization -- you
> can still use KVM without it).  This way people don't enable it with RT
> without being aware of the issue, and there's more of an incentive to
> fix it properly.

That would indeed work for me.

Paolo

Re: [PATCH 0/7] KVM/X86/PPC: Clear up kvm mmu memory barriers and update related comments

2016-03-14 Thread Paolo Bonzini


On 13/03/2016 04:10, Lan Tianyu wrote:
> This series is to clear up kvm mmu memory barriers.
> 1) Remove redundant barrier (PATCH 1)
> 2) Replace original barrier functions with preferable ones (PATCH 2, 3, 5)
> 3) Fix unpaired barriers (PATCH 4)
> 4) Update or add barrier related comments (PATCH 6, 7)

Thanks, this looks pretty good!  I will apply it for 4.6 if I have to
send two pull requests during this merge window; otherwise, it will have
to wait for the next merge window.

Paolo

Re: [PATCH kernel] KVM: PPC: Create a virtual-mode only TCE table handlers

2016-03-18 Thread Paolo Bonzini


On 18/03/2016 10:16, Paul Mackerras wrote:
> On Fri, Mar 18, 2016 at 01:50:42PM +1100, Alexey Kardashevskiy wrote:
>> Upcoming in-kernel VFIO acceleration needs different handling in real
>> and virtual modes which makes it hard to support both modes in
>> the same handler.
>>
>> This creates a copy of kvmppc_rm_h_stuff_tce and kvmppc_rm_h_put_tce
>> in addition to the existing kvmppc_rm_h_put_tce_indirect.
>>
>> This also fixes linker breakage when only PR KVM was selected (leaving
>> HV KVM off): the kvmppc_h_put_tce/kvmppc_h_stuff_tce functions
>> would not compile at all and the linked would fail.
>>
>> Signed-off-by: Alexey Kardashevskiy 
> 
> Acked-by: Paul Mackerras 
> 
> Paolo, will you take this directly, or do you want me to generate
> a pull request?

I can take it directly.

Paolo

Re: [PATCH] KVM: powerpc: kvmppc_visible_gpa can be boolean

2015-11-18 Thread Paolo Bonzini


On 16/11/2015 04:10, Yaowei Bai wrote:
> In another patch kvm_is_visible_gfn is maken return bool due to this
> function only returns zero or one as its return value, let's also make
> kvmppc_visible_gpa return bool to keep consistent.
> 
> No functional change.
> 
> Signed-off-by: Yaowei Bai 
> ---
>  arch/powerpc/kvm/book3s_pr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 64891b0..70fb08d 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -512,7 +512,7 @@ static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, 
> struct kvmppc_pte *pte)
>   put_page(hpage);
>  }
>  
> -static int kvmppc_visible_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
> +static bool kvmppc_visible_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
>  {
>   ulong mp_pa = vcpu->arch.magic_page_pa;
>  
> @@ -521,7 +521,7 @@ static int kvmppc_visible_gpa(struct kvm_vcpu *vcpu, 
> gpa_t gpa)
>  
>   gpa &= ~0xFFFULL;
>   if (unlikely(mp_pa) && unlikely((mp_pa & KVM_PAM) == (gpa & KVM_PAM))) {
> - return 1;
> + return true;
>   }
>  
>   return kvm_is_visible_gfn(vcpu->kvm, gpa >> PAGE_SHIFT);
> 

Applied all three patches to kvm/queue, thanks.

Paolo

Re: [PATCH v2 11/16] KVM: introduce a 'mmap' method for KVM devices

2019-02-26 Thread Paolo Bonzini
On 25/02/19 11:57, Cédric Le Goater wrote:
> Hello Paolo,
> 
> On 2/25/19 4:33 AM, David Gibson wrote:
>> On Fri, Feb 22, 2019 at 12:28:35PM +0100, Cédric Le Goater wrote:
>>> Some KVM devices will want to handle special mappings related to the
>>> underlying HW. For instance, the XIVE interrupt controller of the
>>> POWER9 processor has MMIO pages for thread interrupt management and
>>> for interrupt source control that need to be exposed to the guest when
>>> the OS has the required support.
>>>
>>> Signed-off-by: Cédric Le Goater 
>>
>> Ah, when I suggested mmap() on the base device fd, I hadn't realized
>> there wasn't a facility for that yet.
>>
>> Have you discussed this with Paolo?  
> 
> Not yet.
> 
>> We'll need some core KVM buy in to merge this.
> 
> Here is an extension of the KVM device to allow special mappings.
> Something we would need for the support of the POWER9 XIVE interrupt 
> controller.
> 
> There are two MMIOs we need to expose to the guest : 
> 
>  1. HW MMIO controlling of the interrupt presenter registers (TIMA)
>  2. HW MMIO of the interrupt sources for interrupt management (ESB)
> 
> The TIMA could have been exposed with a page offset in the vCPU mapping
> but as it only makes sense when the XIVE interrupt mode is active, we 
> chose to use directly the KVM device fd for that. Is that ok ? 
> 
> An alternate solution is to use a device ioctl to allocate an anon fd
> and do the mapping, but that seems like extra fuss for the same result. 

It's okay, it's a natural extension to dev_ops - but thanks for asking
anyway. :)

Paolo


Re: [PATCH v2 15/16] KVM: introduce a KVM_DESTROY_DEVICE ioctl

2019-03-15 Thread Paolo Bonzini
On 13/03/19 09:02, Cédric Le Goater wrote:
> The 'destroy' method is currently used to destroy all devices when the
> VM is destroyed after the vCPUs have been freed.
> 
> This new KVM ioctl exposes the same KVM device method. It acts as a
> software reset of the VM to 'destroy' selected devices when necessary
> and perform the required cleanups on the vCPUs. Called with the
> kvm->lock.
> 
> The 'destroy' method could be improved by returning an error code.
> 
> Signed-off-by: Cédric Le Goater 

Looks good to me.

Paolo


Re: [GIT PULL] Second batch of KVM changes for Linux 5.6-rc4 (or rc5)

2020-03-01 Thread Paolo Bonzini
On 01/03/20 22:33, Linus Torvalds wrote:
> On Sun, Mar 1, 2020 at 1:03 PM Paolo Bonzini  wrote:
>>
>> Paolo Bonzini (4):
>>   KVM: allow disabling -Werror
> 
> Honestly, this is just badly done.
> 
> You've basically made it enable -Werror only for very random
> configurations - and apparently the one you test.
> Doing things like COMPILE_TEST disables it, but so does not having
> EXPERT enabled.

Yes, I took this from the i915 Kconfig.  It's temporary, in 5.7 I am
planning to get it to just !KASAN, but for 5.6 I wanted to avoid more
breakage so I added the other restrictions.  The difference between
x86-64 and i386 is really just the frame size warnings, which Christoph
triggered because of a higher CONFIG_NR_CPUS.

(BTW, perhaps it makes sense for Sparse to have something like __nostack
for structs that contain potentially large arrays).

> I've merged this, but I wonder why you couldn't just do what I
> suggested originally?  Seriously, if you script your build tests,
> and don't even look at the results, then you might as well use
> 
>make KCFLAGS=-Werror

I did that and I'm also adding W=1; and I threw in a smaller than
default frame size warning option too because I don't want cpumasks on
the stack anyway.  However, that wouldn't help contributors.  I'm okay
if I get W=1 or frame size warnings from patches from other
contributors, but I think it's a disservice to them that they have to
set KCFLAGS in order to avoid warnings.

> the "now it causes problems for
> random compiler versions" is a real issue again - but at least it
> wouldn't be a random kernel subsystem that happens to trigger it, it
> would be a _generic_ issue, and we'd have everybody involved when a
> compiler change introduces a new warning.

Yes, and GCC prereleases are tested with Linux, for example by doing
full Rawhide rebuilds.  If we started using -Werror by default
(including allyesconfig), they would probably report warnings early.
Same for clang.

I hope that Linux can have -Werror everywhere, or at least a
CONFIG_WERROR option that does it even if it defaults to n for a release
or more.  But I don't think we can get there without first seeing what
issues pop up in a few subsystems or arches---even before considering
new compilers---so I decided I would just try.

Paolo

> Adding the powerpc people, since they have more history with their
> somewhat less hacky one. Except that one automatically gets disabled
> by "make allmodconfig" and friends, which is also kind of pointless.

> Michael, what tends to be the triggers for people using
> PPC_DISABLE_WERROR? Do you have reports for it? Could we have a
> _generic_ option that just gets enabled by default, except it gets
> disabled by _known_ issues (like KASAN).
> 
> Being disabled for "make allmodconfig" is kind of against one of the
> _points_ of "the build should be warning-free".



Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Paolo Bonzini
On 20/03/20 09:55, Davidlohr Bueso wrote:
> Only compiled and tested on x86.

It shows :) as the __KVM_HAVE_ARCH_WQP case is broken.  But no problem, 
Paul and I can pick this up and fix it.

This is missing:

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 506e4df2d730..6e5d85ba588d 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -78,7 +78,7 @@ struct kvmppc_vcore {
struct kvm_vcpu *runnable_threads[MAX_SMT_THREADS];
struct list_head preempt_list;
spinlock_t lock;
-   struct swait_queue_head wq;
+   struct rcuwait wait;
spinlock_t stoltb_lock; /* protects stolen_tb and preempt_tb */
u64 stolen_tb;
u64 preempt_tb;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1af96fb5dc6f..8c8122c30b89 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -754,7 +754,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
if (err)
goto out_vcpu_uninit;
 
-   vcpu->arch.wqp = &vcpu->wq;
+   vcpu->arch.waitp = &vcpu->wait;
kvmppc_create_vcpu_debugfs(vcpu, vcpu->vcpu_id);
return 0;
 

and...

> -static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu 
> *vcpu)
> +static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
>  {
>  #ifdef __KVM_HAVE_ARCH_WQP
> - return vcpu->arch.wqp;
> + return vcpu->arch.wait;

... this needs to be vcpu->arch.waitp.  That should be it.
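I.e. the helper would end up as:

	static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
	{
	#ifdef __KVM_HAVE_ARCH_WQP
		return vcpu->arch.waitp;
	#else
		return &vcpu->wait;
	#endif
	}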

Thanks!

Paolo

>  #else
> - return &vcpu->wq;
> + return &vcpu->wait;
>  #endif
>  }
>  
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 0d9438e9de2a..4be71cb58691 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -593,7 +593,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>   if (map.emul_ptimer)
>   soft_timer_cancel(&map.emul_ptimer->hrtimer);
>  
> - if (swait_active(kvm_arch_vcpu_wq(vcpu)))
> + if (rcu_dereference(kvm_arch_vpu_get_wait(vcpu)) != NULL)
>   kvm_timer_blocking(vcpu);
>  
>   /*
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index eda7b624eab8..4a704866e9b6 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -579,16 +579,17 @@ void kvm_arm_resume_guest(struct kvm *kvm)
>  
>   kvm_for_each_vcpu(i, vcpu, kvm) {
>   vcpu->arch.pause = false;
> - swake_up_one(kvm_arch_vcpu_wq(vcpu));
> + rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
>   }
>  }
>  
>  static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>  {
> - struct swait_queue_head *wq = kvm_arch_vcpu_wq(vcpu);
> + struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
>  
> - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
> -(!vcpu->arch.pause)));
> + rcuwait_wait_event(*wait,
> +(!vcpu->arch.power_off) && (!vcpu->arch.pause),
> +TASK_INTERRUPTIBLE);
>  
>   if (vcpu->arch.power_off || vcpu->arch.pause) {
>   /* Awaken to handle a signal, request we sleep again later. */
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 15e5b037f92d..10b533f641a6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,8 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>  
>   trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>  
> - if (swq_has_sleeper(&vcpu->wq))
> - swake_up_one(&vcpu->wq);
> + rcuwait_wake_up(&vcpu->wait);
>  
>   mmput(mm);
>   kvm_put_kvm(vcpu->kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70f03ce0e5c1..6b49dcb321e2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -343,7 +343,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct 
> kvm *kvm, unsigned id)
>   vcpu->kvm = kvm;
>   vcpu->vcpu_id = id;
>   vcpu->pid = NULL;
> - init_swait_queue_head(&vcpu->wq);
> + rcuwait_init(&vcpu->wait);
>   kvm_async_pf_vcpu_init(vcpu);
>  
>   vcpu->pre_pcpu = -1;
> @@ -2465,9 +2465,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  {
>   ktime_t start, cur;
> - DECLARE_SWAITQUEUE(wait);
> - bool waited = false;
>   u64 block_ns;
> + int block_check = -EINTR;
>  
>   kvm_arch_vcpu_blocking(vcpu);
>  
> @@ -2487,21 +2486,14 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>   ++vcpu->stat.halt_poll_invalid;
>   goto out;
>   }
> +
>   cur = ktime_get();
>   } while (single_task_running() && ktime_before(cur, stop));
>   }
>  
> - for (;;) {
> - prepare_to_swait_exclusive(&vcpu->wq, &wait, 
> TASK_INTERRUPTIBLE);
> -
> - if (kvm_vcpu_

Re: [PATCH 1/1] powerpc/kvm/book3s: Fixes possible 'use after release' of kvm

2019-11-27 Thread Paolo Bonzini
On 26/11/19 18:52, Leonardo Bras wrote:
> Fixes a possible 'use after free' of kvm variable.
> It does use mutex_unlock(&kvm->lock) after possible freeing a variable
> with kvm_put_kvm(kvm).
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 3 +--
>  virt/kvm/kvm_main.c  | 8 
>  2 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 5834db0a54c6..a402ead833b6 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -316,14 +316,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  
>   if (ret >= 0)
>   list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
> - else
> - kvm_put_kvm(kvm);
>  
>   mutex_unlock(&kvm->lock);
>  
>   if (ret >= 0)
>   return ret;
>  
> + kvm_put_kvm(kvm);
>   kfree(stt);
>   fail_acct:
>   account_locked_vm(current->mm, kvmppc_stt_pages(npages), false);

This part is a good change, as it makes the code clearer.  The
virt/kvm/kvm_main.c bits, however, are not necessary as explained by Sean.

Paolo

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13efc291b1c7..f37089b60d09 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2744,10 +2744,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, 
> u32 id)
>   /* Now it's all set up, let userspace reach it */
>   kvm_get_kvm(kvm);
>   r = create_vcpu_fd(vcpu);
> - if (r < 0) {
> - kvm_put_kvm(kvm);
> + if (r < 0)
>   goto unlock_vcpu_destroy;
> - }
>  
>   kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
>  
> @@ -2771,6 +2769,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, 
> u32 id)
>   mutex_lock(&kvm->lock);
>   kvm->created_vcpus--;
>   mutex_unlock(&kvm->lock);
> + if (r < 0)
> + kvm_put_kvm(kvm);
>   return r;
>  }
>  
> @@ -3183,10 +3183,10 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
>   kvm_get_kvm(kvm);
>   ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | 
> O_CLOEXEC);
>   if (ret < 0) {
> - kvm_put_kvm(kvm);
>   mutex_lock(&kvm->lock);
>   list_del(&dev->vm_node);
>   mutex_unlock(&kvm->lock);
> + kvm_put_kvm(kvm);
>   ops->destroy(dev);
>   return ret;
>   }
> 



Re: [PATCH] kvm_host: unify VM_STAT and VCPU_STAT definitions in a single place

2020-04-14 Thread Paolo Bonzini
On 13/04/20 23:34, Philippe Mathieu-Daudé wrote:
>> +#define VM_STAT(x, ...) offsetof(struct kvm, stat.x), KVM_STAT_VM, ## 
>> __VA_ARGS__
>> +#define VCPU_STAT(x, ...) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU, 
>> ## __VA_ARGS__
> I find this macro expanding into multiple fields odd... Maybe a matter
> of taste. Sugggestion, have the macro define the full structure, as in
> the arm64 arch:
> 
> #define VM_STAT(n, x, ...) { n, offsetof(struct kvm, stat.x),
> KVM_STAT_VM, ## __VA_ARGS__ }
> 
> Ditto for VCPU_STAT().
> 

Yes, that's a good idea.  Emanuele, can you switch it to this format?
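
For reference, the x86 table entries would then read along these lines
(just an illustration with a couple of existing stats):

	static struct kvm_stats_debugfs_item debugfs_entries[] = {
		VM_STAT("remote_tlb_flush", remote_tlb_flush),
		VCPU_STAT("halt_exits", halt_exits),
		...
	};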

Thanks,

Paolo



Re: [PATCH v2] KVM: Optimize kvm_arch_vcpu_ioctl_run function

2020-04-16 Thread Paolo Bonzini
On 16/04/20 07:10, Tianjia Zhang wrote:
> In earlier versions of kvm, 'kvm_run' is an independent structure
> and is not included in the vcpu structure. At present, 'kvm_run'
> is already included in the vcpu structure, so the parameter
> 'kvm_run' is redundant.
> 
> This patch simplify the function definition, removes the extra
> 'kvm_run' parameter, and extract it from the 'kvm_vcpu' structure
> if necessary.
> 
> Signed-off-by: Tianjia Zhang 
> ---
> 
> v2 change:
>   remove 'kvm_run' parameter and extract it from 'kvm_vcpu'
> 
>  arch/mips/kvm/mips.c   |  3 ++-
>  arch/powerpc/kvm/powerpc.c |  3 ++-
>  arch/s390/kvm/kvm-s390.c   |  3 ++-
>  arch/x86/kvm/x86.c | 11 ++-
>  include/linux/kvm_host.h   |  2 +-
>  virt/kvm/arm/arm.c |  6 +++---
>  virt/kvm/kvm_main.c|  2 +-
>  7 files changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 8f05dd0a0f4e..ec24adf4857e 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -439,8 +439,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
> *vcpu,
>   return -ENOIOCTLCMD;
>  }
>  
> -int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> + struct kvm_run *run = vcpu->run;
>   int r = -EINTR;
>  
>   vcpu_load(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index e15166b0a16d..7e24691e138a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -1764,8 +1764,9 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, 
> struct kvm_one_reg *reg)
>   return r;
>  }
>  
> -int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> + struct kvm_run *run = vcpu->run;
>   int r;
>  
>   vcpu_load(vcpu);
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 19a81024fe16..443af3ead739 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -4333,8 +4333,9 @@ static void store_regs(struct kvm_vcpu *vcpu, struct 
> kvm_run *kvm_run)
>   store_regs_fmt2(vcpu, kvm_run);
>  }
>  
> -int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> + struct kvm_run *kvm_run = vcpu->run;
>   int rc;
>  
>   if (kvm_run->immediate_exit)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3bf2ecafd027..a0338e86c90f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8707,8 +8707,9 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>   trace_kvm_fpu(0);
>  }
>  
> -int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> + struct kvm_run *kvm_run = vcpu->run;
>   int r;
>  
>   vcpu_load(vcpu);
> @@ -8726,18 +8727,18 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
> struct kvm_run *kvm_run)
>   r = -EAGAIN;
>   if (signal_pending(current)) {
>   r = -EINTR;
> - vcpu->run->exit_reason = KVM_EXIT_INTR;
> + kvm_run->exit_reason = KVM_EXIT_INTR;
>   ++vcpu->stat.signal_exits;
>   }
>   goto out;
>   }
>  
> - if (vcpu->run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
> + if (kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
>   r = -EINVAL;
>   goto out;
>   }
>  
> - if (vcpu->run->kvm_dirty_regs) {
> + if (kvm_run->kvm_dirty_regs) {
>   r = sync_regs(vcpu);
>   if (r != 0)
>   goto out;
> @@ -8767,7 +8768,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
> struct kvm_run *kvm_run)
>  
>  out:
>   kvm_put_guest_fpu(vcpu);
> - if (vcpu->run->kvm_valid_regs)
> + if (kvm_run->kvm_valid_regs)
>   store_regs(vcpu);
>   post_kvm_run_save(vcpu);
>   kvm_sigset_deactivate(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6d58beb65454..1e17ef719595 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -866,7 +866,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
>   struct kvm_mp_state *mp_state);
>  int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>   struct kvm_guest_debug *dbg);
> -int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
> +int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
>  
>  int kvm_arch_init(void *opaque);
>  void kvm_arch_exit(void);
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 48d0ec44ad77..f5390ac2165b 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -639,7 +639,6 @@ static void 

Re: [PATCH 6/8] simplefs: add file creation functions

2020-04-20 Thread Paolo Bonzini
On 20/04/20 16:28, Greg Kroah-Hartman wrote:
>> I assume you meant a new file. These new functions are used only by a few
>> filesystems, and I didn't want to include them in vmlinux unconditionally,
>> so I introduced simplefs.c and CONFIG_SIMPLEFS instead of extending libfs.c.
>> In this way only fs that need this code like debugfs and tracefs will load
>> it.
> Nothing "loads it", why not just make these libfs functions instead?  As
> the difference between the two is not obvious at all, please don't make
> things confusing.

I think Emanuele meant "will link it" not "will load it".

Emanuele, you can just move everything to libfs.c and get rid of
CONFIG_SIMPLEFS too.  "Do less" is not an offer you want to turn down!

Thanks,

Paolo



Re: [PATCH v2 6/7] debugfs: switch to simplefs inode creation API

2020-04-27 Thread Paolo Bonzini
On 21/04/20 15:57, Emanuele Giuseppe Esposito wrote:
> - inode = debugfs_get_inode(dentry->d_sb);

You're not removing debugfs_get_inode so I think you're going to get a
warning (same in tracefs)?

You can wait a few more days for reviews and/or Acked-bys (especially
for patches 6 and 7) and then post v3.

Since the touch-everything patch (#2) has already been reviewed, and
it's mechanical and not introducing any semantic change, you can
probably reduce the To/Cc list to filesystem, debugfs and tracefs
maintainers.

Thanks,

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-05 Thread Paolo Bonzini
On 05/05/20 18:53, Jim Mattson wrote:
>>> Since this is becoming a generic API (good!!), maybe we can discuss
>>> possible ways to optimize gathering of stats in mass?
>> Sure, the idea of a binary format was considered from the beginning in
>> [1], and it can be done either together with the current filesystem, or
>> as a replacement via different mount options.
> 
> ASCII stats are not scalable. A binary format is definitely the way to go.

I am totally in favor of having a binary format, but it should be
introduced as a separate series on top of this one---and preferably by
someone who has already put some thought into the problem (which
Emanuele and I have not, beyond ensuring that the statsfs concept and
API is flexible enough).

ASCII stats are necessary for quick userspace consumption and for
backwards compatibility with KVM debugfs (which is not an ABI, but it's
damn useful and should not be dropped without providing something as
handy), so this is what this series starts from.

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-05 Thread Paolo Bonzini
On 05/05/20 19:07, David Rientjes wrote:
>> I am totally in favor of having a binary format, but it should be
>> introduced as a separate series on top of this one---and preferably by
>> someone who has already put some thought into the problem (which
>> Emanuele and I have not, beyond ensuring that the statsfs concept and
>> API is flexible enough).
>>
> The concern is that once this series is merged then /sys/kernel/stats 
> could be considered an ABI and there would be a reasonable expectation 
> that it will remain stable, in so far as the stats that userspace is 
> interested in are stable and not obsoleted.
> 
> So is this a suggestion that the binary format becomes complementary to 
> statsfs and provide a means for getting all stats from a single subsystem, 
> or that this series gets converted to such a format before it is merged?

The binary format should be complementary.  The ASCII format should
indeed be considered stable even though individual statistics would come
and go.  It may make sense to allow disabling ASCII files via mount
and/or Kconfig options; but either way, the binary format can and should
be added on top.

I have not put any thought into what the binary format would look like
and what its features would be.  For example these are but the first
questions that come to mind:

* would it be possible to read/clear an arbitrary statistic with
pread/pwrite, or do you have to read all of them?

* if userspace wants to read the schema just once and then read the
statistics many times, how is it informed of schema changes?

* and of course the details of how the schema (names of stat and
subsources) is encoded and what details it should include about the
values (e.g. type or just signedness).

Another possibility is to query stats via BPF.  This could be a third
way to access the stats, or it could be alternative to a binary format.

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-08 Thread Paolo Bonzini
[Answering for Emanuele because he's not available until Monday]

On 07/05/20 19:45, Jonathan Adams wrote:
> This is good work.  As David Rientjes mentioned, I'm currently investigating
> a similar project, based on a google-internal debugfs-based FS we call
> "metricfs".  It's
> designed in a slightly different fashion than statsfs here is, and the
> statistics exported are
> mostly fed into our OpenTelemetry-like system.  We're motivated by
> wanting an upstreamed solution, so that we can upstream the metrics we
> create that are of general interest, and lower the overall rebasing
> burden for our tree.

Cool.  We included a public reading API exactly so that there could be
other "frontends".  I was mostly thinking of BPF as an in-tree user, but
your metricfs could definitely use the reading API.

>  - the 8/16/32/64 signed/unsigned integers seems like a wart, and the
> built-in support to grab any offset from a structure doesn't seem like
> much of an advantage. A simpler interface would be to just support an
> "integer" (possibly signed/unsigned) type, which is always 64-bit, and
> allow the caller to provide a function pointer to retrieve the value,
> with one or two void *s cbargs.  Then the framework could provide an
> offset-based callback (or callbacks) similar to the existing
> functionality, and a similar one for per-CPU based statistics.  A
> second "clear" callback could be optionally provided to allow for
> statistics to be cleared, as in your current proposal.

Ok, so basically splitting get_simple_value into many separate
callbacks.  The callbacks would be in a struct like

struct stats_fs_type {
	uint64_t (*get)(struct stats_fs_value *, void *);
	void (*clear)(struct stats_fs_value *, void *);
	bool is_signed;		/* "signed" itself is a reserved word */
};

static uint64_t stats_fs_get_u8(struct stats_fs_value *val, void *base)
{
	return *((uint8_t *)(base + (uintptr_t)val->arg));
}

static void stats_fs_clear_u8(struct stats_fs_value *val, void *base)
{
	*((uint8_t *)(base + (uintptr_t)val->arg)) = 0;
}

struct stats_fs_type stats_fs_type_u8 = {
	stats_fs_get_u8,
	stats_fs_clear_u8,
	false
};

and custom types can be defined using "&(struct stats_fs_type) {...}".
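
For example, a hypothetical one-off type (names made up, only to show the
compound-literal form):

	/* expose a nanosecond counter as microseconds */
	static uint64_t stats_fs_get_ns_as_us(struct stats_fs_value *val, void *base)
	{
		return *((uint64_t *)(base + (uintptr_t)val->arg)) / NSEC_PER_USEC;
	}

	#define STATS_FS_TYPE_NS_AS_US \
		(&(struct stats_fs_type) { stats_fs_get_ns_as_us, NULL, false })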

>  - Beyond the statistic's type, one *very* useful piece of metadata
> for telemetry tools is knowing whether a given statistic is
> "cumulative" (an unsigned counter which is only ever increased), as
> opposed to a floating value (like "amount of memory used").

Good idea.  Also, clearing does not make sense for a floating value, so
we can use cumulative/floating to get a default for the mode: KVM
statistics for example are mostly cumulative and mode 644, except a few
that are floating and those are all mode 444.  Therefore it makes sense
to add cumulative/floating even before outputting it as metadata.
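Roughly like this (the flag and helper are hypothetical, just to show the
idea):

	#define STATS_FS_FLOATING	BIT(0)	/* not cumulative; cannot be cleared */

	static umode_t stats_fs_default_mode(struct stats_fs_value *val)
	{
		return (val->flags & STATS_FS_FLOATING) ? 0444 : 0644;
	}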

> I'm more
> concerned with getting the statistics model and capabilities right
> from the beginning, because those are harder to adjust later.

Agreed.

> 1. Each metricfs metric can have one or two string or integer "keys".
> If these exist, they expand the metric from a single value into a
> multi-dimensional table. For example, we use this to report a hash
> table we keep of functions calling "WARN()", in a 'warnings'
> statistic:
> 
> % cat .../warnings/values
> x86_pmu_stop 1
> %
>
> Indicates that the x86_pmu_stop() function has had a WARN() fire once
> since the system was booted.  If multiple functions have fired
> WARN()s, they are listed in this table with their own counts. [1]  We
> also use these to report per-CPU counters on a CPU-by-CPU basis:
> 
> % cat .../irq_x86/NMI/values
> 0 42
> 1 18
> ... one line per cpu
> % cat .../rx_bytes/values
> lo 501360681
> eth0 1457631256

These seem like two different things.

The percpu and per-interface values are best represented as subordinate
sources, one per CPU and one per interface.  For interfaces I would just
use a separate directory, but it doesn't really make sense for CPUs.  So
if we can cater for it in the model, it's better.  For example:

- add a new argument to statsfs_create_source and statsfs_create_values
that makes it not create directories and files respectively.

- add a new "aggregate function" STATS_FS_LIST that directs the parent
to build a table of all the simple values below it

We can also add a helper statsfs_add_values_percpu that creates a new
source for each CPU, I think.
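Something like the following, as a rough sketch (the exact signatures don't
exist yet; this is only meant to show the shape, with the new "don't create
a dentry" flag as the extra bool argument):

	void statsfs_add_values_percpu(struct statsfs_source *parent,
				       struct statsfs_value *values,
				       void __percpu *base)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct statsfs_source *src;

			/* subordinate source, but no directory of its own */
			src = statsfs_create_source(parent, false, "cpu%d", cpu);
			/* values aggregated by the parent, no files either */
			statsfs_create_values(src, values,
					      per_cpu_ptr(base, cpu), false);
		}
	}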

The warnings one instead is a real hash table.  It should be possible to
implement it as some kind of customized aggregation, that is implemented
in the client instead of coming from subordinate sources.  The
presentation can then just use STATS_FS_LIST.  I don't see anything in
the design that is a blocker.

> 2.  We also export some metadata about each statistic.  For example,
> the metadata for the NMI counter above looks like:
> 
> % cat .../NMI/annotations
> DESCRIPTION Non-maskable\ interrupts
> CUMULATIVE
> % cat .../NMI/fields
> cpu value
> int int
> %

Good i

Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

2020-05-11 Thread Paolo Bonzini
Hi Jonathan, I think the remaining sticky point is this one:

On 11/05/20 19:02, Jonathan Adams wrote:
> I think I'd characterize this slightly differently; we have a set of
> statistics which are essentially "in parallel":
> 
>   - a variety of statistics, N CPUs they're available for, or
>   - a variety of statistics, N interfaces they're available for.
>   - a variety of statistics, N kvm object they're available for.
> 
> Recreating a parallel hierarchy of statistics any time we add/subtract
> a CPU or interface seems like a lot of overhead.  Perhaps a better 
> model would be some sort of "parameter enumn" (naming is hard;
> parameter set?), so when a CPU/network interface/etc is added you'd
> add its ID to the "CPUs" we know about, and at removal time you'd
> take it out; it would have an associated cbarg for the value getting
> callback.
> 
>> Yep, the above "not create a dentry" flag would handle the case where
>> you sum things up in the kernel because the more fine grained counters
>> would be overwhelming.
>
> nodnod; or the callback could handle the sum itself.

In general for statsfs we took a more explicit approach where each
addend in a sum is a separate stats_fs_source.  In this version of the
patches it's also a directory, but we'll take your feedback and add both
the ability to hide directories (first) and to list values (second).

So, in the cases of interfaces and KVM objects I would prefer to keep
each addend separate.

For CPUs that however would be pretty bad.  Many subsystems might
accumulate stats percpu for performance reasons, which would then be
exposed as the sum (usually).  So yeah, native handling of percpu values
makes sense.  I think it should fit naturally into the same custom
aggregation framework as hash table keys, we'll see if there's any devil
in the details.

Core kernel stats such as /proc/interrupts or /proc/stat are the
exception here, since individual per-CPU values can be vital for
debugging.  For those, creating a source per stat, possibly on-the-fly
at hotplug/hot-unplug time because NR_CPUS can be huge, would still be
my preferred way to do it.

Thanks,

Paolo



Re: [PATCH V4 2/15] KVM/MMU: Add tlb flush with range helper function

2018-10-15 Thread Paolo Bonzini
On 14/10/2018 10:16, Thomas Gleixner wrote:
>>> +static inline bool kvm_available_flush_tlb_with_range(void)
>>> +{
>>> +   return kvm_x86_ops->tlb_remote_flush_with_range;
>>> +}
>> Seems that kvm_available_flush_tlb_with_range() is not used in this patch…
> What's wrong with that? 
> 
> It provides the implementation and later patches make use of it. It's a
> sensible way to split patches into small, self contained entities.

That's true; on the other hand I do have a concern with this patch:
this series is not bisectable at all, because all the new code is dead
until the very last patch.  Uses of the new feature should come _after_
the implementation.

I don't have any big problem with what Liran pointed out (and I can live
with the unused static functions that would warn with -Wunused, too),
but the above should be fixed in v5, basically by moving patches 12-15
at the beginning of the series.

Paolo


Re: [PATCH V4 00/15] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM

2018-10-15 Thread Paolo Bonzini
On 13/10/2018 16:53, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> For nested memory virtualization, Hyper-v doesn't set write-protect
> L1 hypervisor EPT page directory and page table node to track changes 
> while it relies on guest to tell it changes via HvFlushGuestAddressLlist
> hypercall. HvFlushGuestAddressLlist hypercall provides a way to flush
> EPT page table with ranges which are specified by L1 hypervisor.
> 
> If L1 hypervisor uses INVEPT or HvFlushGuestAddressSpace hypercall to
> flush EPT tlb, Hyper-V will invalidate associated EPT shadow page table
> and sync L1's EPT table when next EPT page fault is triggered.
> HvFlushGuestAddressLlist hypercall helps to avoid such redundant EPT
> page fault and synchronization of shadow page table.

So I just told you that the first part is well understood but I must
retract that; after carefully reviewing the whole series, I think one of
us is actually very confused.

I am not afraid to say it can be me, but my understanding is that you're
passing L1 GPAs to the hypercall and instead the spec says it expects L2
GPAs.  (Consider that, because KVM's shadow paging code is shared
between nested EPT and !EPT cases, every time you see gpa/gfn in the
code it is for L1, while nested EPT GPAs are really what the code calls
gva.)

What's going on?

Paolo


Re: [Resend PATCH V5 0/10] x86/KVM/Hyper-v: Add HV ept tlb range flush hypercall support in KVM

2018-12-21 Thread Paolo Bonzini
On 06/12/18 14:21, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> For nested memory virtualization, Hyper-v doesn't set write-protect
> L1 hypervisor EPT page directory and page table node to track changes 
> while it relies on guest to tell it changes via HvFlushGuestAddressLlist
> hypercall. HvFlushGuestAddressLlist hypercall provides a way to flush
> EPT page table with ranges which are specified by L1 hypervisor.
> 
> If L1 hypervisor uses INVEPT or HvFlushGuestAddressSpace hypercall to
> flush EPT tlb, Hyper-V will invalidate associated EPT shadow page table
> and sync L1's EPT table when next EPT page fault is triggered.
> HvFlushGuestAddressLlist hypercall helps to avoid such redundant EPT
> page fault and synchronization of shadow page table.
> 
> This patchset is based on the Patch "KVM/VMX: Check ept_pointer before
> flushing ept tlb"(https://marc.info/?l=kvm&m=154408169705686&w=2).
> 
> Change since v4:
>1) Split flush address and flush list patches. This patchset only 
> contains
>flush address patches. Will post flush list patches later.
>2) Expose function hyperv_fill_flush_guest_mapping_list()
>out of hyperv file
>3) Adjust parameter of hyperv_flush_guest_mapping_range()
>4) Reorder patchset and move Hyper-V and VMX changes ahead.
> 
> Change since v3:
> 1) Remove code of updating "tlbs_dirty" in 
> kvm_flush_remote_tlbs_with_range()
> 2) Remove directly tlb flush in the kvm_handle_hva_range()
> 3) Move tlb flush in kvm_set_pte_rmapp() to 
> kvm_mmu_notifier_change_pte()
> 4) Combine Vitaly's "don't pass EPT configuration info to
> vmx_hv_remote_flush_tlb()" fix
> 
> Change since v2:
>1) Fix comment in the kvm_flush_remote_tlbs_with_range()
>2) Move HV_MAX_FLUSH_PAGES and HV_MAX_FLUSH_REP_COUNT to
> hyperv-tlfs.h.
>3) Calculate HV_MAX_FLUSH_REP_COUNT in the macro definition
>4) Use HV_MAX_FLUSH_REP_COUNT to define length of gpa_list in
> struct hv_guest_mapping_flush_list.
> 
> Change since v1:
>1) Convert "end_gfn" of struct kvm_tlb_range to "pages" in order
>   to avoid confusion as to whether "end_gfn" is inclusive or exlusive.
>2) Add hyperv tlb range struct and replace kvm tlb range struct
>   with new struct in order to avoid using kvm struct in the hyperv
>   code directly.
> 
> 
> 
> Lan Tianyu (10):
>   KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
>   x86/hyper-v: Add HvFlushGuestAddressList hypercall support
>   x86/Hyper-v: Add trace in the
> hyperv_nested_flush_guest_mapping_range()
>   KVM/VMX: Add hv tlb range flush support
>   KVM/MMU: Add tlb flush with range helper function
>   KVM: Replace old tlb flush function with new one to flush a specified
> range.
>   KVM: Make kvm_set_spte_hva() return int
>   KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to
> kvm_mmu_notifier_change_pte()
>   KVM/MMU: Flush tlb directly in the kvm_set_pte_rmapp()
>   KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
> 
>  arch/arm/include/asm/kvm_host.h |  2 +-
>  arch/arm64/include/asm/kvm_host.h   |  2 +-
>  arch/mips/include/asm/kvm_host.h|  2 +-
>  arch/mips/kvm/mmu.c |  3 +-
>  arch/powerpc/include/asm/kvm_host.h |  2 +-
>  arch/powerpc/kvm/book3s.c   |  3 +-
>  arch/powerpc/kvm/e500_mmu_host.c|  3 +-
>  arch/x86/hyperv/nested.c| 80 +++
>  arch/x86/include/asm/hyperv-tlfs.h  | 32 +
>  arch/x86/include/asm/kvm_host.h |  9 +++-
>  arch/x86/include/asm/mshyperv.h | 15 ++
>  arch/x86/include/asm/trace/hyperv.h | 14 ++
>  arch/x86/kvm/mmu.c  | 96 
> +
>  arch/x86/kvm/paging_tmpl.h  |  3 +-
>  arch/x86/kvm/vmx.c  | 63 +---
>  virt/kvm/arm/mmu.c  |  6 ++-
>  virt/kvm/kvm_main.c |  5 +-
>  17 files changed, 292 insertions(+), 48 deletions(-)
> 

Queued, thanks.

Paolo


Re: [Resend PATCH V5 2/10] x86/hyper-v: Add HvFlushGuestAddressList hypercall support

2018-12-21 Thread Paolo Bonzini
On 06/12/18 14:21, lantianyu1...@gmail.com wrote:
>  static inline int hyperv_flush_guest_mapping(u64 as) { return -1; }
> +static inline int hyperv_flush_guest_mapping_range(u64 as,
> + hyperv_fill_flush_list_func fill_func, void *data);
> +{
> + return -1;

This part for !IS_ENABLED(CONFIG_HYPERV) does not compile.
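At a minimum the stub has a stray ';' at the end of the parameter list,
which turns the body into an orphaned block; it would need to be:

	static inline int hyperv_flush_guest_mapping_range(u64 as,
		hyperv_fill_flush_list_func fill_func, void *data)
	{
		return -1;
	}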

No big deal, but please add that to your testing procedures.

Paolo


Re: [PATCH 6/11] KVM/MMU: Flush tlb with range list in sync_page()

2019-01-07 Thread Paolo Bonzini
On 04/01/19 17:30, Sean Christopherson wrote:
>> +
>> +if (kvm_available_flush_tlb_with_range()
>> +&& (tmp_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)) {
>> +struct kvm_mmu_page *leaf_sp = page_header(sp->spt[i]
>> +& PT64_BASE_ADDR_MASK);
>> +list_add(&leaf_sp->flush_link, &flush_list);
>> +}
>> +
>> +set_spte_ret |= tmp_spte_ret;
>> +
>>  }
>>  
>>  if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
>> -kvm_flush_remote_tlbs(vcpu->kvm);
>> +kvm_flush_remote_tlbs_with_list(vcpu->kvm, &flush_list);
> This is a bit confusing and potentially fragile.  It's not obvious that
> kvm_flush_remote_tlbs_with_list() is guaranteed to call
> kvm_flush_remote_tlbs() when kvm_available_flush_tlb_with_range() is
> false, and you're relying on the kvm_flush_remote_tlbs_with_list() call
> chain to never optimize away the empty list case.  Rechecking
> kvm_available_flush_tlb_with_range() isn't expensive.
> 

Alternatively, do not check it during the loop: always build the
flush_list, and always call kvm_flush_remote_tlbs_with_list.  The
function can then check whether the list is empty, and the OR of
tmp_spte_ret on every iteration goes away.
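I.e. something along these lines (sketch only, assuming
kvm_flush_remote_tlbs_with_list itself handles both the empty-list case
and the fallback when range flushing is unavailable):

	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
		...
		if (set_spte(...) & SET_SPTE_NEED_REMOTE_TLB_FLUSH) {
			struct kvm_mmu_page *leaf_sp =
				page_header(sp->spt[i] & PT64_BASE_ADDR_MASK);

			list_add(&leaf_sp->flush_link, &flush_list);
		}
	}

	kvm_flush_remote_tlbs_with_list(vcpu->kvm, &flush_list);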

Paolo


Re: [PATCH 7/11] KVM: Remove redundant check in the kvm_get_dirty_log_protect()

2019-01-07 Thread Paolo Bonzini
On 04/01/19 16:50, Sean Christopherson wrote:
> Tangentially related, does mmu_lock actually need to be held while we
> walk dirty_bitmap in kvm_{clear,get}_dirty_log_protect()?  The bitmap
> itself is protected by slots_lock (a lockdep assertion would be nice
> too), e.g. can we grab the lock iff dirty_bitmap[i] != 0?

Yes, we could avoid grabbing it as long as the bitmap is zero.  However,
without kvm->manual_dirty_log_protect, the granularity of
kvm_get_dirty_log_protect() is too coarse so it won't happen in
practice.  Instead, with the new manual clear,
kvm_get_dirty_log_protect() does not take the lock and a well-written
userspace is not going to call the clear ioctl unless some bits are set.

Paolo


Re: [PATCH 9/11] KVM/MMU: Flush tlb in the kvm_mmu_write_protect_pt_masked()

2019-01-07 Thread Paolo Bonzini
On 04/01/19 09:54, lantianyu1...@gmail.com wrote:
>   rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + 
> __ffs(mask),
> PT_PAGE_TABLE_LEVEL, slot);
> - __rmap_write_protect(kvm, rmap_head, false);
> + flush |= __rmap_write_protect(kvm, rmap_head, false);
>  
>   /* clear the first set bit */
>   mask &= mask - 1;
>   }
> +
> + if (flush && kvm_available_flush_tlb_with_range()) {
> + kvm_flush_remote_tlbs_with_address(kvm,
> + slot->base_gfn + gfn_offset,
> + hweight_long(mask));

Mask is zero here, so this probably won't work.
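
Just to illustrate: the while loop above consumes the mask bit by bit, so
by the time you get here it is always zero.  You would have to snapshot it
before the loop, e.g. (sketch only, and it doesn't address the efficiency
problem below):

	unsigned long orig_mask = mask;
	...
	if (flush && kvm_available_flush_tlb_with_range())
		kvm_flush_remote_tlbs_with_address(kvm,
				slot->base_gfn + gfn_offset + __ffs(orig_mask),
				__fls(orig_mask) - __ffs(orig_mask) + 1);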

In addition, I suspect calling the hypercall once for every 64 pages is
not very efficient.  Passing a flush list into
kvm_mmu_write_protect_pt_masked, and flushing in
kvm_arch_mmu_enable_log_dirty_pt_masked, isn't efficient either because
kvm_arch_mmu_enable_log_dirty_pt_masked is also called once per word.

I don't have any good ideas, except for moving the whole
kvm_clear_dirty_log_protect loop into architecture-specific code (which
is not the direction we want---architectures should share more code, not
less).

Paolo

> + flush = false;
> + }
> +



Re: [PATCH 3/11] KVM: Add spte's point in the struct kvm_mmu_page

2019-01-07 Thread Paolo Bonzini
On 04/01/19 09:53, lantianyu1...@gmail.com wrote:
> @@ -332,6 +332,7 @@ struct kvm_mmu_page {
>   int root_count;  /* Currently serving as active root */
>   unsigned int unsync_children;
>   struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> + u64 *sptep;

Is this really needed?  Can we put the "last" flag in the struct instead
as a bool?  In fact, if you do

u16 unsync_children;
bool unsync;
bool last_level;

the struct does not grow at all. :)

(I'm not sure where "large" is tested using the sptep field, even though
it is in the commit message).

Paolo

>   /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
>   unsigned long mmu_valid_gen;



Re: [PATCH 4/11] KVM/MMU: Introduce tlb flush with range list

2019-01-07 Thread Paolo Bonzini
On 04/01/19 09:53, lantianyu1...@gmail.com wrote:
>  struct kvm_mmu_page {
>   struct list_head link;
> +
> + /*
> +  * Tlb flush with range list uses struct kvm_mmu_page as list entry
> +  * and all list operations should be under protection of mmu_lock.
> +  */
> + struct list_head flush_link;
>   struct hlist_node hash_link;
>   bool unsync;
>  
> @@ -443,6 +449,7 @@ struct kvm_mmu {

Again, it would be nice not to grow the struct too much, though I
understand that it's already relatively big (168 bytes).

Can you at least make this an hlist, so that it only takes a single word?

Paolo


Re: [PATCH V2 3/10] KVM/MMU: Add last_level in the struct mmu_spte_page

2019-02-14 Thread Paolo Bonzini
On 02/02/19 02:38, lantianyu1...@gmail.com wrote:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ce770b446238..70cafd3f95ab 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2918,6 +2918,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  
>   if (level > PT_PAGE_TABLE_LEVEL)
>   spte |= PT_PAGE_SIZE_MASK;
> +
> + sp->last_level = is_last_spte(spte, level);

sp->last_level is always true here.

Paolo

>   if (tdp_enabled)
>   spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
>   kvm_is_mmio_pfn(pfn));



Re: [PATCH V2 3/10] KVM/MMU: Add last_level in the struct mmu_spte_page

2019-02-14 Thread Paolo Bonzini
On 02/02/19 02:38, lantianyu1...@gmail.com wrote:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index ce770b446238..70cafd3f95ab 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2918,6 +2918,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  
>   if (level > PT_PAGE_TABLE_LEVEL)
>   spte |= PT_PAGE_SIZE_MASK;
> +
> + sp->last_level = is_last_spte(spte, level);

Wait, I wasn't thinking straight.  If a struct kvm_mmu_page exists, it
is never the last level.  Page table entries for the last level do not
have a struct kvm_mmu_page.

Therefore you don't need the flag after all.  I suspect your
calculations in patch 2 are off by one, and you actually need

hlist_for_each_entry(sp, range->flush_list, flush_link) {
int pages = KVM_PAGES_PER_HPAGE(sp->role.level + 1);
...
}

For example, if sp->role.level is 1 then the struct kvm_mmu_page is for
a page containing PTEs and covers an area of 2 MiB.

Thanks,

Paolo

>   if (tdp_enabled)
>   spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
>   kvm_is_mmio_pfn(pfn));



Re: [PATCH V2 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM

2019-02-14 Thread Paolo Bonzini
On 02/02/19 02:38, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> This patchset is to introduce hv ept tlb range list flush function
> support in the KVM MMU component. Flushing ept tlbs of several address
> range can be done via single hypercall and new list flush function is
> used in the kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset
> also adds more hv ept tlb range flush support in more KVM MMU function.
> 
> Change since v1:
>1) Make flush list as a hlist instead of list in order to 
>keep struct kvm_mmu_page size.
>2) Add last_level flag in the struct kvm_mmu_page instead
>of spte pointer
>3) Move tlb flush from kvm_mmu_notifier_clear_flush_young() to 
> kvm_age_hva()
>4) Use range flush in the kvm_vm_ioctl_get/clear_dirty_log()

Looks good except for the confusion on sp->last_level (which was mostly
mine---sorry about that).  I think it can still make 5.1.

However, note that I've never received "KVM/MMU: Use tlb range flush  in
the kvm_age_hva()".

Paolo


Re: [PATCH V2 3/10] KVM/MMU: Add last_level in the struct mmu_spte_page

2019-02-15 Thread Paolo Bonzini
On 15/02/19 16:05, Tianyu Lan wrote:
> Yes, you are right. Thanks to point out and will fix. The last_level
> flag is to avoid adding middle page node(e.g, PGD, PMD)
> into flush list. The address range will be duplicated if adding both
> leaf, node and middle node into flush list.

Hmm, that's not easy to track.  One kvm_mmu_page could include both leaf
and non-leaf page (for example a huge page for 0 to 2 MB and a page
table for 2 MB to 4 MB).

Is this really needed?  First, your benchmarks so far have been done
with sp->last_level always set to true.  Second, you would only
encounter this optimization in kvm_mmu_commit_zap_page when zapping a 1
GB region (which then would be invalidated twice, at both the PMD and
PGD level) or bigger.

Paolo


Re: [PATCH V3 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM

2019-02-22 Thread Paolo Bonzini
On 22/02/19 16:06, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> This patchset is to introduce hv ept tlb range list flush function
> support in the KVM MMU component. Flushing ept tlbs of several address
> range can be done via single hypercall and new list flush function is
> used in the kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset
> also adds more hv ept tlb range flush support in more KVM MMU function.
> 
> This patchset is based on the fix patch "x86/Hyper-V: Fix definition 
> HV_MAX_FLUSH_REP_COUNT".
> (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1939455.html)

Note that this won't make it in 5.1 unless Linus releases an -rc8.
Otherwise, I'll get to it next week.

Thanks,

Paolo


Re: [PATCH] kvm: no need to check return value of debugfs_create functions

2018-05-29 Thread Paolo Bonzini
On 29/05/2018 18:22, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> This cleans up the error handling a lot, as this code will never get
> hit.
> 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Paolo Bonzini 
> Cc: "Radim Krčmář" 
> Cc: Arvind Yadav 
> Cc: Eric Auger 
> Cc: Andre Przywara 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: kvm...@lists.cs.columbia.edu
> Cc: k...@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman 
> ---
>  arch/powerpc/kvm/book3s_hv.c   |  3 +--
>  virt/kvm/arm/vgic/vgic-debug.c | 17 ---
>  virt/kvm/arm/vgic/vgic.h   |  4 ++--
>  virt/kvm/kvm_main.c| 40 +++---
>  4 files changed, 15 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 4d07fca5121c..67d7de1470cc 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3950,8 +3950,7 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
>*/
>   snprintf(buf, sizeof(buf), "vm%d", current->pid);
>   kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
> - if (!IS_ERR_OR_NULL(kvm->arch.debugfs_dir))
> - kvmppc_mmu_debugfs_init(kvm);
> + kvmppc_mmu_debugfs_init(kvm);
>  
>   return 0;
>  }
> diff --git a/virt/kvm/arm/vgic/vgic-debug.c b/virt/kvm/arm/vgic/vgic-debug.c
> index 10b38178cff2..0140b29079b6 100644
> --- a/virt/kvm/arm/vgic/vgic-debug.c
> +++ b/virt/kvm/arm/vgic/vgic-debug.c
> @@ -263,21 +263,12 @@ static const struct file_operations vgic_debug_fops = {
>   .release = seq_release
>  };
>  
> -int vgic_debug_init(struct kvm *kvm)
> +void vgic_debug_init(struct kvm *kvm)
>  {
> - if (!kvm->debugfs_dentry)
> - return -ENOENT;
> -
> - if (!debugfs_create_file("vgic-state", 0444,
> -  kvm->debugfs_dentry,
> -  kvm,
> -  &vgic_debug_fops))
> - return -ENOMEM;
> -
> - return 0;
> + debugfs_create_file("vgic-state", 0444, kvm->debugfs_dentry, kvm,
> + &vgic_debug_fops);
>  }
>  
> -int vgic_debug_destroy(struct kvm *kvm)
> +void vgic_debug_destroy(struct kvm *kvm)
>  {
> - return 0;
>  }
> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> index 830e815748a0..3c38c5349953 100644
> --- a/virt/kvm/arm/vgic/vgic.h
> +++ b/virt/kvm/arm/vgic/vgic.h
> @@ -229,8 +229,8 @@ void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct 
> vgic_vmcr *vmcr);
>  int vgic_lazy_init(struct kvm *kvm);
>  int vgic_init(struct kvm *kvm);
>  
> -int vgic_debug_init(struct kvm *kvm);
> -int vgic_debug_destroy(struct kvm *kvm);
> +void vgic_debug_init(struct kvm *kvm);
> +void vgic_debug_destroy(struct kvm *kvm);
>  
>  bool lock_all_vcpus(struct kvm *kvm);
>  void unlock_all_vcpus(struct kvm *kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c7b2e927f699..0ad400f353fc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -572,10 +572,7 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
>   return 0;
>  
>   snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
> - kvm->debugfs_dentry = debugfs_create_dir(dir_name,
> -  kvm_debugfs_dir);
> - if (!kvm->debugfs_dentry)
> - return -ENOMEM;
> + kvm->debugfs_dentry = debugfs_create_dir(dir_name, kvm_debugfs_dir);
>  
>   kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries,
>sizeof(*kvm->debugfs_stat_data),
> @@ -591,11 +588,8 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
>   stat_data->kvm = kvm;
>   stat_data->offset = p->offset;
>   kvm->debugfs_stat_data[p - debugfs_entries] = stat_data;
> - if (!debugfs_create_file(p->name, 0644,
> -  kvm->debugfs_dentry,
> -  stat_data,
> -  stat_fops_per_vm[p->kind]))
> - return -ENOMEM;
> + debugfs_create_file(p-&

Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().


Reviewed-by: Paolo Bonzini 


Signed-off-by: Sean Christopherson 
---
  virt/kvm/kvm_main.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+   KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
  }
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
 */
if (wake)
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-   BUG_ON(kvm->mmu_invalidate_in_progress < 0);
  }
  
  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,




Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
From: Chao Peng 

Currently in mmu_notifier
invalidate path, hva range is recorded and then checked against by 
mmu_notifier_retry_hva() in the page fault handling path. However, for 
the to be introduced private memory, a page fault may not have a hva 
associated, checking gfn(gpa) makes more sense. For existing hva based 
shared memory, gfn is expected to also work. The only downside is when 
aliasing multiple gfns to a single hva, the current algorithm of 
checking multiple ranges could result in a much larger range being 
rejected. Such aliasing should be uncommon, so the impact is expected 
small.


Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
Add an assertion that there are no in-progress MMU invalidations when a 
VM is being destroyed, with the exception of the scenario where KVM 
unregisters its MMU notifier between an .invalidate_range_start() call 
and the corresponding .invalidate_range_end(). KVM can't detect unpaired 
calls from the mmu_notifier due to the above exception waiver, but the 
assertion can detect KVM bugs, e.g. such as the bug that *almost* 
escaped initial guest_memfd development.


Link: 
https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

   #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

   #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson 
---
  arch/powerpc/kvm/powerpc.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
break;
  #endif
case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   BUILD_BUG();
+#endif
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-   r = 1;
  #else
-   r = 0;
+   r = 1;
  #endif
break;
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.


The new #define should also imply KVM_CAP_SYNC_MMU, or even: 
KVM_CAP_SYNC_MMU should just be enabled by all architectures at this 
point.  You don't need to care about it, I have a larger series for caps 
that are enabled by all architectures and I'll post it for 6.8.


Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:


+   if (ioctl == KVM_SET_USER_MEMORY_REGION)
+   size = sizeof(struct kvm_userspace_memory_region);


This also needs a memset(&mem, 0, sizeof(mem)), otherwise the 
out-of-bounds access of the commit message becomes a kernel stack read.


Probably worth adding a check on valid flags here.

Paolo


+   else
+   size = sizeof(struct kvm_userspace_memory_region2);
+
+   /* Ensure the common parts of the two structs are identical. */
+   SANITY_CHECK_MEM_REGION_FIELD(slot);
+   SANITY_CHECK_MEM_REGION_FIELD(flags);
+   SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+   SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+   SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
 





Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-10-30 Thread Paolo Bonzini
On Mon, Oct 30, 2023 at 5:53 PM David Matlack  wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng 
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
>   
>
> Is there a missing word here?

No but there could be missing hyphens ("for the to-be-introduced
private memory"); possibly a "soon" could help parsing and that is
what you were talking about?

> >   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > + kvm->mmu_invalidate_range_start = INVALID_GPA;
> > + kvm->mmu_invalidate_range_end = INVALID_GPA;
>
> I don't think this is incorrect, but I was a little suprised to see this
> here rather than in end() when mmu_invalidate_in_progress decrements to
> 0.

I think that would be incorrect on the very first start?

> > + }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > + lockdep_assert_held_write(&kvm->mmu_lock);
>
> Does this compile/function on KVM architectures with
> !KVM_HAVE_MMU_RWLOCK?

Yes:

#define lockdep_assert_held_write(l)\
lockdep_assert(lockdep_is_held_type(l, 0))

where 0 is the lock-held type used by lock_acquire_exclusive. That, in
turn, is what you get for a spinlock or mutex, as well as for a rwlock or
rwsem that is taken for write.

Instead, lockdep_assert_held() asserts that the lock is taken without
asserting a particular lock-held type.
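
For illustration only (a sketch, not from the patch; kvm->mmu_lock and
KVM_HAVE_MMU_RWLOCK are as defined in kvm_host.h), both lock flavors
satisfy the write-held assertion:

	/*
	 * mmu_lock is a spinlock_t unless the architecture defines
	 * KVM_HAVE_MMU_RWLOCK, in which case it is an rwlock_t.  Both
	 * acquisitions below are "exclusive" as far as lockdep is
	 * concerned, so the assertion passes in either configuration.
	 */
	#ifdef KVM_HAVE_MMU_RWLOCK
		write_lock(&kvm->mmu_lock);
	#else
		spin_lock(&kvm->mmu_lock);
	#endif
		lockdep_assert_held_write(&kvm->mmu_lock);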

> > @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned 
> > long start,
>
> Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
> we're at it?

Yes, good idea.

Paolo



Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).


Could have been split in two patches, but it's fine anyway.

Reviewed-by: Paolo Bonzini 

Paolo



Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.


This is what David suggested to do in patch 3, FWIW.

Reviewed-by: Paolo Bonzini 

Paolo


Reported-by: Isaku Yamahata 
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson 
---





Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.


It's confusing who "they" is.  Maybe

 * HVA-based notifications provide a userspace address,
 * and as such are only relevant for shared mappings.

Paolo


+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
 
 			/*





Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

   - explicit conversion: happens when the guest explicitly calls into KVM
 to map a range (as private or shared)

   - implicit conversion: happens when the guest attempts to access a gfn
 that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy 
Cc: David Matlack 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 


---
  Documentation/virt/kvm/api.rst | 41 ++
  arch/x86/kvm/x86.c |  1 +
  include/linux/kvm_host.h   | 11 +
  include/uapi/linux/kvm.h   |  8 +++
  4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace 
should update the return
  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
  spec refer, https://github.com/riscv/riscv-sbi-doc.
  
+::

+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
  ::
  
  /* KVM_EXIT_NOTIFY */

@@ -7757,6 +,27 @@ This capability is aimed to mitigate the threat that 
malicious VMs can
  cause CPU stuck (due to event windows don't open up) and make the CPU
  unavailable to host or other VMs.
  
+7.34 KVM_CAP_MEMORY_FAULT_INFO

+--
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces 

Re: [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox 
Co-developed-by: Vlastimil Babka 
Signed-off-by: Vlastimil Babka 
Signed-off-by: Sean Christopherson 


I think this could even be "From: Vlastimil", but no biggie.

Paolo


---
  include/linux/pagemap.h | 19 +-
  mm/compaction.c | 43 +
  mm/migrate.c|  2 ++
  3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
-   AS_RELEASE_ALWAYS,  /* Call ->release_folio(), even if no private 
data */
+   AS_RELEASE_ALWAYS = 7,  /* Call ->release_folio(), even if no private 
data */
+   AS_UNMOVABLE= 8,/* The mapping cannot be moved, ever */
  };
  
  /**

@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct 
address_space *mapping)
clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
  }
  
+static inline void mapping_set_unmovable(struct address_space *mapping)

+{
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
+   set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+   return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
  {
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  
  	/* Time to isolate some pages for migration */

for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
  
  		if (skip_on_failure && low_pfn >= next_skip_pfn) {

/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
  
+		is_unevictable = folio_test_unevictable(folio);

+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && 
folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
  
  		/*

@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && 
folio_test_writeback(folio))
goto isolate_fail_put;
  
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {

-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+   (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
  
  			/*

 * Only folios without mappings or that have
-* a ->migrate_folio callback are possible to
-* migrate without blocking.  However, we may
-* be racing with truncation, which can free
-* the mapping.  Truncation holds the folio lock
-* until after the folio is removed from the page
-* cache so holding it ourselves is sufficient.
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+* holds the folio loc

Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:
Export anon_inode_getfile_secure() so that it can be used by KVM to 
create and manage file-based guest memory without need a fullblow 


without introducing a full-blown

Otherwise,

Reviewed-by: Paolo Bonzini 

Paolo

filesystem. The "standard" anon_inode_getfd() doesn't work for KVM's use 
case as KVM needs a unique inode for each file, e.g. to be able to 
independently manage the size and lifecycle of a given file.




Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a
non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns  -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/x86.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
  {
int r;
  
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;

    vcpu->arch.l1tf_flush_l1d = true;
  
  	for (;;) {


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).


Reviewed-by: Paolo Bonzini 



Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:22, Sean Christopherson wrote:

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson 


Reviewed-by: Paolo Bonzini 

with one nit:


+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types.  The 1-setting of bit @n


s/support/supported/

Paolo



Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes

2023-10-30 Thread Paolo Bonzini

On 10/27/23 20:21, Sean Christopherson wrote:

Non-KVM people, please take a gander at two small-ish patches buried in the
middle of this series:

   fs: Export anon_inode_getfile_secure() for use by KVM
   mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8, reviews (and acks!)
would be much appreciated.  Note, adding AS_UNMOVABLE isn't strictly required as
it's "just" an optimization, but we'd prefer to have it in place straightaway.


Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes.  The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes up, I'll do this to straighten the history:

git checkout kvm/next
git tag -s -f kvm-gmem HEAD
git reset --hard v6.7-rc1
git merge tags/kvm-gmem
# fix conflict with Christian Brauner's VFS series
git commit
git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for
non-upstream patches.  But it's mostly because I want all dependencies
to be able to move on and be developed on top of stock kvm/next).

Paolo



Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Paolo Bonzini

On 10/30/23 21:25, Sean Christopherson wrote:

On Mon, Oct 30, 2023, Paolo Bonzini wrote:

On 10/27/23 20:21, Sean Christopherson wrote:


+   if (ioctl == KVM_SET_USER_MEMORY_REGION)
+   size = sizeof(struct kvm_userspace_memory_region);


This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
access of the commit message becomes a kernel stack read.


Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
though as '0' is a valid file descriptor and a valid file offset.


Either is okay, because unless the flags check is screwed up it should
not matter.  The memset is actually unnecessary, though it may be a good
idea anyway to keep it, aka belt-and-suspenders.


Probably worth adding a check on valid flags here.


Definitely needed.  There's a very real bug here.  But rather than duplicate 
flags
checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
have the fancy guard(mutex) and there are no internal calls to 
kvm_set_memory_region(),
what if we:

   1. Acquire/release slots_lock in __kvm_set_memory_region()
   2. Call kvm_set_memory_region() from x86 code for the internal memslots
   3. Disallow *any* flags for internal memslots
   4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()


I dislike this step; there is a clear point where all paths meet
(ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
I think that's the place where flags should be checked.  (I don't mind
the restriction on internal memslots; it's just that to me it's not a
particularly natural way to structure the checks).

On the other hand, the place where to protect from out-of-bounds
accesses, is the place where you stop caring about struct
kvm_userspace_memory_region vs kvm_userspace_memory_region2 (and
your code gets it right, by dropping "ioctl" as soon as possible).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 87f45aa91ced..fe5a2af14fff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
return true;
 }
 
+/*

+ * Flags that do not access any of the extra space of struct
+ * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
+ * only allows these.
+ */
+#define KVM_SET_USER_MEMORY_REGION_FLAGS \
+   (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+
 static int check_memory_region_flags(struct kvm *kvm,
 const struct kvm_userspace_memory_region2 
*mem)
 {
@@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
struct kvm_userspace_memory_region2 mem;
unsigned long size;
 
-		if (ioctl == KVM_SET_USER_MEMORY_REGION)

+   if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+   /*
+* Fields beyond struct kvm_userspace_memory_region 
shouldn't be
+* accessed, but avoid leaking kernel memory in case of 
a bug.
+*/
+   memset(&mem, 0, sizeof(mem));
size = sizeof(struct kvm_userspace_memory_region);
-   else
+   } else {
size = sizeof(struct kvm_userspace_memory_region2);
+   }
 
 		/* Ensure the common parts of the two structs are identical. */

SANITY_CHECK_MEM_REGION_FIELD(slot);
@@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
if (copy_from_user(&mem, argp, size))
goto out;
 
+		r = -EINVAL;

+   if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+   (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
+   goto out;
+
r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
break;
}


That's a kind of patch that you can't really get wrong (though I have
the brown paper bag ready).

Maintenance-wise it's fine, since flags are being added at a pace of
roughly one every five years, and anyway it's also future proof: I placed
the #define near check_memory_region_flags so that in five years we remember
to keep it up to date.  But worst case, the new flags will only be allowed
by KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
waiting to bite us.

In sum, this is exactly the only kind of fix that should be in the v13->v14
delta.

Paolo



Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-31 Thread Paolo Bonzini
On Tue, Oct 31, 2023 at 11:13 PM Sean Christopherson  wrote:
> On Tue, Oct 31, 2023, Fuad Tabba wrote:
> > On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  
> > wrote:
> Since we now know that at least pKVM will use guest_memfd for shared memory, 
> and
> odds are quite good that "regular" VMs will also do the same, i.e. will want
> guest_memfd with the concept of private memory, I agree that we should avoid
> PRIVATE.
>
> Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
> KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between 
> referring
> to "guest memory" at-large and guest_memfd.

I was going to propose KVM_MEM_HAS_GUESTMEMFD.  Any option
is okay for me, so if no one complains, I'll go for KVM_MEM_GUESTMEMFD
(no underscore because I found the repeated "_MEM" distracting).

Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini
On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson  wrote:
>
> On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > > hugepages. Require userspace to opt-in via a flag even though there's 
> > > > > no
> > > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > > KVM can't provide hugepages.
> > > >
> > > > Personally, it seems not so "transparent" if requiring userspace to 
> > > > opt-in.
> > > >
> > > > People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> > > > support, or check is the sysfs of transparent hugepage exists; 2)get the
> > > > maximum support hugepage size 3) ensure the size satisfies the 
> > > > alignment;
> > > > before opt-in it.
> > > >
> > > > Even simpler, userspace can blindly try to create guest memfd with
> > > > transparent hugapage flag. If getting error, fallback to create without 
> > > > the
> > > > transparent hugepage flag.
> > > >
> > > > However, it doesn't look transparent to me.
> > >
> > > The "transparent" part is referring to the underlying kernel mechanism, 
> > > it's not
> > > saying anything about the API.  The "transparent" part of THP is that the 
> > > kernel
> > > doesn't guarantee hugepages, i.e. whether or not hugepages are actually 
> > > used is
> > > (mostly) transparent to userspace.
> > >
> > > Paolo also isn't the biggest fan[*], but there are also downsides to 
> > > always
> > > allowing hugepages, e.g. silent failure due to lack of THP or unaligned 
> > > size,
> > > and there's precedent in the form of MADV_HUGEPAGE.
> > >
> > > [*] 
> > > https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a...@redhat.com
> >
> > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> > continue.
> >
> > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> > to failure of guest memfd creation.
>
> Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> action from userspace, i.e. instead of ignoring the error, userspace could 
> redo
> KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
>
> We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we 
> could
> extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
> userspace provide advice/hints after creating a guest_memfd.  But I suspect 
> that
> guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation 
> hint
> is actually less desirable.
>
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> compatible size or the kernel doesn't support THP.  An incompatible size is 
> likely
> a userspace bug, and for most setups that want to utilize guest_memfd, lack 
> of THP
> support is likely a configuration bug.  I.e. many/most uses *want* failures 
> due to
> KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
>
> > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
>
> Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they 
> suggest
> the flag is a hint, and hints are usually best effort only, i.e. are ignored 
> if
> there is a fundamental incompatibility.
>
> "Allow" isn't perfect, e.g. I would much prefer a straight 
> KVM_GUEST_MEMFD_USE_HUGEPAGES
> or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
> doesn't
> (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger 
> than
> a hint, but weaker than a requirement.  And if/when KVM supports a dedicated 
> memory
> pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always
allow the flag,
and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini

On 11/1/23 17:36, Sean Christopherson wrote:

"Allow" isn't perfect, e.g. I would much prefer a straight 
KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
doesn't
(yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement.  And if/when KVM supports a dedicated 
memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.

I think that the current patch is fine, but I will adjust it to always
allow the flag, and to make the size check even if !CONFIG_TRANSPARENT_HUGEPAGE.
If hugepages are not guaranteed, and (theoretically) you could have no
hugepage at all in the result, it's okay to get this result even if THP is not
available in the kernel.

Can you post a fixup patch?  It's not clear to me exactly what behavior you 
intend
to end up with.


Sure, just this:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7d1a33c2ad42..34fd070e03d9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct 
kvm_create_guest_memfd *args)
 {
loff_t size = args->size;
u64 flags = args->flags;
-   u64 valid_flags = 0;
-
-   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-   valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+   u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
 
 	if (flags & ~valid_flags)

return -EINVAL;
@@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
kvm_create_guest_memfd *args)
if (size < 0 || !PAGE_ALIGNED(size))
return -EINVAL;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE

if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
!IS_ALIGNED(size, HPAGE_PMD_SIZE))
return -EINVAL;
-#endif
 
 	return __kvm_gmem_create(kvm, size, flags);

 }

Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Paolo Bonzini
On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson  wrote:
>
> On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > > "Allow" isn't perfect, e.g. I would much prefer a straight 
> > > > > KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey 
> > > > > that KVM doesn't
> > > > > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is 
> > > > > stronger than
> > > > > a hint, but weaker than a requirement.  And if/when KVM supports a 
> > > > > dedicated memory
> > > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > > I think that the current patch is fine, but I will adjust it to always
> > > > allow the flag, and to make the size check even if 
> > > > !CONFIG_TRANSPARENT_HUGEPAGE.
> > > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > > hugepage at all in the result, it's okay to get this result even if THP 
> > > > is not
> > > > available in the kernel.
> > > Can you post a fixup patch?  It's not clear to me exactly what behavior 
> > > you intend
> > > to end up with.
> >
> > Sure, just this:
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 7d1a33c2ad42..34fd070e03d9 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > kvm_create_guest_memfd *args)
> >  {
> >   loff_t size = args->size;
> >   u64 flags = args->flags;
> > - u64 valid_flags = 0;
> > -
> > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> >   if (flags & ~valid_flags)
> >   return -EINVAL;
> > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > kvm_create_guest_memfd *args)
> >   if (size < 0 || !PAGE_ALIGNED(size))
> >   return -EINVAL;
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> >   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> >   return -EINVAL;
> > -#endif
>
> That won't work, HPAGE_PMD_SIZE is valid only for 
> CONFIG_TRANSPARENT_HUGEPAGE=y.
>
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

Would have caught it when actually testing it, I guess. :) It has to
be PMD_SIZE, possibly with

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
#endif

for extra safety.

Paolo



Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-11-02 Thread Paolo Bonzini

On 11/2/23 04:01, Huang, Kai wrote:

On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:

From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
 a guest memory range.
   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
 memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Cc: Xu Yilun 
Cc: Mickaël Salaün 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 



[...]


+Note, there is no "get" API.  Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+



[...]

  
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES

+static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t 
gfn)
+{
+   return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}


Only call xa_to_value() when xa_load() returns !NULL?


This xarray does not store a pointer, therefore xa_load() actually 
returns an integer that is tagged with 1 in the low bit:


static inline unsigned long xa_to_value(const void *entry)
{
return (unsigned long)entry >> 1;
}

Returning zero for an empty entry is okay, so the result of xa_load() 
can be used directly.
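
As an aside, a tiny sketch (mine, not from the patch) of why the empty
case falls out naturally from the value-entry encoding:

	/*
	 * Attributes are stored as xarray value entries: xa_mk_value()
	 * shifts the value left and sets the tag bit, xa_to_value()
	 * shifts it back out.  A gfn with no entry makes xa_load()
	 * return NULL, and the same shift then yields 0, i.e. "no
	 * attributes set".
	 */
	void *entry = xa_mk_value(KVM_MEMORY_ATTRIBUTE_PRIVATE);

	WARN_ON(xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE);
	WARN_ON(xa_to_value(NULL) != 0);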




+
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+unsigned long attrs);


Seems it's not immediately clear why this function is needed in this patch,
especially when you said there is no "get" API above.  Add some material to
changelog?


It's used by later patches; even without a "get" API, it's a pretty 
fundamental functionality.



+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+   struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+struct kvm_gfn_range *range);


Looks if this Kconfig is on, the above two arch hooks won't have implementation.

Is it better to have two __weak empty versions here in this patch?

Anyway, from the changelog it seems it's not mandatory for some ARCH to provide
the above two if one wants to turn this on, i.e., the two hooks can be empty and
the ARCH can just use the __weak version.


I think this can be added by the first arch that needs memory attributes 
and also doesn't need one of these hooks.  Or perhaps the x86 
kvm_arch_pre_set_memory_attributes() could be made generic and thus that 
would be the __weak version.  It's too early to tell, so it's better to 
leave the implementation to the architectures for now.


Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/1/23 18:36, Sean Christopherson wrote:

A good example is KVM_RUN with -EINTR; if KVM were to return something other 
than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall 
over.


And dually if KVM were to return KVM_EXIT_INTR together with something 
other than -EINTR.


And purging exit_reason super early is subtly tricky because KVM's 
(again, poorly documented) ABI is that *some* exit reasons are preserved 
across KVM_RUN with vcpu->run->immediate_exit (or with a pending 
signal). https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com


vcpu->run->immediate_exit preserves all exit reasons, but it's not a 
good idea that immediate_exit behaves different from a pending signal on 
entry to KVM_RUN (remember that immediate_exit was meant to be a better 
performing alternative to KVM_SET_SIGNAL_MASK).


In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR 
(perhaps even _should_, except that breaks selftests so at this point it 
is ABI).


Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/2/23 10:35, Huang, Kai wrote:

IIUC KVM can already handle the case of poisoned
page by sending signal to user app: 

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, 
			struct kvm_page_fault *fault)   
	{   
		...


if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(fault->slot, fault->gfn);
	return RET_PF_RETRY;  
	}

}


EHWPOISON is not implemented by this series, so it should be left out of 
the documentation.




Currently as mentioned above when
vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
Qemu prints ...

...: Bad address


... which is nonsense.

If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
a specific value for EPC to let Qemu know and Qemu can then do more reasonable
things.


Yes, that's a good idea that can be implemented on top.

Paolo



Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Paolo Bonzini

On 11/2/23 06:59, Binbin Wu wrote:



Add flags to "struct kvm_gfn_range" to let notifier events target
only shared and only private mappings, and write up the existing
mmu_notifier events to be shared-only (private memory is never
associated with a userspace virtual address, i.e. can't be reached
via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities
(shared, private, and shared+private) without needing something
like a tri-state enum.


I see the two flags are set/cleared in __kvm_handle_hva_range() in
this patch and kvm_handle_gfn_range() from the later patch 13/35, but
I didn't see they are used/read in this patch series if I didn't miss
anything.  How are they supposed to be used in KVM?


They are going to be used by SNP/TDX patches.

Paolo



Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Paolo Bonzini

On 10/31/23 23:39, David Matlack wrote:

Maybe can you sketch out how you see this proposal being extensible to
using guest_memfd for shared mappings?

For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  
What's
missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
ensure there are no outstanding references when converting back to private.

For TDX/SNP, assuming we don't find a performant and robust way to do in-place
conversions, a second fd+offset pair would be needed.

Is there a way to support non-in-place conversions within a single guest_memfd?


For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to 
guest memory.  The hook would invalidate now-private parts if they have 
a VMA, causing a SIGSEGV/EFAULT if the host touches them.


It would forbid mappings from multiple gfns to a single offset of the 
guest_memfd, because then the shared vs. private attribute would be tied 
to the offset.  This should not be a problem; for example, in the case 
of SNP, the RMP already requires a single mapping from host physical 
address to guest physical address.


Paolo



Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-02 Thread Paolo Bonzini
On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson  wrote:
> Actually, looking that this again, there's not actually a hard dependency on 
> THP.
> A THP-enabled kernel _probably_  gives a higher probability of using 
> hugepages,
> but mostly because THP selects COMPACTION, and I suppose because using THP for
> other allocations reduces overall fragmentation.

Yes, that's why I didn't even bother enabling it unless THP is
enabled, but it makes even more sense to just try.

> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I 
> think
> we should do the below (I verified KVM can create hugepages with THP=n).  
> We'll
> need another capability, but (a) we probably should have that anyways and (b) 
> it
> provides a cleaner path to adding PUD-sized hugepage support in the future.

I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
should be a generic kernel API and in fact the sizes are available in
a not-so-friendly format in /sys/kernel/mm/hugepages.

We should just add /sys/kernel/mm/hugepages/sizes that contains
"2097152 1073741824" on x86 (only the former if 1G pages are not
supported).

Plus: is this the best API if we need something else for 1G pages?

Let's drop *this* patch and proceed incrementally. (Again, this is
what I want to do with this final review: identify places that are
still sticky, and don't let them block the rest).

Coincidentally we have an open spot next week at Plumbers. Let's
extend Fuad's section to cover more guestmem work.

Paolo

> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
> b/tools/testing/selftests/kvm/guest_memfd_test.c
> index c15de9852316..c9f449718fce 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -201,6 +201,10 @@ int main(int argc, char *argv[])
>
> TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
>
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && 
> thp_configured())
> +   TEST_ASSERT_EQ(get_trans_hugepagesz(),
> +  
> kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
> +
> page_size = getpagesize();
> total_size = page_size * 4;
>
> diff --git 
> a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
> b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> index be311944e90a..245901587ed2 100644
> --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> @@ -396,7 +396,7 @@ static void test_mem_conversions(enum 
> vm_mem_backing_src_type src_type, uint32_t
>
> vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << 
> KVM_HC_MAP_GPA_RANGE));
>
> -   if (backing_src_can_be_huge(src_type))
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
> memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> else
> memfd_flags = 0;
>
> --
> From: Sean Christopherson 
> Date: Wed, 25 Oct 2023 16:26:41 -0700
> Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
>  memory
>
> Extend guest_memfd to allow backing guest memory with hugepages.  For now,
> make hugepage utilization best-effort, i.e. fall back to non-huge mappings
> if a hugepage can't be allocated.  Guaranteeing hugepages would require a
> dedicated memory pool and significantly more complexity and churn.
>
> Require userspace to opt-in via a flag even though it's unlikely real use
> cases will ever want to use order-0 pages, e.g. to give userspace a safety
> valve in case hugepage support is buggy, and to allow for easier testing
> of both paths.
>
> Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
> primarily deals with userspace page tables, which are explicitly not in
> play for guest_memfd.  Selecting THP does make obtaining hugepages more
> likely, but only because THP selects CONFIG_COMPACTION.  Don't select
> CONFIG_COMPACTION either, because again it's not a hard dependency.
>
> For simplicity, require the guest_memfd size to be a multiple of the
> hugepage size, e.g. so that KVM doesn't need to do bounds checking when
> deciding whether or not to allocate a huge folio.
>
> When reporting the max order when KVM gets a pfn from guest_memfd, force
> order-0 pages if the hugepage is not fully contained by the memslot
> binding, e.g. if userspace requested hugepages but punches a hole in the
> memslot bindings in order to emulate x86's VGA hole.
>
> Signed-off-by: Sean Christopherson 
> ---
>  Documentation/virt/kvm/api.rst | 17 +
>  include/uapi/linux/kvm.h   |  3 ++
>  virt/kvm/guest_memfd.c | 69 +-
>  virt/kvm/kvm_main.c|  4 ++
>  4 files changed, 84 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e82c69d5e755..ccdd5413920d 100644
> --- a/Documen

Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-11-03 Thread Paolo Bonzini

On 11/2/23 17:24, Christian Brauner wrote:

On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:

Export anon_inode_getfile_secure() so that it can be used by KVM to create
and manage file-based guest memory without need a fullblow filesystem.
The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
needs a unique inode for each file, e.g. to be able to independently
manage the size and lifecycle of a given file.

Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
the name.

Signed-off-by: Sean Christopherson 
---


Before we enshrine this misleading name let's rename this to:

create_anon_inode_getfile()

I don't claim it's a great name but it's better than *_secure() which is
very confusing. So just:

struct file *create_anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags)


I slightly prefer anon_inode_create_getfile(); grepping include/linux 
for '\

Neither userfaultfd (which uses anon_inode_getfd_secure()) nor io_uring 
strictly speaking need separate inodes; they do want the call to 
inode_init_security_anon().  But I agree that the new name is better and 
I will adjust the comments so that it is clear why you'd use this 
function instead of anon_inode_get{file,fd}().



May also just remove that context_inode argument from the exported
function. The only other caller is io_uring. And neither it nor this
patchset need the context_inode thing afaict.


True, OTOH we might as well rename anon_inode_getfd_secure() to 
anon_inode_create_getfd(), and that one does need context_inode.


I'll Cc you on v14 and will carry the patch in my tree.

Paolo


Merge conflict risk is
extremely low so carrying that as part of this patchset is fine and
shouldn't cause huge issues for you.





Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory

2023-11-05 Thread Paolo Bonzini
On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun  wrote:
>
> > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +   struct kvm_page_fault *fault)
> > +{
> > + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > +   PAGE_SIZE, fault->write, fault->exec,
> > +   fault->is_private);
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > +struct kvm_page_fault *fault)
> > +{
> > + int max_order, r;
> > +
> > + if (!kvm_slot_can_be_private(fault->slot)) {
> > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > + return -EFAULT;
> > + }
> > +
> > + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > +  &max_order);
> > + if (r) {
> > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > + return r;
>
> Why report KVM_EXIT_MEMORY_FAULT here? even with a ret != -EFAULT?

The cases are EFAULT, EHWPOISON (which can report
KVM_EXIT_MEMORY_FAULT) and ENOMEM. I think it's fine
that even -ENOMEM can return KVM_EXIT_MEMORY_FAULT,
and it doesn't violate the documentation.  The docs tell you "what
can you do if the error is EFAULT or EHWPOISON?"; they don't
exclude that other errnos result in KVM_EXIT_MEMORY_FAULT,
it's just that you're not supposed to look at it in those cases.
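
For what it's worth, a sketch of the userspace side of that contract
(hypothetical VMM code, not from the series; handle_memory_fault() is a
made-up helper):

	/*
	 * Hypothetical VMM check: only trust run->exit_reason ==
	 * KVM_EXIT_MEMORY_FAULT when KVM_RUN failed with EFAULT or
	 * EHWPOISON; for any other errno (e.g. ENOMEM) treat the run
	 * struct as stale, even if KVM happened to fill it in.
	 */
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		handle_memory_fault(run->memory.gpa, run->memory.size,
				    run->memory.flags);
	else if (ret < 0)
		; /* run->exit_reason may be stale/undefined here */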

Paolo



[PATCH v14 00/34] KVM: guest_memfd() and per-page attributes

2023-11-05 Thread Paolo Bonzini
Adding AS_UNMOVABLE isn't strictly required as it's "just" an
optimization, but we'd prefer to have it in place straightaway.

If you would like to see a range-diff, I suggest using Patchew; start
from https://patchew.org/linux/20231027182217.3615211-1-sea...@google.com/
and click v14 on top.

Thanks,

Paolo

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
    guest_memfd()
  KVM: selftests: Add basic selftest for guest_memfd()

Paolo Bonzini (1):
  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()

Sean Christopherson (23):
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
ranges
  KVM: Assert that mmu_invalidate_in_progress *never* goes negative
  KVM: WARN if there are dangling MMU invalidations at VM destruction
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  KVM: Drop .on_unlock() mmu_notifier hook
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
memory
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  KVM: selftests: Add a memory region subtest to validate invalid flags
  KVM: Prepare for handling only shared mappings in mmu_notifier events
  KVM: Add transparent hugepage support for dedicated guest memory

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
(x86)
  KVM: selftests: Add x86-only selftest for private memory conversions


 Documentation/virt/kvm/api.rst| 209 +++
 arch/arm64/include/asm/kvm_host.h |   2 -
 arch/arm64/kvm/Kconfig|   2 +-
 arch/loongarch/include/asm/kvm_host.h |   1 -
 arch/loongarch/kvm/Kconfig|   2 +-
 arch/mips/include/asm/kvm_host.h  |   2 -
 arch/mips/kvm/Kconfig |   2 +-
 arch/powerpc/include/asm/kvm_host.h   |   2 -
 arch/powerpc/kvm/Kconfig  |   8 +-
 arch/powerpc/kvm/book3s_hv.c  |   2 +-
 arch/powerpc/kvm/powerpc.c|   7 +-
 arch/riscv/include/asm/kvm_host.h |   2 -
 arch/riscv/kvm/Kconfig|   2 +-
 arch/x86/include/asm/kvm_host.h   |  17 +-
 arch/x86/include/uapi/asm/kvm.h   |   3 +
 arch/x86/kvm/Kconfig  |  14 +-
 arch/x86/kvm/debugfs.c|   2 +-
 arch/x86/kvm/mmu/mmu.c| 271 +++-
 arch/x86/kvm/mmu/mmu_internal.h   |   2 +
 arch/x86/kvm/vmx/vmx.c|  11 +-
 arch/x86/kvm/x86.c|  26 +-
 fs/anon_inodes.c  |  47 +-
 fs/userfaultfd.c  |   5 +-
 include/linux/anon_inodes.h   |   4 +-
 include/linux/kvm_host.h  | 144 -
 include/linux/kvm_types.h |   1 +
 include/linux/pagemap.h   |  19 +-
 include/uapi/linux/kvm.h  |  51 ++
 io_uring/io_uring.c   |   3 +-
 mm/compaction.c   |  43 +-
 mm/migrate.c  |   2 +
 tools/testing/selftests/kvm/Makefile  |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c  | 221 +++
 .../selftests/kvm/include/kvm_util_base.h | 148 -
 .../testing/selftests/kvm/include/test_util.h |   5 +
 .../selftests/kvm/include/ucall_common.h  |  11 +
 .../selftests/kvm

[PATCH 01/34] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.  Rename the
handler typedef too (arguably it should always have been gfn_handler_t).

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.
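
As a purely illustrative sketch (hypothetical caller, not part of this
patch), a gfn-based user of the renamed struct could describe itself the
same way, putting gfns rather than hvas into the now-64-bit start/end
fields:

static bool example_unmap_gfns(struct kvm *kvm, struct kvm_gfn_range *range)
{
	/* Arch hook that zaps mappings for [range->start, range->end). */
	return kvm_unmap_gfn_range(kvm, range);
}

static void example_invalidate_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
{
	const struct kvm_mmu_notifier_range range = {
		.start		= start,	/* gfns, not hvas */
		.end		= end,
		.handler	= example_unmap_gfns,
		.flush_on_ret	= true,
		/* .arg, .on_lock and .on_unlock omitted for brevity */
	};

	/* a gfn-based walker analogous to __kvm_handle_hva_range() would go here */
	(void)range;
}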

Reviewed-by: Paolo Bonzini 
Reviewed-by: Xiaoyao Li 
Signed-off-by: Sean Christopherson 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-2-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..0524933856d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 unsigned long end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
-struct kvm_hva_range {
-   unsigned long start;
-   unsigned long end;
+struct kvm_mmu_notifier_range {
+   /*
+* 64-bit addresses, as KVM notifiers can operate on host virtual
+* addresses (unsigned long) and guest physical addresses (64-bit).
+*/
+   u64 start;
+   u64 end;
union kvm_mmu_notifier_arg arg;
-   hva_handler_t handler;
+   gfn_handler_t handler;
on_lock_fn_t on_lock;
on_unlock_fn_t on_unlock;
bool flush_on_ret;
@@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
 node = interval_tree_iter_next(node, start, last))  \
 
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range *range)
+ const struct kvm_mmu_notifier_range *range)
 {
bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
@@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
 
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
-   hva_start = max(range->start, slot->userspace_addr);
-   hva_end = min(range->end, slot->userspace_addr +
- (slot->npages << PAGE_SHIFT));
+   hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
+   hva_end = min_t(unsigned long, range->end,
+   slot->userspace_addr + (slot->npages << PAGE_SHIFT));
 
/*
 * To optimize for the likely case where the address
@@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
union kvm_mmu_notifier_arg arg,
-   hva_handler_t handler)
+   gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.arg= arg,
@@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 unsigned long start,
 unsigned long end,
-hva_handler_t handler)
+gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.handler= handler,
@@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
  

[PATCH 02/34] KVM: Assert that mmu_invalidate_in_progress *never* goes negative

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().
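
For reference, a simplified sketch (not the exact kernel code) of why any
non-zero count, including a corrupted negative one, blocks all new
mappings:

static inline int example_mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	/* Any non-zero value, including a stuck negative one, forces a retry. */
	if (unlikely(kvm->mmu_invalidate_in_progress))
		return 1;
	/* A completed invalidation bumped the sequence count; retry as well. */
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return 1;
	return 0;
}

Page fault handlers bail out and retry whenever this returns non-zero, so
a count that can never reach zero means no mapping is ever installed in
that VM.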

Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-3-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+   KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 */
if (wake)
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-   BUG_ON(kvm->mmu_invalidate_in_progress < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
-- 
2.39.1




[PATCH 03/34] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-11-05 Thread Paolo Bonzini
From: Chao Peng 

Currently, in the mmu_notifier invalidate path, the hva range is recorded
and then checked against by mmu_invalidate_retry_hva() in the page fault
handling path. However, for the soon-to-be-introduced private memory, a
page fault may not have an hva associated with it, so checking the
gfn (gpa) makes more sense.

For existing hva-based shared memory, gfn is expected to work as well. The
only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.
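
A simplified sketch of the idea (illustrative only; field and helper names
may differ from the actual patch):

/* Record the union of all gfn ranges touched by an in-progress invalidation. */
static void example_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
{
	kvm->mmu_invalidate_range_start = min(kvm->mmu_invalidate_range_start, start);
	kvm->mmu_invalidate_range_end = max(kvm->mmu_invalidate_range_end, end);
}

/* The page fault path retries if the faulting gfn overlaps that union. */
static bool example_mmu_invalidate_retry_gfn(struct kvm *kvm, unsigned long mmu_seq,
					     gfn_t gfn)
{
	if (unlikely(kvm->mmu_invalidate_in_progress) &&
	    gfn >= kvm->mmu_invalidate_range_start &&
	    gfn < kvm->mmu_invalidate_range_end)
		return true;

	return kvm->mmu_invalidate_seq != mmu_seq;
}

Because a single [start, end) union is kept, aliasing one hva to several
far-apart gfns inflates the rejected range, which is the downside
mentioned above.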

Suggested-by: Sean Christopherson 
Cc: Xu Yilun 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Xu Yilun 
Message-Id: <20231027182217.3615211-4-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++
 arch/x86/kvm/vmx/vmx.c   | 11 +-
 include/linux/kvm_host.h | 33 +++---
 virt/kvm/kvm_main.c  | 43 +++-
 4 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0f01d605617..b2d916f786ca 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4366,7 +4366,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6260,7 +6260,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
write_lock(&kvm->mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+   kvm_mmu_invalidate_begin(kvm);
+
+   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6270,7 +6272,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, 0, -1ul);
+   kvm_mmu_invalidate_end(kvm);
 
write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..40e3780d73ae 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
return;
 
/*
-* Grab the memslot so that the hva lookup for the mmu_notifier retry
-* is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-* on the pfn lookup's validation of the memslot to ensure a valid hva
-* is used for the retry check.
+* Explicitly grab the memslot using KVM's internal slot ID to ensure
+* KVM doesn't unintentionally grab a userspace memslot.  It _should_
+* be impossible for userspace to create a memslot for the APIC when
+* APICv is enabled, but paranoia won't hurt in this case.
 */
slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
return;
 
read_lock(&vcpu->kvm->mmu_lock);
-   if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-gfn_to_hva_memslot(slot, gfn))) {
+   if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
read_unlock(&vcpu->kvm->mmu_lock);
goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range

[PATCH 04/34] KVM: WARN if there are dangling MMU invalidations at VM destruction

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
Message-Id: <20231027182217.3615211-5-sea...@google.com>
Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9cc57b23ec81..5422ce20dcba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1358,9 +1358,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
 * No threads can be waiting in kvm_swap_active_memslots() as the
 * last reference on KVM has been dropped, but freeing
 * memslots would deadlock without this manual intervention.
+*
+* If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
+* notifier between a start() and end(), then there shouldn't be any
+* in-progress invalidations.
 */
WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-   kvm->mn_active_invalidate_count = 0;
+   if (kvm->mn_active_invalidate_count)
+   kvm->mn_active_invalidate_count = 0;
+   else
+   WARN_ON(kvm->mmu_invalidate_in_progress);
 #else
kvm_flush_shadow_all(kvm);
 #endif
-- 
2.39.1



