Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
On Thu, Oct 26, 2023 at 11:24:39AM +0100, Will Deacon wrote:
> On Thu, Oct 26, 2023 at 11:58:00AM +0300, Mike Rapoport wrote:
> > On Mon, Oct 23, 2023 at 06:14:20PM +0100, Will Deacon wrote:
> > > On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> > > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > > > index dd851297596e..cd6320de1c54 100644
> > > > --- a/arch/arm64/kernel/module.c
> > > > +++ b/arch/arm64/kernel/module.c

...

> > > > -	if (module_direct_base) {
> > > > -		p = __vmalloc_node_range(size, MODULE_ALIGN,
> > > > -					 module_direct_base,
> > > > -					 module_direct_base + SZ_128M,
> > > > -					 GFP_KERNEL | __GFP_NOWARN,
> > > > -					 PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > > -					 __builtin_return_address(0));
> > > > -	}
> > > > +	module_init_limits();
> > >
> > > Hmm, this used to be run from subsys_initcall(), but now you're running
> > > it _really_ early, before random_init(), so randomization of the module
> > > space is no longer going to be very random if we don't have early entropy
> > > from the firmware or the CPU, which is likely to be the case on most SoCs.
> >
> > Well, it will be as random as KASLR. Won't that be enough?
>
> I don't think that's true -- we have the 'kaslr-seed' property for KASLR,
> but I'm not seeing anything like that for the module randomisation and I
> also don't see why we need to set these limits so early.

x86 needs execmem initialized before ftrace_init() so I thought it would
be best to setup execmem along with most of MM in mm_core_init(). I'll
move execmem initialization for !x86 to a later point, say core_initcall.

> Will

--
Sincerely yours,
Mike.
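[Editor's note: a minimal sketch of the deferral Mike describes, assuming the
arm64 entry point stays module_init_limits(); the wrapper name
execmem_late_init() is hypothetical, not taken from the series.]

/*
 * Hypothetical sketch, not the actual patch: defer module area
 * randomization on !x86 to core_initcall() so it runs after
 * random_init() and can observe real entropy.
 */
static int __init execmem_late_init(void)
{
	module_init_limits();
	return 0;
}
core_initcall(execmem_late_init);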
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
Hi Arnd!

There is some non-x86 hardware like the Amiga that still uses PCMCIA-style
networking cards on machines like the A600 and A1200. So, unless these
drivers are actually causing problems, I would rather not see them go yet.

Thanks,
Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH v12 6/6] powerpc: add crash memory hotplug support
Hello,

On 30/10/23 06:13, kernel test robot wrote:

Hi Sourabh,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.6-rc7]
[cannot apply to powerpc/next powerpc/fixes tip/x86/core next-20231027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Sourabh-Jain/crash-forward-memory_notify-arg-to-arch-crash-hotplug-handler/20231029-213858
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231029124039.6158-7-sourabhjain%40linux.ibm.com
patch subject: [PATCH v12 6/6] powerpc: add crash memory hotplug support
config: powerpc64-randconfig-001-20231029 (https://download.01.org/0day-ci/archive/20231030/202310300812.yn6fupcz-...@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231030/202310300812.yn6fupcz-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version
of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202310300812.yn6fupcz-...@intel.com/

All errors (new ones prefixed by >>):

   powerpc64-linux-ld: warning: discarding dynamic section .glink
   powerpc64-linux-ld: warning: discarding dynamic section .plt
   powerpc64-linux-ld: linkage table error against `add_opal_mem_range'
   powerpc64-linux-ld: stubs don't match calculated size
   powerpc64-linux-ld: can not build stubs: bad value
   powerpc64-linux-ld: arch/powerpc/kexec/core_64.o: in function `get_crash_memory_ranges':
   core_64.c:(.text+0x1010): undefined reference to `add_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1064): undefined reference to `sort_memory_ranges'
   powerpc64-linux-ld: core_64.c:(.text+0x1128): undefined reference to `add_rtas_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1140): undefined reference to `add_opal_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1160): undefined reference to `add_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1188): undefined reference to `sort_memory_ranges'
   powerpc64-linux-ld: core_64.c:(.text+0x11e0): undefined reference to `realloc_mem_ranges'
   powerpc64-linux-ld: arch/powerpc/kexec/core_64.o: in function `arch_crash_handle_hotplug_event':
   core_64.c:(.text+0x1900): undefined reference to `remove_mem_range'

The patch series uses a couple of functions defined in
arch/powerpc/kexec/ranges.c. Since that file is only built when
CONFIG_KEXEC_FILE is set, the link fails when that option is disabled;
with CONFIG_KEXEC_FILE set, the issue is not observed.

I will fix the above errors and post the next version.

Thanks,
Sourabh
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
>From: Chao Peng
>
>In confidential computing usages, whether a page is private or shared is
>necessary information for KVM to perform operations like page fault
>handling, page zapping etc. There are other potential use cases for
>per-page memory attributes, e.g. to make memory read-only (or no-exec,
>or exec-only, etc.) without having to modify memslots.
>
>Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
>userspace to operate on the per-page memory attributes.
>  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>    a guest memory range.
>  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>    memory attributes.

This ioctl() is already removed. So, the changelog is out-of-date and
needs an update.

>
>+
>+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
>+:Architectures: x86
>+:Type: vm ioctl
>+:Parameters: struct kvm_memory_attributes(in)

                                            ^ add one space here?

>+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
>+					  struct kvm_gfn_range *range)
>+{
>+	/*
>+	 * Unconditionally add the range to the invalidation set, regardless of
>+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
>+	 * if KVM supports RWX attributes in the future and the attributes are
>+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
>+	 * adding the range allows KVM to require that MMU invalidations add at
>+	 * least one range between begin() and end(), e.g. allows KVM to detect
>+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,

Relaxing

>@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> 	case KVM_CAP_BINARY_STATS_FD:
> 	case KVM_CAP_SYSTEM_EVENT_DATA:
> 		return 1;
>+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>+	case KVM_CAP_MEMORY_ATTRIBUTES:
>+		u64 attrs = kvm_supported_mem_attributes(kvm);
>+
>+		r = -EFAULT;
>+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
>+			goto out;
>+		r = 0;
>+		break;

This cannot work, e.g., no @argp in this function and is fixed by a later
commit:

 fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
On Mon, Oct 30, 2023, at 08:19, John Paul Adrian Glaubitz wrote:
> Hi Arnd!
>
> There is some non-x86 hardware like the Amiga that still uses
> PCMCIA-style networking cards on machines like the A600 and A1200.
> So, unless these drivers are actually causing problems, I would
> rather not see them go yet.

Do you know of any systems other than the Amiga that are still in use
with new kernels and that rely on PCMCIA networking?

I know that Amiga has its own simple CONFIG_AMIGA_PCMCIA implementation
that is incompatible with CONFIG_PCMCIA device drivers, so it is not
affected by this. For the few ARM systems that still support a CF card
slot, these tend to be used for IDE type storage cards because their
internal flash is too limited otherwise.

     Arnd
Re: [RFC PATCH v8 00/13] Add audio support in v4l2 framework
On Mon, Oct 30, 2023 at 3:56 AM Shengjiu Wang wrote:
>
> On Fri, Oct 27, 2023 at 7:18 PM Hans Verkuil wrote:
> >
> > Hi Shengjiu,
> >
> > Is there a reason why this series is still marked RFC?
> >
> > Just wondering about that.
>
> In the very beginning I started this series with RFC, So
> I still use RFC now...

I think we are way past the RFC stage, so if there will be another
version, it is better to remove the RFC tag. RFC means that a patch
series is not yet ready to be merged and is still in the early stages
of development.

thanks,
Daniel.
Re: [PATCH v7 1/5] powerpc/code-patching: introduce patch_instructions()
Hari Bathini writes:

> patch_instruction() entails setting up pte, patching the instruction,
> clearing the pte and flushing the tlb. If multiple instructions need
> to be patched, every instruction would have to go through the above
> drill unnecessarily. Instead, introduce patch_instructions() function
> that sets up the pte, clears the pte and flushes the tlb only once
> per page range of instructions to be patched. Duplicate most of the
> patch_instruction() code instead of merging with it, to avoid the
> performance degradation observed on ppc32, for patch_instruction(),
> with the code path merged. Also, setup poking_init() always as BPF
> expects poking_init() to be setup even when STRICT_KERNEL_RWX is off.
>
> Signed-off-by: Hari Bathini
> Acked-by: Song Liu

A lot of this is a duplicate of patch_instruction(). Can we consolidate
things between them?

> ---
>
> Changes in v7:
> * Fixed crash observed with !STRICT_RWX.
>
>  arch/powerpc/include/asm/code-patching.h |   1 +
>  arch/powerpc/lib/code-patching.c         | 141 ++++++++++++++++++++++-
>  2 files changed, 139 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 3f881548fb61..0e29ccf903d0 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -74,6 +74,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
>  int patch_branch(u32 *addr, unsigned long target, int flags);
>  int patch_instruction(u32 *addr, ppc_inst_t instr);
>  int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
> +int patch_instructions(u32 *addr, u32 *code, size_t len, bool repeat_instr);
>
>  static inline unsigned long patch_site_addr(s32 *site)
>  {
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index b00112d7ad46..e1c1fd9246d8 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -204,9 +204,6 @@ void __init poking_init(void)
>  {
>  	int ret;
>
> -	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
> -		return;
> -
>  	if (mm_patch_enabled())
>  		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>  					"powerpc/text_poke_mm:online",
> @@ -378,6 +375,144 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)
>  }
>  NOKPROBE_SYMBOL(patch_instruction);
>
> +static int __patch_instructions(u32 *patch_addr, u32 *code, size_t len, bool repeat_instr)
> +{
> +	unsigned long start = (unsigned long)patch_addr;
> +
> +	/* Repeat instruction */
> +	if (repeat_instr) {
> +		ppc_inst_t instr = ppc_inst_read(code);
> +
> +		if (ppc_inst_prefixed(instr)) {
> +			u64 val = ppc_inst_as_ulong(instr);
> +
> +			memset64((u64 *)patch_addr, val, len / 8);
> +		} else {
> +			u32 val = ppc_inst_val(instr);
> +
> +			memset32(patch_addr, val, len / 4);
> +		}
> +	} else {
> +		memcpy(patch_addr, code, len);
> +	}
> +
> +	smp_wmb();	/* smp write barrier */
> +	flush_icache_range(start, start + len);
> +	return 0;
> +}
> +
> +/*
> + * A page is mapped and instructions that fit the page are patched.
> + * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
> + */
> +static int __do_patch_instructions_mm(u32 *addr, u32 *code, size_t len, bool repeat_instr)
> +{
> +	struct mm_struct *patching_mm, *orig_mm;
> +	unsigned long pfn = get_patch_pfn(addr);
> +	unsigned long text_poke_addr;
> +	spinlock_t *ptl;
> +	u32 *patch_addr;
> +	pte_t *pte;
> +	int err;
> +
> +	patching_mm = __this_cpu_read(cpu_patching_context.mm);
> +	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
> +	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
> +
> +	pte = get_locked_pte(patching_mm, text_poke_addr, &ptl);
> +	if (!pte)
> +		return -ENOMEM;
> +
> +	__set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
> +
> +	/* order PTE update before use, also serves as the hwsync */
> +	asm volatile("ptesync" ::: "memory");
> +
> +	/* order context switch after arbitrary prior code */
> +	isync();
> +
> +	orig_mm = start_using_temp_mm(patching_mm);
> +
> +	err = __patch_instructions(patch_addr, code, len, repeat_instr);
> +
> +	/* context synchronisation performed by __patch_instructions */
> +	stop_using_temp_mm(patching_mm, orig_mm);
> +
> +	pte_clear(patching_mm, text_poke_addr, pte);
> +	/*
> +	 * ptesync to order PTE update before TLB invalidation done
> +	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
> +	 */
> +	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	return err;
> +}
> +
> +/*
> + *
Re: [PATCH 02/12] powerpc/pseries: Restructure hvc_get_chars() endianness
Benjamin Gray writes:

> Sparse reports an endian mismatch in hvc_get_chars().
>
> At first it seemed like the retbuf should be __be64[], but actually
> retbuf holds serialized registers returned by the hypervisor call, so
> it's correctly CPU endian typed.
>
> Instead, it is the be64_to_cpu() that's misleading. The intent is to do
> a byte swap on a little endian CPU. The swap is required because we
> wanted to store the register values to memory without 'swapping' bytes,
> so that the high order byte of the first register is the first byte
> in the result buffer.
>
> In diagram form, on a LE CPU with the letters representing the return
> string bytes:
>
> (register bytes)     A B C D E F G H  I J K L M N O P
> (retbuf mem bytes)   H G F E D C B A  P O N M L K J I
> (buf/lbuf mem bytes) A B C D E F G H  I J K L M N O P
>
> So retbuf stores the registers in CPU endian, and buf always stores in
> big endian.
>
> So replace the byte swap function with cpu_to_be64() and cast lbuf as an
> array of __be64 to match the semantics closer.
>
> Signed-off-by: Benjamin Gray
> ---
>  arch/powerpc/platforms/pseries/hvconsole.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hvconsole.c b/arch/powerpc/platforms/pseries/hvconsole.c
> index 1ac52963e08b..647718a15e78 100644
> --- a/arch/powerpc/platforms/pseries/hvconsole.c
> +++ b/arch/powerpc/platforms/pseries/hvconsole.c
> @@ -29,11 +29,11 @@ int hvc_get_chars(uint32_t vtermno, char *buf, int count)
>  {
>  	long ret;
>  	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> -	unsigned long *lbuf = (unsigned long *)buf;
> +	__be64 *lbuf = (__be64 __force *)buf;
>
>  	ret = plpar_hcall(H_GET_TERM_CHAR, retbuf, vtermno);
> -	lbuf[0] = be64_to_cpu(retbuf[1]);
> -	lbuf[1] = be64_to_cpu(retbuf[2]);
> +	lbuf[0] = cpu_to_be64(retbuf[1]);
> +	lbuf[1] = cpu_to_be64(retbuf[2]);
>
>  	if (ret == H_SUCCESS)
>  		return retbuf[0];
>

There is no functional change in this patch. It clarifies that buf is
expected to be in big-endian format while retbuf contains native-endian
data. Not sure why this was not picked up.

Reviewed-by: Aneesh Kumar K.V
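[Editor's note: for readers who want to see the diagram in action, a small
userspace illustration follows; it assumes a little-endian host and uses the
compiler builtin in place of the kernel's cpu_to_be64(). It is not code from
the patch.]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t reg = 0x4142434445464748ULL;	/* "ABCDEFGH" held in a register */
	uint64_t be = __builtin_bswap64(reg);	/* on LE, swap == cpu_to_be64() */
	unsigned char *p = (unsigned char *)&be;

	/* Prints "ABCDEFGH": the high-order register byte lands first. */
	for (int i = 0; i < 8; i++)
		putchar(p[i]);
	putchar('\n');
	return 0;
}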
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Mon, Oct 30, 2023, Chao Gao wrote:
> On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> >From: Chao Peng
> >
> >In confidential computing usages, whether a page is private or shared is
> >necessary information for KVM to perform operations like page fault
> >handling, page zapping etc. There are other potential use cases for
> >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> >or exec-only, etc.) without having to modify memslots.
> >
> >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> >userspace to operate on the per-page memory attributes.
> >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >    a guest memory range.
> >
> >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >    memory attributes.
>
> This ioctl() is already removed. So, the changelog is out-of-date and needs
> an update.

Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.

> >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> >+:Architectures: x86
> >+:Type: vm ioctl
> >+:Parameters: struct kvm_memory_attributes(in)
>
>                                             ^ add one space here?

Ah, yeah, that does appear to be the standard.

> >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> >+					  struct kvm_gfn_range *range)
> >+{
> >+	/*
> >+	 * Unconditionally add the range to the invalidation set, regardless of
> >+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> >+	 * if KVM supports RWX attributes in the future and the attributes are
> >+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> >+	 * adding the range allows KVM to require that MMU invalidations add at
> >+	 * least one range between begin() and end(), e.g. allows KVM to detect
> >+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
>
> Relaxing
>
> >@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > 	case KVM_CAP_BINARY_STATS_FD:
> > 	case KVM_CAP_SYSTEM_EVENT_DATA:
> > 		return 1;
> >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >+	case KVM_CAP_MEMORY_ATTRIBUTES:
> >+		u64 attrs = kvm_supported_mem_attributes(kvm);
> >+
> >+		r = -EFAULT;
> >+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> >+			goto out;
> >+		r = 0;
> >+		break;
>
> This cannot work, e.g., no @argp in this function and is fixed by a later
> commit:
>
>  fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")

I'll post a fixup patch for all of these, thanks much!
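[Editor's note: one plausible shape for that fixup, as an illustration only
(an assumption, not quoted from the later commit): since
kvm_vm_ioctl_check_extension_generic() already returns its result to
userspace directly, the supported-attributes mask can simply be returned
without copy_to_user().]

#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
	case KVM_CAP_MEMORY_ATTRIBUTES:
		/* Return the mask directly; no @argp is needed here. */
		return kvm_supported_mem_attributes(kvm);
#endif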
Re: Several kmemleak reports + "refcount_t: underflow; use-after-free" at boot when OF_UNITTEST + OF_OVERLAY is set (Kernel v6.6-rc6, PowerMac G5 11,2)
On Wed, Oct 18, 2023 at 4:38 PM Erhard Furtner wrote:
>
> Greetings!
>
> Getting this at every boot on my G5 with kernel v6.6-rc6 with OF_UNITTEST
> and OF_OVERLAY selected:
>
> [...]
> ### dt-test ### EXPECT \ : OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### pass of_unittest_lifecycle():3189
> OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### EXPECT / : OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### EXPECT \ : [ cut here ]
> ### dt-test ### EXPECT \ : WARNING: <>
> ### dt-test ### EXPECT \ : refcount_t: underflow; use-after-free.
> ### dt-test ### EXPECT \ :

The test tells you to expect a use-after-free...

> ---[ end trace <> ]---
> ### dt-test ### pass of_unittest_lifecycle():3209
> [ cut here ]
> refcount_t: underflow; use-after-free.

Then you get a use-after-free. Looks like it is working as designed. I
believe it's the same with kmemleak.

Note that running DT unittests also taints the kernel. That's because
they are not meant to be run on a production system.

Rob
Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
On 10/27/23 20:21, Sean Christopherson wrote:

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel. A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings. But it's far from
guaranteed that an end() without a start() is fatal or even problematic
to anything other than the target VM, e.g. the underlying bug could simply
be a duplicate call to end(). And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Reviewed-by: Paolo Bonzini

Signed-off-by: Sean Christopherson
---
  virt/kvm/kvm_main.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
  	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
  	 */
  	kvm->mmu_invalidate_in_progress--;
+	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
  }
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
  	 */
  	if (wake)
  		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-	BUG_ON(kvm->mmu_invalidate_in_progress < 0);
  }
  
  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng

Currently in mmu_notifier invalidate path, hva range is recorded and
then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the to be introduced private memory, a page
fault may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction
On 10/27/23 20:21, Sean Christopherson wrote:

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call
and the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as
the bug that *almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
On 10/27/23 20:21, Sean Christopherson wrote:

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path. All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson
---
  arch/powerpc/kvm/powerpc.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
  		break;
  #endif
  	case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+		BUILD_BUG();
+#endif
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
  		r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		r = 1;
  #else
-		r = 0;
+		r = 1;
  #endif
  		break;
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
On 10/27/23 20:21, Sean Christopherson wrote:

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior. Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared. PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

The new #define should also imply KVM_CAP_SYNC_MMU, or even:
KVM_CAP_SYNC_MMU should just be enabled by all architectures at this
point. You don't need to care about it, I have a larger series for caps
that are enabled by all architectures and I'll post it for 6.8.

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/27/23 20:21, Sean Christopherson wrote:

+	if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		size = sizeof(struct kvm_userspace_memory_region);

This also needs a memset(&mem, 0, sizeof(mem)), otherwise the
out-of-bounds access of the commit message becomes a kernel stack read.

Probably worth adding a check on valid flags here.

Paolo

+	else
+		size = sizeof(struct kvm_userspace_memory_region2);
+
+	/* Ensure the common parts of the two structs are identical. */
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> From: Chao Peng
>
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to be introduced private memory, a page

Is there a missing word here?

> fault may not have a hva associated, checking gfn(gpa) makes more sense.
>
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
>
> Suggested-by: Sean Christopherson
> Cc: Xu Yilun
> Signed-off-by: Chao Peng
> Reviewed-by: Fuad Tabba
> Tested-by: Fuad Tabba
> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
> Signed-off-by: Sean Christopherson
> ---
>  arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
>  arch/x86/kvm/vmx/vmx.c   | 11 +++++------
>  include/linux/kvm_host.h | 33 ++++++++++++++++++++-------
>  virt/kvm/kvm_main.c      | 41 +++++++++++++++++++++++-----------
>  4 files changed, 64 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f7901cb4d2fa..d33657d61d80 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>   *
>   * There are several ways to safely use this helper:
>   *
> - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
> + * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
>   *   consuming it.  In this case, mmu_lock doesn't need to be held during the
>   *   lookup, but it does need to be held while checking the MMU notifier.
>   *
> @@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6245,7 +6245,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6255,7 +6257,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..6e502ba93141 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	/*
> -	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> -	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> -	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> -	 * is used for the retry check.
> +	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
> +	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
> +	 * be impossible for userspace to create a memslot for the APIC when
> +	 * APICv is enabled, but paranoia won't hurt in this case.
>  	 */
>  	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
>  	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> @@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	read_lock(&vcpu->kvm->mmu_lock);
> -	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> -				     gfn_to_hva_memslot(slot, gfn))) {
> +	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  		goto out;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb6c6109fdca..11d091688346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
John Paul Adrian Glaubitz writes:

> There is some non-x86 hardware like the Amiga that still uses
> PCMCIA-style networking cards on machines like the A600 and A1200. So,
> unless these drivers are actually causing problems, I would rather not
> see them go yet.

There is a cost to maintaining these drivers, so I would like to see more
information in the report, at least:

* exact driver and hardware used
* last kernel version tested
* kernel log messages from a successful case

Here's a good example:

https://lore.kernel.org/linux-wireless/20231024014302.0a0b79b0@mir/

--
https://patchwork.kernel.org/project/linux-wireless/list/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 5:53 PM David Matlack wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
>
> Is there a missing word here?

No but there could be missing hyphens ("for the to-be-introduced
private memory"); possibly a "soon" could help parsing and that is
what you were talking about?

> > 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
>
> I don't think this is incorrect, but I was a little surprised to see this
> here rather than in end() when mmu_invalidate_in_progress decrements to
> 0.

I think that would be incorrect on the very first start?

> > +	}
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	lockdep_assert_held_write(&kvm->mmu_lock);
>
> Does this compile/function on KVM architectures with
> !KVM_HAVE_MMU_RWLOCK?

Yes:

  #define lockdep_assert_held_write(l)	\
  	lockdep_assert(lockdep_is_held_type(l, 0))

where 0 is the lock-held type used by lock_acquire_exclusive. In turn,
that is what you get for a spinlock or mutex, in addition to a rwlock or
rwsem that is taken for write. Instead, lockdep_assert_held() asserts
that the lock is taken without asserting a particular lock-held type.

> > @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>
> Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
> we're at it?

Yes, good idea.

Paolo
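[Editor's note: a tiny illustration of the distinction Paolo describes — a
sketch, not code from the series; demo_lock and demo() are made-up names.]

static DEFINE_SPINLOCK(demo_lock);

static void demo(void)
{
	spin_lock(&demo_lock);
	lockdep_assert_held(&demo_lock);	/* passes for any held type */
	lockdep_assert_held_write(&demo_lock);	/* passes: spinlocks are always
						 * acquired exclusively; only an
						 * rwlock held for *read* would
						 * trip this assertion */
	spin_unlock(&demo_lock);
}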
Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
On 10/27/23 20:21, Sean Christopherson wrote:

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook works,
but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found. Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Could have been split in two patches, but it's fine anyway.

Reviewed-by: Paolo Bonzini

Paolo
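[Editor's note: for reference, a sketch of the two-byte return tuple being
described; the field and type names here are illustrative guesses, not
necessarily those used by the patch.]

typedef struct kvm_mmu_notifier_return {
	bool ret;		/* the notifier-specific return value */
	bool found_memslot;	/* did the range overlap a memslot, i.e.
				 * was mmu_lock acquired? */
} kvm_mn_ret_t;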
Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
On 10/27/23 20:21, Sean Christopherson wrote:

Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed. Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing
so resulted in .on_lock() and .on_unlock() having divergent and
asymmetric behavior, and set future developers up for failure, i.e. all
but asked for bugs where KVM relied on using .on_unlock() to try to run
a callback while holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

This is what David suggested to do in patch 3, FWIW.

Reviewed-by: Paolo Bonzini

Paolo

Reported-by: Isaku Yamahata
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson
---
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On 10/27/23 20:21, Sean Christopherson wrote:

@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
  		 * the second or later invocation of the handler).
  		 */
  		gfn_range.arg = range->arg;
+
+		/*
+		 * HVA-based notifications aren't relevant to private
+		 * mappings as they don't have a userspace mapping.

It's confusing who "they" is. Maybe

	 * HVA-based notifications provide a userspace address,
	 * and as such are only relevant for shared mappings.

Paolo

+		 */
+		gfn_range.only_private = false;
+		gfn_range.only_shared = true;
  		gfn_range.may_block = range->may_block;
  
  		/*
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear
to be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even
when the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial
usage is expected to always report single pages. It's entirely possible,
likely even, that KVM will someday support sub-page granularity faults,
e.g. Intel's sub-page protection feature allows for additional
protections at 128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy
Cc: David Matlack
Suggested-by: Sean Christopherson
Co-developed-by: Yu Zhang
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

---
  Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
  arch/x86/kvm/x86.c             |  1 +
  include/linux/kvm_host.h       | 11 ++++++++++
  include/uapi/linux/kvm.h       |  8 ++++++++
  4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
  spec refer, https://github.com/riscv/riscv-sbi-doc.
  
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
  ::
  
      /* KVM_EXIT_NOTIFY */
@@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause
  CPU stuck (due to event windows don't open up) and make the CPU unavailable to
  host or other VMs.
  
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the
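[Editor's note: a sketch of how userspace might consume the documented
contract; handle_memory_fault() is a hypothetical helper and the field naming
follows the prose's "kvm_run.memory_fault", not necessarily the final uAPI.]

/* run is the mmap'ed struct kvm_run for this vCPU. */
int ret = ioctl(vcpu_fd, KVM_RUN, 0);

if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
	/* The exit_reason is only trustworthy for these two errnos. */
	handle_memory_fault(run->memory_fault.gpa, run->memory_fault.size);
}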
Re: [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
On 10/27/23 20:21, Sean Christopherson wrote:

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance. KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running. To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox
Co-developed-by: Vlastimil Babka
Signed-off-by: Vlastimil Babka
Signed-off-by: Sean Christopherson

I think this could even be "From: Vlastimil", but no biggie.

Paolo

---
  include/linux/pagemap.h | 19 ++++++++++++-
  mm/compaction.c         | 43 +++++++++++++++++++++-----------
  mm/migrate.c            |  2 ++
  3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
  	/* writeback related tags are not used */
  	AS_NO_WRITEBACK_TAGS = 5,
  	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
+	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
  };
  
  /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct address_space *mapping)
  	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
  }
  
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+	/*
+	 * It's expected unmovable mappings are also unevictable. Compaction
+	 * migrate scanner (isolate_migratepages_block()) relies on this to
+	 * reduce page locking.
+	 */
+	set_bit(AS_UNEVICTABLE, &mapping->flags);
+	set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+	return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
  {
  	return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  
  	/* Time to isolate some pages for migration */
  	for (; low_pfn < end_pfn; low_pfn++) {
+		bool is_dirty, is_unevictable;
  
  		if (skip_on_failure && low_pfn >= next_skip_pfn) {
  			/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  		if (!folio_test_lru(folio))
  			goto isolate_fail_put;
  
+		is_unevictable = folio_test_unevictable(folio);
+
  		/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
  			goto isolate_fail_put;
  
  		/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
  			goto isolate_fail_put;
  
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
  
  			/*
  			 * Only folios without mappings or that have
-			 * a ->migrate_folio callback are possible to
-			 * migrate without blocking. However, we may
-			 * be racing with truncation, which can free
-			 * the mapping. Truncation holds the folio lock
-			 * until after the folio is removed from the page
-			 * cache so holding it ourselves is sufficient.
+			 * a ->migrate_folio callback are possible to migrate
+			 * without blocking.
+			 *
+			 * Folios from unmovable mappings are not migratable.
+			 *
+			 * However, we can be racing with truncation, which can
+			 * free the mapping that we need to check. Truncation
+			 * holds the folio loc
Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
On 10/27/23 20:21, Sean Christopherson wrote:

Export anon_inode_getfile_secure() so that it can be used by KVM to
create and manage file-based guest memory without need a fullblow

without introducing a full-blown

Otherwise,

Reviewed-by: Paolo Bonzini

Paolo

filesystem. The "standard" anon_inode_getfd() doesn't work for KVM's use
case as KVM needs a unique inode for each file, e.g. to be able to
independently manage the size and lifecycle of a given file.
Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
On 10/27/23 20:22, Sean Christopherson wrote:

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to
reduce the probability of exiting to userspace with a stale
run->exit_reason that *appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of
using e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also
likely utilize the same functionality to let userspace "intercept" and
handle memory faults when the userspace mapping is missing, i.e. when
fast gup() fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a non-zero,
negative return code, e.g. -EFAULT or -EHWPOISON. And because KVM already
returns -EFAULT in many paths, there's a relatively high probability that
KVM could return -EFAULT without setting run->exit_reason, in which case
reporting KVM_EXIT_UNKNOWN is much better than reporting whatever exit
reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson
---
  arch/x86/kvm/x86.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
  {
  	int r;
  
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
  	vcpu->arch.l1tf_flush_l1d = true;
  
  	for (;;) {

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
On 10/27/23 20:22, Sean Christopherson wrote:

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
On 10/27/23 20:22, Sean Christopherson wrote:

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex
(understatement), difficult to debug, don't support running as nested
guests, and require hardware that isn't universally accessible. I.e.
relying on SEV-SNP or TDX for maintaining guest private memory isn't a
realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

with one nit:

+---------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types. The 1-setting of bit @n

s/support/supported/

Paolo
Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes
On 10/27/23 20:21, Sean Christopherson wrote:

Non-KVM people, please take a gander at two small-ish patches buried in
the middle of this series:

  fs: Export anon_inode_getfile_secure() for use by KVM
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8, reviews (and
acks!) would be much appreciated. Note, adding AS_UNMOVABLE isn't
strictly required as it's "just" an optimization, but we'd prefer to have
it in place straightaway.

Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes. The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes out, I'll do this to straighten the history:

  git checkout kvm/next
  git tag -s -f kvm-gmem HEAD
  git reset --hard v6.7-rc1
  git merge tags/kvm-gmem
  # fix conflict with Christian Brauner's VFS series
  git commit
  git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for non-upstream
patches. But it's mostly because I want all dependencies to be able to
move on and be developed on top of stock kvm/next).

Paolo
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 9:53 AM David Matlack wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng
> >
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
>
> What is the reason to separate range_add() from begin()?

Nevermind, I see how it's needed in kvm_mmu_unmap_gfn_range().
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 10:01 AM Paolo Bonzini wrote:
>
> On Mon, Oct 30, 2023 at 5:53 PM David Matlack wrote:
> >
> > On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > > From: Chao Peng
> > >
> > > Currently in mmu_notifier invalidate path, hva range is recorded and
> > > then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to be introduced private memory, a page
> >
> > Is there a missing word here?
>
> No but there could be missing hyphens ("for the to-be-introduced
> private memory"); possibly a "soon" could help parsing and that is
> what you were talking about?

Ah that explains it :)

> > > 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
> >
> > I don't think this is incorrect, but I was a little surprised to see this
> > here rather than in end() when mmu_invalidate_in_progress decrements to
> > 0.
>
> I think that would be incorrect on the very first start?

Good point. KVM could initialize start/end before registering notifiers,
but that's extra code.
Re: [PATCH v7 1/5] powerpc/code-patching: introduce patch_instructions()
Hi Aneesh,

On 30/10/23 6:32 pm, Aneesh Kumar K.V wrote:

Hari Bathini writes:

patch_instruction() entails setting up pte, patching the instruction,
clearing the pte and flushing the tlb. If multiple instructions need
to be patched, every instruction would have to go through the above
drill unnecessarily. Instead, introduce patch_instructions() function
that sets up the pte, clears the pte and flushes the tlb only once
per page range of instructions to be patched. Duplicate most of the
patch_instruction() code instead of merging with it, to avoid the
performance degradation observed on ppc32, for patch_instruction(),
with the code path merged. Also, setup poking_init() always as BPF
expects poking_init() to be setup even when STRICT_KERNEL_RWX is off.

Signed-off-by: Hari Bathini
Acked-by: Song Liu

A lot of this is a duplicate of patch_instruction(). Can we consolidate
things between them?

True. The code was consolidated till v5 but had to duplicate most of it
to avoid the performance degradation reported on ppc32:

https://lore.kernel.org/all/6cceb564-8b52-4d98-9118-92a914f48...@csgroup.eu/

---

Changes in v7:
* Fixed crash observed with !STRICT_RWX.

  arch/powerpc/include/asm/code-patching.h |   1 +
  arch/powerpc/lib/code-patching.c         | 141 ++++++++++++++++++++++-
  2 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 3f881548fb61..0e29ccf903d0 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -74,6 +74,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
  int patch_branch(u32 *addr, unsigned long target, int flags);
  int patch_instruction(u32 *addr, ppc_inst_t instr);
  int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
+int patch_instructions(u32 *addr, u32 *code, size_t len, bool repeat_instr);
  
  static inline unsigned long patch_site_addr(s32 *site)
  {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index b00112d7ad46..e1c1fd9246d8 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -204,9 +204,6 @@ void __init poking_init(void)
  {
  	int ret;
  
-	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
-		return;
-
  	if (mm_patch_enabled())
  		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
  					"powerpc/text_poke_mm:online",
@@ -378,6 +375,144 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)
  }
  NOKPROBE_SYMBOL(patch_instruction);
  
+static int __patch_instructions(u32 *patch_addr, u32 *code, size_t len, bool repeat_instr)
+{
+	unsigned long start = (unsigned long)patch_addr;
+
+	/* Repeat instruction */
+	if (repeat_instr) {
+		ppc_inst_t instr = ppc_inst_read(code);
+
+		if (ppc_inst_prefixed(instr)) {
+			u64 val = ppc_inst_as_ulong(instr);
+
+			memset64((u64 *)patch_addr, val, len / 8);
+		} else {
+			u32 val = ppc_inst_val(instr);
+
+			memset32(patch_addr, val, len / 4);
+		}
+	} else {
+		memcpy(patch_addr, code, len);
+	}
+
+	smp_wmb();	/* smp write barrier */
+	flush_icache_range(start, start + len);
+	return 0;
+}
+
+/*
+ * A page is mapped and instructions that fit the page are patched.
+ * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
+ */
+static int __do_patch_instructions_mm(u32 *addr, u32 *code, size_t len, bool repeat_instr)
+{
+	struct mm_struct *patching_mm, *orig_mm;
+	unsigned long pfn = get_patch_pfn(addr);
+	unsigned long text_poke_addr;
+	spinlock_t *ptl;
+	u32 *patch_addr;
+	pte_t *pte;
+	int err;
+
+	patching_mm = __this_cpu_read(cpu_patching_context.mm);
+	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
+	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+
+	pte = get_locked_pte(patching_mm, text_poke_addr, &ptl);
+	if (!pte)
+		return -ENOMEM;
+
+	__set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
+
+	/* order PTE update before use, also serves as the hwsync */
+	asm volatile("ptesync" ::: "memory");
+
+	/* order context switch after arbitrary prior code */
+	isync();
+
+	orig_mm = start_using_temp_mm(patching_mm);
+
+	err = __patch_instructions(patch_addr, code, len, repeat_instr);
+
+	/* context synchronisation performed by __patch_instructions */
+	stop_using_temp_mm(patching_mm, orig_mm);
+
+	pte_clear(patching_mm, text_poke_addr, pte);
+	/*
+	 * ptesync to order PTE update before TLB invalidation done
+	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
+	 */
+	local_flush_tlb_page_psize(patchin
[PATCH] scsi: ibmvfc: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings [1]
and as such we should prefer more robust and less ambiguous string
interfaces.

We expect these fields to be NUL-terminated as the property names from
which they are derived are also NUL-terminated.

Moreover, NUL-padding is not required as our destination buffers are
already zero-allocated and any future NUL-byte assignments are redundant
(like the ones that strncpy() does).

ibmvfc_probe() ->
|	struct ibmvfc_host *vhost;
|	struct Scsi_Host *shost;
	...
|	shost = scsi_host_alloc(&driver_template, sizeof(*vhost));
	...
	**side note: is this a bug? Looks like a typo to me          ^**
	...
|	vhost = shost_priv(shost);
	...

where shost_priv() is:
|	static inline void *shost_priv(struct Scsi_Host *shost)
|	{
|		return (void *)shost->hostdata;
|	}

... and scsi_host_alloc() ->
|	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);

And for login_info->..., NUL-padding is also not required as it is
explicitly memset to 0:
|	memset(login_info, 0, sizeof(*login_info));

Considering the above, a suitable replacement is `strscpy` [2] due to the
fact that it guarantees NUL-termination on the destination buffer without
unnecessary NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-harden...@vger.kernel.org
Signed-off-by: Justin Stitt
---
Note: build-tested only.

Found with: $ rg "strncpy\("
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index ce9eb00e2ca0..e73a39b1c832 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1455,7 +1455,7 @@ static void ibmvfc_gather_partition_info(struct ibmvfc_host *vhost)
 
 	name = of_get_property(rootdn, "ibm,partition-name", NULL);
 	if (name)
-		strncpy(vhost->partition_name, name, sizeof(vhost->partition_name));
+		strscpy(vhost->partition_name, name, sizeof(vhost->partition_name));
 	num = of_get_property(rootdn, "ibm,partition-no", NULL);
 	if (num)
 		vhost->partition_number = *num;
@@ -1498,13 +1498,15 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
 	login_info->async.va = cpu_to_be64(vhost->async_crq.msg_token);
 	login_info->async.len = cpu_to_be32(async_crq->size *
 					    sizeof(*async_crq->msgs.async));
-	strncpy(login_info->partition_name, vhost->partition_name, IBMVFC_MAX_NAME);
-	strncpy(login_info->device_name,
-		dev_name(&vhost->host->shost_gendev), IBMVFC_MAX_NAME);
+	strscpy(login_info->partition_name, vhost->partition_name,
+		sizeof(login_info->partition_name));
+
+	strscpy(login_info->device_name,
+		dev_name(&vhost->host->shost_gendev), sizeof(login_info->device_name));
 
 	location = of_get_property(of_node, "ibm,loc-code", NULL);
 	location = location ? location : dev_name(vhost->dev);
-	strncpy(login_info->drc_name, location, IBMVFC_MAX_NAME);
+	strscpy(login_info->drc_name, location, sizeof(login_info->drc_name));
 }
 
 /**

---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231030-strncpy-drivers-scsi-ibmvscsi-ibmvfc-c-ccfce3255d58

Best regards,
--
Justin Stitt
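
To see the behavioral difference the changelog describes, here is a small,
self-contained userspace sketch. strscpy() itself is kernel-only (declared
in <linux/string.h>), so strscpy_demo() below is a hypothetical stand-in
with the same contract:

	#include <stdio.h>
	#include <string.h>

	/* Stand-in for the kernel's strscpy(): copies at most size-1 bytes,
	 * always NUL-terminates, and returns -E2BIG (-7 here) on truncation. */
	static long strscpy_demo(char *dst, const char *src, size_t size)
	{
		size_t len = strnlen(src, size);

		if (!size)
			return -7; /* -E2BIG */
		if (len >= size) {
			memcpy(dst, src, size - 1);
			dst[size - 1] = '\0';
			return -7; /* truncated: -E2BIG */
		}
		memcpy(dst, src, len + 1);
		return (long)len;
	}

	int main(void)
	{
		char a[8], b[8];

		/* strncpy() leaves dst WITHOUT a terminating NUL when src is
		 * at least as long as the buffer. */
		strncpy(a, "partition-name", sizeof(a));
		printf("strncpy terminated? %s\n",
		       memchr(a, '\0', sizeof(a)) ? "yes" : "no");

		/* strscpy()-style copy: always terminated, truncation reported. */
		long ret = strscpy_demo(b, "partition-name", sizeof(b));
		printf("strscpy-style: \"%s\" ret=%ld\n", b, ret);
		return 0;
	}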
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> >
> > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > +		size = sizeof(struct kvm_userspace_memory_region);
>
> This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> access of the commit message becomes a kernel stack read.

Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
though, as '0' is a valid file descriptor and a valid file offset.

> Probably worth adding a check on valid flags here.

Definitely needed. There's a very real bug here. But rather than duplicate
flags checking or plumb @ioctl all the way to __kvm_set_memory_region(), now
that we have the fancy guard(mutex) and there are no internal calls to
kvm_set_memory_region(), what if we:

  1. Acquire/release slots_lock in __kvm_set_memory_region()
  2. Call kvm_set_memory_region() from x86 code for the internal memslots
  3. Disallow *any* flags for internal memslots
  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
     only for KVM_SET_USER_MEMORY_REGION2

E.g. this over ~5 patches

---
 arch/x86/kvm/x86.c       |  2 +-
 include/linux/kvm_host.h |  4 +--
 virt/kvm/kvm_main.c      | 65 +---
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..dd3e2017366c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12478,7 +12478,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 		m.guest_phys_addr = gpa;
 		m.userspace_addr = hva;
 		m.memory_size = size;
-		r = __kvm_set_memory_region(kvm, &m);
+		r = kvm_set_memory_region(kvm, &m);
 		if (r < 0)
 			return ERR_PTR_USR(r);
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 687589ce9f63..fbb98efe8200 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1170,7 +1170,7 @@ static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_
  * -- just change its flags
  *
  * Since flags can be changed by some of these operations, the following
- * differentiation is the best we can do for __kvm_set_memory_region():
+ * differentiation is the best we can do for __kvm_set_memory_region().
  */
 enum kvm_mr_change {
 	KVM_MR_CREATE,
@@ -1181,8 +1181,6 @@ enum kvm_mr_change {
 
 int kvm_set_memory_region(struct kvm *kvm,
 			  const struct kvm_userspace_memory_region2 *mem);
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23633984142f..39ceee2f67f2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,28 +1608,6 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region2 *mem)
-{
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
-
-	if (kvm_arch_has_private_mem(kvm))
-		valid_flags |= KVM_MEM_PRIVATE;
-
-	/* Dirty logging private memory is not currently supported. */
-	if (mem->flags & KVM_MEM_PRIVATE)
-		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-
-#ifdef __KVM_HAVE_READONLY_MEM
-	valid_flags |= KVM_MEM_READONLY;
-#endif
-
-	if (mem->flags & ~valid_flags)
-		return -EINVAL;
-
-	return 0;
-}
-
 static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 {
 	struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
@@ -2014,11 +1992,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * space.
  *
  * Discontiguous memory is allowed, mostly for framebuffers.
- *
- * Must be called holding kvm->slots_lock for write.
  */
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem)
+static int __kvm_set_memory_region(struct kvm *kvm,
+				   const struct kvm_userspace_memory_region2 *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2028,9 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(kvm, mem);
-	if (r)
-		return r;
+	guard(mutex)(&kvm->slots_lock);
 
 	as_id = mem->slot >> 16;
 	id = (u16)mem->slot;
@@ -2139,27 +2113,42 @@ int __kvm_set_memory_region(str
[linux-next:master] BUILD REGRESSION c503e3eec382ac708ee7adf874add37b77c5d312
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: c503e3eec382ac708ee7adf874add37b77c5d312  Add linux-next specific files for 20231030

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202310052201.anvbpgpr-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310121157.arky475m-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310171905.azfrkoid-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310261059.usl6vstf-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302025.pokm9ves-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302043.as36ufed-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302206.pkr5ebdi-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302338.mppqar10-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310310055.rdwlonpr-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/iio/imu/bosch,bma400.yaml
aarch64-linux-ld: drivers/cxl/core/pci.c:921:(.text+0xbbc): undefined reference to `pci_print_aer'
arch/loongarch/kvm/mmu.c:411:6: warning: no previous prototype for 'kvm_unmap_gfn_range' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:420:48: error: invalid use of undefined type 'struct kvm_gfn_range'
arch/loongarch/kvm/mmu.c:422:1: error: control reaches end of non-void function [-Werror=return-type]
arch/loongarch/kvm/mmu.c:424:6: warning: no previous prototype for 'kvm_set_spte_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:456:6: warning: no previous prototype for 'kvm_age_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:468:6: warning: no previous prototype for 'kvm_test_age_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:787:24: error: 'struct kvm' has no member named 'mmu_invalidate_seq'; did you mean 'mn_invalidate_lock'?
arch/loongarch/kvm/mmu.c:810:13: error: implicit declaration of function 'mmu_invalidate_retry_hva' [-Werror=implicit-function-declaration]
arch/riscv/include/asm/mmio.h:67:(.text+0xd66): undefined reference to `pci_print_aer'
csky-linux-ld: pci.c:(.text+0x6e8): undefined reference to `pci_print_aer'
drivers/cxl/core/pci.c:921: undefined reference to `pci_print_aer'
drivers/cxl/core/pci.c:921:(.text+0xbc0): undefined reference to `pci_print_aer'
drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn35/dcn35_clk_mgr.c:1105 dcn35_clk_mgr_construct() warn: variable dereferenced before check 'ctx->dc_bios' (see line 1032)
drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn35/dcn35_clk_mgr.c:1105 dcn35_clk_mgr_construct() warn: variable dereferenced before check 'ctx->dc_bios->integrated_info' (see line 1032)
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:286:45: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:286:52: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu14/smu_v14_0.c:72:45: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu14/smu_v14_0.c:72:52: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/hwtracing/coresight/coresight-etm4x-core.c:1183:24: warning: result of comparison of constant 256 with expression of type 'u8' (aka 'unsigned char') is always false [-Wtautological-constant-out-of-range-compare]
include/linux/atomic/atomic-arch-fallback.h:384:27: error: implicit declaration of function 'arch_cmpxchg_local'; did you mean 'raw_cmpxchg_local'? [-Werror=implicit-function-declaration]
include/linux/compiler.h:212:29: error: pasting "__addressable_" and "(" does not give a valid preprocessing token
include/linux/export.h:47:9: error: pasting "__export_symbol_" and "(" does not give a valid preprocessing token
include/linux/stddef.h:8:15: error: expected declaration specifiers or '...' before '(' token
include/linux/stddef.h:8:16: error: expected identifier or '(' before 'void'
include/linux/stddef.h:8:23: error: expected ')' before numeric constant
include/linux/stddef.h:8:24: error: pasting ")" and "218" does not give a valid preprocessing token
kernel/bpf/task_iter.c:917:29: error: use of undeclared identifier 'CSS_TASK_ITER_THREADED'
kernel/bpf/task_iter.c:917:7: error: use of undeclared iden
[PATCH] scsi: ibmvscsi: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings [1]
and as such we should prefer more robust and less ambiguous string
interfaces.

We expect partition_name to be NUL-terminated based on its usage with
format strings:
|	dev_info(hostdata->dev, "host srp version: %s, "
|		 "host partition %s (%d), OS %d, max io %u\n",
|		 hostdata->madapter_info.srp_version,
|		 hostdata->madapter_info.partition_name,
|		 be32_to_cpu(hostdata->madapter_info.partition_number),
|		 be32_to_cpu(hostdata->madapter_info.os_type),
|		 be32_to_cpu(hostdata->madapter_info.port_max_txu[0]));
...
|	len = snprintf(buf, PAGE_SIZE, "%s\n",
|		       hostdata->madapter_info.partition_name);

Moreover, NUL-padding is not required as madapter_info is explicitly
memset to 0:
|	memset(&hostdata->madapter_info, 0x00,
|	       sizeof(hostdata->madapter_info));

Considering the above, a suitable replacement is `strscpy` [2] due to the
fact that it guarantees NUL-termination on the destination buffer without
unnecessary NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-harden...@vger.kernel.org
Signed-off-by: Justin Stitt
---
Note: build-tested only.

Found with: $ rg "strncpy\("
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 59599299615d..71f3e9563520 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -266,7 +266,7 @@ static void set_adapter_info(struct ibmvscsi_host_data *hostdata)
 	dev_info(hostdata->dev, "SRP_VERSION: %s\n", SRP_VERSION);
 	strcpy(hostdata->madapter_info.srp_version, SRP_VERSION);
 
-	strncpy(hostdata->madapter_info.partition_name, partition_name,
+	strscpy(hostdata->madapter_info.partition_name, partition_name,
 		sizeof(hostdata->madapter_info.partition_name));
 
 	hostdata->madapter_info.partition_number =

---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231030-strncpy-drivers-scsi-ibmvscsi-ibmvscsi-c-3b5019ce50ad

Best regards,
--
Justin Stitt
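
One hedged follow-up thought, not part of the patch itself: unlike
strncpy(), strscpy() reports truncation through its return value, so a
caller that cares could check it. A sketch of what that might look like
(the dev_warn() message is illustrative only):

	ssize_t ret;

	ret = strscpy(hostdata->madapter_info.partition_name, partition_name,
		      sizeof(hostdata->madapter_info.partition_name));
	if (ret == -E2BIG)
		dev_warn(hostdata->dev, "partition name was truncated\n");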
Re: [PATCH v8 00/30] Add support for QMC HDLC, framer infrastructure and PEF2256 framer
On Wed, Oct 25, 2023 at 12:32:15PM -0700, Jakub Kicinski wrote: > On Wed, 25 Oct 2023 17:00:51 +0200 Herve Codina wrote: > > > Which way will those patches go? Via some FSL SoC tree? > > > > This series seems mature now. > > What is the plan next in order to have it applied ? > > > > Don't hesitate to tell me if you prefer split series. > > FWIW we are happy to take the drivers/net/ parts if there is no hard > dependency. But there's no point taking that unless the SoC bits > also go in for 6.7. > > Li Yang, what are your expectations WRT merging this series? I think it is too late for SoC stuff for 6.7. I picked up binding patches 6, 7, and 8 because 6 and 7 are the same as an additionalProperties fix I have in my tree. As 8 depends on them, I just picked it up too. Rob
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Mon, Oct 30, 2023, Sean Christopherson wrote: > On Mon, Oct 30, 2023, Chao Gao wrote: > > On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote: > > >From: Chao Peng > > > > > >In confidential computing usages, whether a page is private or shared is > > >necessary information for KVM to perform operations like page fault > > >handling, page zapping etc. There are other potential use cases for > > >per-page memory attributes, e.g. to make memory read-only (or no-exec, > > >or exec-only, etc.) without having to modify memslots. > > > > > >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow > > >userspace to operate on the per-page memory attributes. > > > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to > > >a guest memory range. > > > > > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported > > >memory attributes. > > > > This ioctl() is already removed. So, the changelog is out-of-date and needs > > an update. > > Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below. > > > >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES > > >+:Architectures: x86 > > >+:Type: vm ioctl > > >+:Parameters: struct kvm_memory_attributes(in) > > > >^ add one space here? > > Ah, yeah, that does appear to be the standard. > > > > > > >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm, > > >+struct kvm_gfn_range *range) > > >+{ > > >+ /* > > >+ * Unconditionally add the range to the invalidation set, regardless of > > >+ * whether or not the arch callback actually needs to zap SPTEs. E.g. > > >+ * if KVM supports RWX attributes in the future and the attributes are > > >+ * going from R=>RW, zapping isn't strictly necessary. Unconditionally > > >+ * adding the range allows KVM to require that MMU invalidations add at > > >+ * least one range between begin() and end(), e.g. allows KVM to detect > > >+ * bugs where the add() is missed. Rexlaing the rule *might* be safe, > > > > Relaxing > > > > >@@ -4640,6 +4850,17 @@ static int > > >kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > > > case KVM_CAP_BINARY_STATS_FD: > > > case KVM_CAP_SYSTEM_EVENT_DATA: > > > return 1; > > >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES > > >+ case KVM_CAP_MEMORY_ATTRIBUTES: > > >+ u64 attrs = kvm_supported_mem_attributes(kvm); > > >+ > > >+ r = -EFAULT; > > >+ if (copy_to_user(argp, &attrs, sizeof(attrs))) > > >+ goto out; > > >+ r = 0; > > >+ break; > > > > This cannot work, e.g., no @argp in this function and is fixed by a later > > commit: > > > > fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for > > guest-specific backing memory") > > I'll post a fixup patch for all of these, thanks much! Heh, that was an -ENOCOFFEE. Fixup patches for a changelog goof and an ephemeral bug are going to be hard to post. Paolo, do you want to take care of all of these fixups and typos, or would you prefer that I start a v14 branch and then hand it off to you at some point?
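
For reference, a sketch of how userspace would drive the surviving ioctl,
assuming the uAPI as it stands in this series (the struct layout and the
KVM_MEMORY_ATTRIBUTE_PRIVATE value may still change before merge):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Convert a page-aligned GPA range to private. */
	static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size)
	{
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = size,	/* must be page-aligned */
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
			.flags      = 0,	/* reserved, must be zero */
		};

		return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}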
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On Mon, Oct 30, 2023, Paolo Bonzini wrote: > On 10/27/23 20:21, Sean Christopherson wrote: > > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t > > __kvm_handle_hva_range(struct kvm *kvm, > > * the second or later invocation of the handler). > > */ > > gfn_range.arg = range->arg; > > + > > + /* > > +* HVA-based notifications aren't relevant to private > > +* mappings as they don't have a userspace mapping. > > It's confusing who "they" is. Maybe > >* HVA-based notifications provide a userspace address, >* and as such are only relevant for shared mappings. Works for me.
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > >
> > > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > +		size = sizeof(struct kvm_userspace_memory_region);
> >
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> > access of the commit message becomes a kernel stack read.
>
> Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.
>
> > Probably worth adding a check on valid flags here.
>
> Definitely needed. There's a very real bug here. But rather than duplicate
> flags checking or plumb @ioctl all the way to __kvm_set_memory_region(), now
> that we have the fancy guard(mutex) and there are no internal calls to
> kvm_set_memory_region(), what if we:
>
>  1. Acquire/release slots_lock in __kvm_set_memory_region()

Gah, this won't work with x86's usage, which acquires slots_lock quite far
away from __kvm_set_memory_region() in several places. The rest of the idea
still works, kvm_vm_ioctl_set_memory_region() will just need to take
slots_lock manually.

>  2. Call kvm_set_memory_region() from x86 code for the internal memslots
>  3. Disallow *any* flags for internal memslots
>  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
>     only for KVM_SET_USER_MEMORY_REGION2
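
A hedged sketch of the adjusted shape being described, assuming the scoped
guard() helper from <linux/cleanup.h>; the rest of the handler is elided
and this is not the final patch:

	static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
				struct kvm_userspace_memory_region2 *mem)
	{
		if (mem->slot >= KVM_USER_MEM_SLOTS)
			return -EINVAL;

		/* slots_lock is dropped automatically on every return path. */
		guard(mutex)(&kvm->slots_lock);

		return __kvm_set_memory_region(kvm, mem);
	}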
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/30/23 21:25, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > >
> > > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > +		size = sizeof(struct kvm_userspace_memory_region);
> >
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the
> > out-of-bounds access of the commit message becomes a kernel stack read.
>
> Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.

Either is okay, because unless the flags check is screwed up it should not
matter. The memset is actually unnecessary, though it may be a good idea
anyway to keep it, aka belt-and-suspenders.

> > Probably worth adding a check on valid flags here.
>
> Definitely needed. There's a very real bug here. But rather than duplicate
> flags checking or plumb @ioctl all the way to __kvm_set_memory_region(),
> now that we have the fancy guard(mutex) and there are no internal calls to
> kvm_set_memory_region(), what if we:
>
>  1. Acquire/release slots_lock in __kvm_set_memory_region()
>  2. Call kvm_set_memory_region() from x86 code for the internal memslots
>  3. Disallow *any* flags for internal memslots
>  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()

I dislike this step: there is a clear point where all paths meet
(ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
I think that's the place where flags should be checked. (I don't mind the
restriction on internal memslots; it's just that to me it's not a
particularly natural way to structure the checks.)

On the other hand, the place to protect from out-of-bounds accesses is the
place where you stop caring about struct kvm_userspace_memory_region vs.
kvm_userspace_memory_region2 (and your code gets it right, by dropping
"ioctl" as soon as possible).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 87f45aa91ced..fe5a2af14fff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 	return true;
 }
 
+/*
+ * Flags that do not access any of the extra space of struct
+ * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
+ * only allows these.
+ */
+#define KVM_SET_USER_MEMORY_REGION_FLAGS \
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
 {
@@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
 		struct kvm_userspace_memory_region2 mem;
 		unsigned long size;
 
-		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+			/*
+			 * Fields beyond struct kvm_userspace_memory_region shouldn't
+			 * be accessed, but avoid leaking kernel memory in case of a bug.
+			 */
+			memset(&mem, 0, sizeof(mem));
 			size = sizeof(struct kvm_userspace_memory_region);
-		else
+		} else {
 			size = sizeof(struct kvm_userspace_memory_region2);
+		}
 
 		/* Ensure the common parts of the two structs are identical. */
 		SANITY_CHECK_MEM_REGION_FIELD(slot);
@@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
+		r = -EINVAL;
+		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
+			goto out;
+
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}

That's a kind of patch that you can't really get wrong (though I have the
brown paper bag ready).

Maintenance-wise it's fine, since flags are being added at a pace of
roughly one every five years, and anyway it's also future proof: I placed
the #define near check_memory_region_flags so that in five years we
remember to keep it up to date. But worst case, the new flags will only be
allowed by KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security
issues waiting to bite us.

In sum, this is exactly the only kind of fix that should be in the v13->v14
delta.

Paolo
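
To make the userspace-visible effect concrete, a hypothetical caller of the
two ioctls (field values are illustrative; host_mem and host_mem2 are
placeholder host mappings):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Legacy ioctl: only flags that don't touch the new fields pass. */
	struct kvm_userspace_memory_region old_r = {
		.slot            = 0,
		.flags           = KVM_MEM_LOG_DIRTY_PAGES,	/* allowed */
		.guest_phys_addr = 0x0,
		.memory_size     = 0x200000,
		.userspace_addr  = (uint64_t)host_mem,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &old_r);

	/* KVM_MEM_PRIVATE implies reading fields beyond the legacy struct,
	 * so with the fix above it is rejected on the legacy ioctl and only
	 * accepted via KVM_SET_USER_MEMORY_REGION2. */
	struct kvm_userspace_memory_region2 new_r = {
		.slot            = 1,
		.flags           = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0x200000,
		.memory_size     = 0x200000,
		.userspace_addr  = (uint64_t)host_mem2,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &new_r);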
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Tue, Oct 31, 2023, Paolo Bonzini wrote:
> On 10/30/23 21:25, Sean Christopherson wrote:
> > > Probably worth adding a check on valid flags here.
> >
> > Definitely needed. There's a very real bug here. But rather than
> > duplicate flags checking or plumb @ioctl all the way to
> > __kvm_set_memory_region(), now that we have the fancy guard(mutex) and
> > there are no internal calls to kvm_set_memory_region(), what if we:
> >
> >    1. Acquire/release slots_lock in __kvm_set_memory_region()
> >    2. Call kvm_set_memory_region() from x86 code for the internal memslots
> >    3. Disallow *any* flags for internal memslots
> >    4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>
> I dislike this step: there is a clear point where all paths meet
> (ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
> I think that's the place where flags should be checked. (I don't mind
> the restriction on internal memslots; it's just that to me it's not a
> particularly natural way to structure the checks.)

Yeah, I just don't like the discrepancy it causes where some flags are
explicitly checked and allowed, while others are allowed and then later
disallowed.

> On the other hand, the place to protect from out-of-bounds accesses is
> the place where you stop caring about struct kvm_userspace_memory_region
> vs. kvm_userspace_memory_region2 (and your code gets it right, by
> dropping "ioctl" as soon as possible).
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 87f45aa91ced..fe5a2af14fff 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
>  	return true;
>  }
>  
> +/*
> + * Flags that do not access any of the extra space of struct
> + * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
> + * only allows these.
> + */
> +#define KVM_SET_USER_MEMORY_REGION_FLAGS \

Can we name this KVM_SET_USER_MEMORY_REGION_LEGACY_FLAGS, or something
equally horrific? As is, this sounds way too much like a generic "allowed
flags for any memory region".

Or maybe invert the macro? I.e. something to make it more obvious that it's
effectively a versioning check, not a generic "what's supported?" check.

	#define KVM_SET_USER_MEMORY_FLAGS_V2_ONLY \
		(~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))

> +	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
> +
>  static int check_memory_region_flags(struct kvm *kvm,
>  				     const struct kvm_userspace_memory_region2 *mem)
>  {
> @@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
>  		struct kvm_userspace_memory_region2 mem;
>  		unsigned long size;
>  
> -		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
> +			/*
> +			 * Fields beyond struct kvm_userspace_memory_region shouldn't
> +			 * be accessed, but avoid leaking kernel memory in case of a bug.
> +			 */
> +			memset(&mem, 0, sizeof(mem));
>  			size = sizeof(struct kvm_userspace_memory_region);
> -		else
> +		} else {
>  			size = sizeof(struct kvm_userspace_memory_region2);
> +		}
>  
>  		/* Ensure the common parts of the two structs are identical. */
>  		SANITY_CHECK_MEM_REGION_FIELD(slot);
> @@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
>  		if (copy_from_user(&mem, argp, size))
>  			goto out;
>  
> +		r = -EINVAL;
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
> +		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
> +			goto out;
> +
>  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
>
> That's a kind of patch that you can't really get wrong (though I have
> the brown paper bag ready).
>
> Maintenance-wise it's fine, since flags are being added at a pace of
> roughly one every five years,

Heh, true.

> and anyway it's also future proof: I placed the #define near
> check_memory_region_flags so that in five years we remember to keep it up
> to date. But worst case, the new flags will only be allowed by
> KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
> waiting to bite us.
>
> In sum, this is exactly the only kind of fix that should be in the
> v13->v14 delta.

Boiling the ocean can be fun too ;-)
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/28/2023 2:21 AM, Sean Christopherson wrote:

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail. The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes
detection of bad flags a bit more robust, e.g. if the new fd field is
guarded only by a flag and not a new ioctl(), then a userspace bug (setting
a "bad" flag) would generate out-of-bounds access instead of an -EINVAL
error.

Cc: Jarkko Sakkinen
Reviewed-by: Paolo Bonzini
Reviewed-by: Xiaoyao Li
Signed-off-by: Sean Christopherson
---
 Documentation/virt/kvm/api.rst | 21 +++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       |  4 ++--
 include/uapi/linux/kvm.h       | 13 
 virt/kvm/kvm_main.c            | 38 +++---
 5 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..ace984acc125 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */

missing __u64 pad[16];

+  };
+
+See KVM_SET_USER_MEMORY_REGION.
+
 5. The kvm_run structure
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..6409914428ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO, 0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+					 struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e7
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 10/28/2023 2:21 AM, Sean Christopherson wrote:

...

+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_PRIVATE in
+flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6087,9 +6096,24 @@ applied.
 	__u32 slot;
 	__u32 flags;
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u64 guest_memfd_offset;

missing a tab

+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
   };
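
Putting the two new ioctls together, a hypothetical userspace flow (names
taken from this series' proposed uAPI; the sizes, addresses and
shared_backing mapping below are placeholders, and the uAPI may still
change before merge):

	/* Create the guest-private backing file... */
	struct kvm_create_guest_memfd gmem = {
		.size  = 0x400000,	/* must be page-aligned */
		.flags = 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* ...and bind it to a memslot via the v2 ioctl. */
	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_PRIVATE,
		.guest_phys_addr    = 0x100000,
		.memory_size        = gmem.size,
		.userspace_addr     = (uint64_t)shared_backing, /* shared pages */
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0, /* page-aligned; range must fit the file */
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);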
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
>+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>+{
>+	loff_t size = args->size;
>+	u64 flags = args->flags;
>+	u64 valid_flags = 0;
>+
>+	if (flags & ~valid_flags)
>+		return -EINVAL;
>+
>+	if (size < 0 || !PAGE_ALIGNED(size))
>+		return -EINVAL;

is size == 0 a valid case?

>+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>+		  unsigned int fd, loff_t offset)
>+{
>+	loff_t size = slot->npages << PAGE_SHIFT;
>+	unsigned long start, end;
>+	struct kvm_gmem *gmem;
>+	struct inode *inode;
>+	struct file *file;
>+
>+	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>+
>+	file = fget(fd);
>+	if (!file)
>+		return -EBADF;
>+
>+	if (file->f_op != &kvm_gmem_fops)
>+		goto err;
>+
>+	gmem = file->private_data;
>+	if (gmem->kvm != kvm)
>+		goto err;
>+
>+	inode = file_inode(file);
>+
>+	if (offset < 0 || !PAGE_ALIGNED(offset))
>+		return -EINVAL;

goto err; or hoist the check above fget().

>+
>+	if (offset + size > i_size_read(inode))
>+		goto err;
>+
>+	filemap_invalidate_lock(inode->i_mapping);
>+
>+	start = offset >> PAGE_SHIFT;
>+	end = start + slot->npages;
>+
>+	if (!xa_empty(&gmem->bindings) &&
>+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
>+		filemap_invalidate_unlock(inode->i_mapping);
>+		goto err;
>+	}
>+
>+	/*
>+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed
>+	 * to see either a NULL file or this new file, no need for them to go
>+	 * away.
>+	 */
>+	rcu_assign_pointer(slot->gmem.file, file);
>+	slot->gmem.pgoff = start;
>+
>+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
>+	filemap_invalidate_unlock(inode->i_mapping);
>+
>+	/*
>+	 * Drop the reference to the file, even on success.  The file pins
>+	 * KVM, not the other way 'round.  Active bindings are invalidated if
>+	 * the file is closed before memslots are destroyed.
>+	 */
>+	fput(file);
>+	return 0;
>+
>+err:
>+	fput(file);
>+	return -EINVAL;

The cleanup, i.e., filemap_invalidate_unlock() and fput(), is common. So, I
think it may be slightly better to consolidate the common part, e.g.,

	int ret = -EINVAL;

	...

	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
	ret = 0;
unlock:
	filemap_invalidate_unlock(inode->i_mapping);
err:
	/*
	 * Drop the reference to the file, even on success.  The file pins
	 * KVM, not the other way 'round.  Active bindings are invalidated if
	 * the file is closed before memslots are destroyed.
	 */
	fput(file);
	return ret;