Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
On Thu, Oct 26, 2023 at 11:24:39AM +0100, Will Deacon wrote:
> On Thu, Oct 26, 2023 at 11:58:00AM +0300, Mike Rapoport wrote:
> > On Mon, Oct 23, 2023 at 06:14:20PM +0100, Will Deacon wrote:
> > > On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> > > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > > > index dd851297596e..cd6320de1c54 100644
> > > > --- a/arch/arm64/kernel/module.c
> > > > +++ b/arch/arm64/kernel/module.c

...

> > > > -	if (module_direct_base) {
> > > > -		p = __vmalloc_node_range(size, MODULE_ALIGN,
> > > > -					 module_direct_base,
> > > > -					 module_direct_base + SZ_128M,
> > > > -					 GFP_KERNEL | __GFP_NOWARN,
> > > > -					 PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > > -					 __builtin_return_address(0));
> > > > -	}
> > > > +	module_init_limits();
> > >
> > > Hmm, this used to be run from subsys_initcall(), but now you're running
> > > it _really_ early, before random_init(), so randomization of the module
> > > space is no longer going to be very random if we don't have early entropy
> > > from the firmware or the CPU, which is likely to be the case on most SoCs.
> >
> > Well, it will be as random as KASLR. Won't that be enough?
>
> I don't think that's true -- we have the 'kaslr-seed' property for KASLR,
> but I'm not seeing anything like that for the module randomisation and I
> also don't see why we need to set these limits so early.

x86 needs execmem initialized before ftrace_init() so I thought it would
be best to setup execmem along with most of MM in mm_core_init(). I'll
move execmem initialization for !x86 to a later point, say core_initcall.

> Will

--
Sincerely yours,
Mike.
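[Editor's note: a minimal sketch of the deferral Mike describes, assuming the
arm64 entry point stays module_init_limits(); the wrapper name
execmem_late_init() is hypothetical, not taken from the series.]

/*
 * Hypothetical sketch, not the actual patch: defer module area
 * randomization on !x86 to core_initcall() so it runs after
 * random_init() and can observe real entropy.
 */
static int __init execmem_late_init(void)
{
	module_init_limits();
	return 0;
}
core_initcall(execmem_late_init);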
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
Hi Arnd!

There is some non-x86 hardware like the Amiga that still uses PCMCIA-style
networking cards on machines like the A600 and A1200. So, unless these
drivers are actually causing problems, I would rather not see them go yet.

Thanks,
Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH v12 6/6] powerpc: add crash memory hotplug support
Hello,

On 30/10/23 06:13, kernel test robot wrote:

Hi Sourabh,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.6-rc7]
[cannot apply to powerpc/next powerpc/fixes tip/x86/core next-20231027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Sourabh-Jain/crash-forward-memory_notify-arg-to-arch-crash-hotplug-handler/20231029-213858
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231029124039.6158-7-sourabhjain%40linux.ibm.com
patch subject: [PATCH v12 6/6] powerpc: add crash memory hotplug support
config: powerpc64-randconfig-001-20231029 (https://download.01.org/0day-ci/archive/20231030/202310300812.yn6fupcz-...@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231030/202310300812.yn6fupcz-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version
of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202310300812.yn6fupcz-...@intel.com/

All errors (new ones prefixed by >>):

   powerpc64-linux-ld: warning: discarding dynamic section .glink
   powerpc64-linux-ld: warning: discarding dynamic section .plt
   powerpc64-linux-ld: linkage table error against `add_opal_mem_range'
   powerpc64-linux-ld: stubs don't match calculated size
   powerpc64-linux-ld: can not build stubs: bad value
   powerpc64-linux-ld: arch/powerpc/kexec/core_64.o: in function `get_crash_memory_ranges':
   core_64.c:(.text+0x1010): undefined reference to `add_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1064): undefined reference to `sort_memory_ranges'
   powerpc64-linux-ld: core_64.c:(.text+0x1128): undefined reference to `add_rtas_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1140): undefined reference to `add_opal_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1160): undefined reference to `add_mem_range'
   powerpc64-linux-ld: core_64.c:(.text+0x1188): undefined reference to `sort_memory_ranges'
   powerpc64-linux-ld: core_64.c:(.text+0x11e0): undefined reference to `realloc_mem_ranges'
   powerpc64-linux-ld: arch/powerpc/kexec/core_64.o: in function `arch_crash_handle_hotplug_event':
   core_64.c:(.text+0x1900): undefined reference to `remove_mem_range'

The patch series uses a couple of functions defined in
arch/powerpc/kexec/ranges.c. Since that file is only built when
CONFIG_KEXEC_FILE is set, the link fails when that option is disabled;
with CONFIG_KEXEC_FILE set, the issue is not observed.

I will fix the above errors and post the next version.

Thanks,
Sourabh
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
>From: Chao Peng
>
>In confidential computing usages, whether a page is private or shared is
>necessary information for KVM to perform operations like page fault
>handling, page zapping etc. There are other potential use cases for
>per-page memory attributes, e.g. to make memory read-only (or no-exec,
>or exec-only, etc.) without having to modify memslots.
>
>Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
>userspace to operate on the per-page memory attributes.
>  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>    a guest memory range.
>  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>    memory attributes.

This ioctl() is already removed. So, the changelog is out-of-date and
needs an update.

>
>+
>+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
>+:Architectures: x86
>+:Type: vm ioctl
>+:Parameters: struct kvm_memory_attributes(in)

                                            ^ add one space here?

>+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
>+					  struct kvm_gfn_range *range)
>+{
>+	/*
>+	 * Unconditionally add the range to the invalidation set, regardless of
>+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
>+	 * if KVM supports RWX attributes in the future and the attributes are
>+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
>+	 * adding the range allows KVM to require that MMU invalidations add at
>+	 * least one range between begin() and end(), e.g. allows KVM to detect
>+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,

Relaxing

>@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> 	case KVM_CAP_BINARY_STATS_FD:
> 	case KVM_CAP_SYSTEM_EVENT_DATA:
> 		return 1;
>+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>+	case KVM_CAP_MEMORY_ATTRIBUTES:
>+		u64 attrs = kvm_supported_mem_attributes(kvm);
>+
>+		r = -EFAULT;
>+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
>+			goto out;
>+		r = 0;
>+		break;

This cannot work, e.g., no @argp in this function and is fixed by a later
commit:

 fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
On Mon, Oct 30, 2023, at 08:19, John Paul Adrian Glaubitz wrote:
> Hi Arnd!
>
> There is some non-x86 hardware like the Amiga that still uses
> PCMCIA-style networking cards on machines like the A600 and A1200.
> So, unless these drivers are actually causing problems, I would
> rather not see them go yet.

Do you know of any systems other than the Amiga that are still in use
with new kernels and that rely on PCMCIA networking?

I know that Amiga has its own simple CONFIG_AMIGA_PCMCIA implementation
that is incompatible with CONFIG_PCMCIA device drivers, so it is not
affected by this. For the few ARM systems that still support a CF card
slot, these tend to be used for IDE type storage cards because their
internal flash is too limited otherwise.

     Arnd
Re: [RFC PATCH v8 00/13] Add audio support in v4l2 framework
On Mon, Oct 30, 2023 at 3:56 AM Shengjiu Wang wrote:
>
> On Fri, Oct 27, 2023 at 7:18 PM Hans Verkuil wrote:
> >
> > Hi Shengjiu,
> >
> > Is there a reason why this series is still marked RFC?
> >
> > Just wondering about that.
>
> In the very beginning I started this series with RFC, So
> I still use RFC now...

I think we are way past the RFC stage, so if there will be another
version, it is better to remove the RFC tag. RFC means that a patch
series is not yet ready to be merged and is still in the early stages
of development.

thanks,
Daniel.
Re: [PATCH v7 1/5] powerpc/code-patching: introduce patch_instructions()
Hari Bathini writes:

> patch_instruction() entails setting up pte, patching the instruction,
> clearing the pte and flushing the tlb. If multiple instructions need
> to be patched, every instruction would have to go through the above
> drill unnecessarily. Instead, introduce patch_instructions() function
> that sets up the pte, clears the pte and flushes the tlb only once
> per page range of instructions to be patched. Duplicate most of the
> patch_instruction() code instead of merging with it, to avoid the
> performance degradation observed on ppc32, for patch_instruction(),
> with the code path merged. Also, setup poking_init() always as BPF
> expects poking_init() to be setup even when STRICT_KERNEL_RWX is off.
>
> Signed-off-by: Hari Bathini
> Acked-by: Song Liu

A lot of this is a duplicate of patch_instruction(). Can we consolidate
things between them?

> ---
>
> Changes in v7:
> * Fixed crash observed with !STRICT_RWX.
>
>  arch/powerpc/include/asm/code-patching.h |   1 +
>  arch/powerpc/lib/code-patching.c         | 141 ++++++++++++++++++++++-
>  2 files changed, 139 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 3f881548fb61..0e29ccf903d0 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -74,6 +74,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
>  int patch_branch(u32 *addr, unsigned long target, int flags);
>  int patch_instruction(u32 *addr, ppc_inst_t instr);
>  int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
> +int patch_instructions(u32 *addr, u32 *code, size_t len, bool repeat_instr);
>
>  static inline unsigned long patch_site_addr(s32 *site)
>  {
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index b00112d7ad46..e1c1fd9246d8 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -204,9 +204,6 @@ void __init poking_init(void)
>  {
>  	int ret;
>
> -	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
> -		return;
> -
>  	if (mm_patch_enabled())
>  		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
>  					"powerpc/text_poke_mm:online",
> @@ -378,6 +375,144 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)
>  }
>  NOKPROBE_SYMBOL(patch_instruction);
>
> +static int __patch_instructions(u32 *patch_addr, u32 *code, size_t len, bool repeat_instr)
> +{
> +	unsigned long start = (unsigned long)patch_addr;
> +
> +	/* Repeat instruction */
> +	if (repeat_instr) {
> +		ppc_inst_t instr = ppc_inst_read(code);
> +
> +		if (ppc_inst_prefixed(instr)) {
> +			u64 val = ppc_inst_as_ulong(instr);
> +
> +			memset64((u64 *)patch_addr, val, len / 8);
> +		} else {
> +			u32 val = ppc_inst_val(instr);
> +
> +			memset32(patch_addr, val, len / 4);
> +		}
> +	} else {
> +		memcpy(patch_addr, code, len);
> +	}
> +
> +	smp_wmb();	/* smp write barrier */
> +	flush_icache_range(start, start + len);
> +	return 0;
> +}
> +
> +/*
> + * A page is mapped and instructions that fit the page are patched.
> + * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
> + */
> +static int __do_patch_instructions_mm(u32 *addr, u32 *code, size_t len, bool repeat_instr)
> +{
> +	struct mm_struct *patching_mm, *orig_mm;
> +	unsigned long pfn = get_patch_pfn(addr);
> +	unsigned long text_poke_addr;
> +	spinlock_t *ptl;
> +	u32 *patch_addr;
> +	pte_t *pte;
> +	int err;
> +
> +	patching_mm = __this_cpu_read(cpu_patching_context.mm);
> +	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
> +	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
> +
> +	pte = get_locked_pte(patching_mm, text_poke_addr, &ptl);
> +	if (!pte)
> +		return -ENOMEM;
> +
> +	__set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
> +
> +	/* order PTE update before use, also serves as the hwsync */
> +	asm volatile("ptesync" ::: "memory");
> +
> +	/* order context switch after arbitrary prior code */
> +	isync();
> +
> +	orig_mm = start_using_temp_mm(patching_mm);
> +
> +	err = __patch_instructions(patch_addr, code, len, repeat_instr);
> +
> +	/* context synchronisation performed by __patch_instructions */
> +	stop_using_temp_mm(patching_mm, orig_mm);
> +
> +	pte_clear(patching_mm, text_poke_addr, pte);
> +	/*
> +	 * ptesync to order PTE update before TLB invalidation done
> +	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
> +	 */
> +	local_flush_tlb_page_psize(patching_mm, text_poke_addr, mmu_virtual_psize);
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	return err;
> +}
> +
> +/*
> + *
Re: [PATCH 02/12] powerpc/pseries: Restructure hvc_get_chars() endianness
Benjamin Gray writes:

> Sparse reports an endian mismatch in hvc_get_chars().
>
> At first it seemed like the retbuf should be __be64[], but actually
> retbuf holds serialized registers returned by the hypervisor call, so
> it's correctly CPU endian typed.
>
> Instead, it is the be64_to_cpu() that's misleading. The intent is to do
> a byte swap on a little endian CPU. The swap is required because we
> wanted to store the register values to memory without 'swapping' bytes,
> so that the high order byte of the first register is the first byte
> in the result buffer.
>
> In diagram form, on a LE CPU with the letters representing the return
> string bytes:
>
> (register bytes)     A B C D E F G H  I J K L M N O P
> (retbuf mem bytes)   H G F E D C B A  P O N M L K J I
> (buf/lbuf mem bytes) A B C D E F G H  I J K L M N O P
>
> So retbuf stores the registers in CPU endian, and buf always stores in
> big endian.
>
> So replace the byte swap function with cpu_to_be64() and cast lbuf as an
> array of __be64 to match the semantics closer.
>
> Signed-off-by: Benjamin Gray
> ---
>  arch/powerpc/platforms/pseries/hvconsole.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hvconsole.c b/arch/powerpc/platforms/pseries/hvconsole.c
> index 1ac52963e08b..647718a15e78 100644
> --- a/arch/powerpc/platforms/pseries/hvconsole.c
> +++ b/arch/powerpc/platforms/pseries/hvconsole.c
> @@ -29,11 +29,11 @@ int hvc_get_chars(uint32_t vtermno, char *buf, int count)
>  {
>  	long ret;
>  	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> -	unsigned long *lbuf = (unsigned long *)buf;
> +	__be64 *lbuf = (__be64 __force *)buf;
>
>  	ret = plpar_hcall(H_GET_TERM_CHAR, retbuf, vtermno);
> -	lbuf[0] = be64_to_cpu(retbuf[1]);
> -	lbuf[1] = be64_to_cpu(retbuf[2]);
> +	lbuf[0] = cpu_to_be64(retbuf[1]);
> +	lbuf[1] = cpu_to_be64(retbuf[2]);
>
>  	if (ret == H_SUCCESS)
>  		return retbuf[0];
>

There is no functional change in this patch. It clarifies that buf is
expected to be in big-endian format while retbuf contains native-endian
data. Not sure why this was not picked up.

Reviewed-by: Aneesh Kumar K.V
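[Editor's note: for readers who want to see the diagram in action, a small
userspace illustration follows; it assumes a little-endian host and uses the
compiler builtin in place of the kernel's cpu_to_be64(). It is not code from
the patch.]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t reg = 0x4142434445464748ULL;	/* "ABCDEFGH" held in a register */
	uint64_t be = __builtin_bswap64(reg);	/* on LE, swap == cpu_to_be64() */
	unsigned char *p = (unsigned char *)&be;

	/* Prints "ABCDEFGH": the high-order register byte lands first. */
	for (int i = 0; i < 8; i++)
		putchar(p[i]);
	putchar('\n');
	return 0;
}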
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Mon, Oct 30, 2023, Chao Gao wrote:
> On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> >From: Chao Peng
> >
> >In confidential computing usages, whether a page is private or shared is
> >necessary information for KVM to perform operations like page fault
> >handling, page zapping etc. There are other potential use cases for
> >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> >or exec-only, etc.) without having to modify memslots.
> >
> >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> >userspace to operate on the per-page memory attributes.
> >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >    a guest memory range.
> >
> >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >    memory attributes.
>
> This ioctl() is already removed. So, the changelog is out-of-date and needs
> an update.

Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.

> >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> >+:Architectures: x86
> >+:Type: vm ioctl
> >+:Parameters: struct kvm_memory_attributes(in)
>
>                                             ^ add one space here?

Ah, yeah, that does appear to be the standard.

> >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> >+					  struct kvm_gfn_range *range)
> >+{
> >+	/*
> >+	 * Unconditionally add the range to the invalidation set, regardless of
> >+	 * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> >+	 * if KVM supports RWX attributes in the future and the attributes are
> >+	 * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> >+	 * adding the range allows KVM to require that MMU invalidations add at
> >+	 * least one range between begin() and end(), e.g. allows KVM to detect
> >+	 * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
>
> Relaxing
>
> >@@ -4640,6 +4850,17 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > 	case KVM_CAP_BINARY_STATS_FD:
> > 	case KVM_CAP_SYSTEM_EVENT_DATA:
> > 		return 1;
> >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >+	case KVM_CAP_MEMORY_ATTRIBUTES:
> >+		u64 attrs = kvm_supported_mem_attributes(kvm);
> >+
> >+		r = -EFAULT;
> >+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> >+			goto out;
> >+		r = 0;
> >+		break;
>
> This cannot work, e.g., no @argp in this function and is fixed by a later
> commit:
>
>  fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")

I'll post a fixup patch for all of these, thanks much!
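[Editor's note: one plausible shape for that fixup, as an illustration only
(an assumption, not quoted from the later commit): since
kvm_vm_ioctl_check_extension_generic() already returns its result to
userspace directly, the supported-attributes mask can simply be returned
without copy_to_user().]

#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
	case KVM_CAP_MEMORY_ATTRIBUTES:
		/* Return the mask directly; no @argp is needed here. */
		return kvm_supported_mem_attributes(kvm);
#endif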
Re: Several kmemleak reports + "refcount_t: underflow; use-after-free" at boot when OF_UNITTEST + OF_OVERLAY is set (Kernel v6.6-rc6, PowerMac G5 11,2)
On Wed, Oct 18, 2023 at 4:38 PM Erhard Furtner wrote:
>
> Greetings!
>
> Getting this at every boot on my G5 with kernel v6.6-rc6 with OF_UNITTEST
> and OF_OVERLAY selected:
>
> [...]
> ### dt-test ### EXPECT \ : OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### pass of_unittest_lifecycle():3189
> OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### EXPECT / : OF: ERROR: of_node_release() detected bad of_node_put() on /testcase-data/refcount-node
> ### dt-test ### EXPECT \ : [ cut here ]
> ### dt-test ### EXPECT \ : WARNING: <>
> ### dt-test ### EXPECT \ : refcount_t: underflow; use-after-free.
> ### dt-test ### EXPECT \ :

The test tells you to expect a use-after-free...

> ---[ end trace <> ]---
> ### dt-test ### pass of_unittest_lifecycle():3209
> [ cut here ]
> refcount_t: underflow; use-after-free.

Then you get a use-after-free. Looks like it is working as designed. I
believe it's the same with kmemleak.

Note that running DT unittests also taints the kernel. That's because
they are not meant to be run on a production system.

Rob
Re: [PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative
On 10/27/23 20:21, Sean Christopherson wrote:

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel. A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings. But it's far from
guaranteed that an end() without a start() is fatal or even problematic
to anything other than the target VM, e.g. the underlying bug could simply
be a duplicate call to end(). And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Reviewed-by: Paolo Bonzini

Signed-off-by: Sean Christopherson
---
  virt/kvm/kvm_main.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
  	 * in conjunction with the smp_rmb in mmu_invalidate_retry().
  	 */
  	kvm->mmu_invalidate_in_progress--;
+	KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
  }
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
  	 */
  	if (wake)
  		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-	BUG_ON(kvm->mmu_invalidate_in_progress < 0);
  }
  
  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng

Currently in mmu_notifier invalidate path, hva range is recorded and
then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the to be introduced private memory, a page
fault may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction
On 10/27/23 20:21, Sean Christopherson wrote:

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call
and the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as
the bug that *almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
On 10/27/23 20:21, Sean Christopherson wrote:

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path. All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson
---
  arch/powerpc/kvm/powerpc.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
  		break;
  #endif
  	case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+		BUILD_BUG();
+#endif
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
  		r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-		r = 1;
  #else
-		r = 0;
+		r = 1;
  #endif
  		break;
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
On 10/27/23 20:21, Sean Christopherson wrote:

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior. Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared. PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

The new #define should also imply KVM_CAP_SYNC_MMU, or even:
KVM_CAP_SYNC_MMU should just be enabled by all architectures at this
point. You don't need to care about it, I have a larger series for caps
that are enabled by all architectures and I'll post it for 6.8.

Reviewed-by: Paolo Bonzini

Paolo
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/27/23 20:21, Sean Christopherson wrote:

+	if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		size = sizeof(struct kvm_userspace_memory_region);

This also needs a memset(&mem, 0, sizeof(mem)), otherwise the
out-of-bounds access of the commit message becomes a kernel stack read.

Probably worth adding a check on valid flags here.

Paolo

+	else
+		size = sizeof(struct kvm_userspace_memory_region2);
+
+	/* Ensure the common parts of the two structs are identical. */
+	SANITY_CHECK_MEM_REGION_FIELD(slot);
+	SANITY_CHECK_MEM_REGION_FIELD(flags);
+	SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+	SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+	SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> From: Chao Peng
>
> Currently in mmu_notifier invalidate path, hva range is recorded and
> then checked against by mmu_notifier_retry_hva() in the page fault
> handling path. However, for the to be introduced private memory, a page

Is there a missing word here?

> fault may not have a hva associated, checking gfn(gpa) makes more sense.
>
> For existing hva based shared memory, gfn is expected to also work. The
> only downside is when aliasing multiple gfns to a single hva, the
> current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
>
> Suggested-by: Sean Christopherson
> Cc: Xu Yilun
> Signed-off-by: Chao Peng
> Reviewed-by: Fuad Tabba
> Tested-by: Fuad Tabba
> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
> Signed-off-by: Sean Christopherson
> ---
>  arch/x86/kvm/mmu/mmu.c   | 10 ++++++----
>  arch/x86/kvm/vmx/vmx.c   | 11 +++++------
>  include/linux/kvm_host.h | 33 ++++++++++++++++++++-------
>  virt/kvm/kvm_main.c      | 41 +++++++++++++++++++++++-----------
>  4 files changed, 64 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f7901cb4d2fa..d33657d61d80 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
>   *
>   * There are several ways to safely use this helper:
>   *
> - * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
> + * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
>   *   consuming it.  In this case, mmu_lock doesn't need to be held during the
>   *   lookup, but it does need to be held while checking the MMU notifier.
>   *
> @@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6245,7 +6245,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6255,7 +6257,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, 0, -1ul);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..6e502ba93141 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	/*
> -	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> -	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> -	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> -	 * is used for the retry check.
> +	 * Explicitly grab the memslot using KVM's internal slot ID to ensure
> +	 * KVM doesn't unintentionally grab a userspace memslot.  It _should_
> +	 * be impossible for userspace to create a memslot for the APIC when
> +	 * APICv is enabled, but paranoia won't hurt in this case.
>  	 */
>  	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
>  	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> @@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	read_lock(&vcpu->kvm->mmu_lock);
> -	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> -				     gfn_to_hva_memslot(slot, gfn))) {
> +	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  		read_unlock(&vcpu->kvm->mmu_lock);
>  		goto out;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fb6c6109fdca..11d091688346 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -787,8 +787,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +
Re: [PATCH 00/10] Remove obsolete and orphaned wifi drivers
John Paul Adrian Glaubitz writes:

> There is some non-x86 hardware like the Amiga that still uses
> PCMCIA-style networking cards on machines like the A600 and A1200. So,
> unless these drivers are actually causing problems, I would rather not
> see them go yet.

There is a cost to maintaining these drivers, so I would like to see more
information in the report, at least:

* exact driver and hardware used
* last kernel version tested
* kernel log messages from a successful case

Here's a good example:

https://lore.kernel.org/linux-wireless/20231024014302.0a0b79b0@mir/

--
https://patchwork.kernel.org/project/linux-wireless/list/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 5:53 PM David Matlack wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
>
> Is there a missing word here?

No but there could be missing hyphens ("for the to-be-introduced
private memory"); possibly a "soon" could help parsing and that is
what you were talking about?

> > 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
>
> I don't think this is incorrect, but I was a little surprised to see this
> here rather than in end() when mmu_invalidate_in_progress decrements to
> 0.

I think that would be incorrect on the very first start?

> > +	}
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +	lockdep_assert_held_write(&kvm->mmu_lock);
>
> Does this compile/function on KVM architectures with
> !KVM_HAVE_MMU_RWLOCK?

Yes:

  #define lockdep_assert_held_write(l)	\
  	lockdep_assert(lockdep_is_held_type(l, 0))

where 0 is the lock-held type used by lock_acquire_exclusive. In turn,
that is what you get for a spinlock or mutex, in addition to a rwlock or
rwsem that is taken for write. Instead, lockdep_assert_held() asserts
that the lock is taken without asserting a particular lock-held type.

> > @@ -834,6 +851,12 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
>
> Let's add a lockdep_assert_held_write(&kvm->mmu_lock) here too while
> we're at it?

Yes, good idea.

Paolo
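[Editor's note: a tiny illustration of the distinction Paolo describes — a
sketch, not code from the series; demo_lock and demo() are made-up names.]

static DEFINE_SPINLOCK(demo_lock);

static void demo(void)
{
	spin_lock(&demo_lock);
	lockdep_assert_held(&demo_lock);	/* passes for any held type */
	lockdep_assert_held_write(&demo_lock);	/* passes: spinlocks are always
						 * acquired exclusively; only an
						 * rwlock held for *read* would
						 * trip this assertion */
	spin_unlock(&demo_lock);
}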
Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
On 10/27/23 20:21, Sean Christopherson wrote:

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook works,
but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found. Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Could have been split in two patches, but it's fine anyway.

Reviewed-by: Paolo Bonzini

Paolo
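[Editor's note: for reference, a sketch of the two-byte return tuple being
described; the field and type names here are illustrative guesses, not
necessarily those used by the patch.]

typedef struct kvm_mmu_notifier_return {
	bool ret;		/* the notifier-specific return value */
	bool found_memslot;	/* did the range overlap a memslot, i.e.
				 * was mmu_lock acquired? */
} kvm_mn_ret_t;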
Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
On 10/27/23 20:21, Sean Christopherson wrote:

Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed. Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing
so resulted in .on_lock() and .on_unlock() having divergent and
asymmetric behavior, and set future developers up for failure, i.e. all
but asked for bugs where KVM relied on using .on_unlock() to try to run
a callback while holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

This is what David suggested to do in patch 3, FWIW.

Reviewed-by: Paolo Bonzini

Paolo

Reported-by: Isaku Yamahata
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson
---
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On 10/27/23 20:21, Sean Christopherson wrote:

@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
  		 * the second or later invocation of the handler).
  		 */
  		gfn_range.arg = range->arg;
+
+		/*
+		 * HVA-based notifications aren't relevant to private
+		 * mappings as they don't have a userspace mapping.

It's confusing who "they" is. Maybe

	 * HVA-based notifications provide a userspace address,
	 * and as such are only relevant for shared mappings.

Paolo

+		 */
+		gfn_range.only_private = false;
+		gfn_range.only_shared = true;
  		gfn_range.may_block = range->may_block;
  
  		/*
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On 10/27/23 20:21, Sean Christopherson wrote:

From: Chao Peng

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear
to be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even
when the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial
usage is expected to always report single pages. It's entirely possible,
likely even, that KVM will someday support sub-page granularity faults,
e.g. Intel's sub-page protection feature allows for additional
protections at 128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy
Cc: David Matlack
Suggested-by: Sean Christopherson
Co-developed-by: Yu Zhang
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

---
  Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
  arch/x86/kvm/x86.c             |  1 +
  include/linux/kvm_host.h       | 11 ++++++++++
  include/uapi/linux/kvm.h       |  8 ++++++++
  4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
  spec refer, https://github.com/riscv/riscv-sbi-doc.
  
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
  ::
  
      /* KVM_EXIT_NOTIFY */
@@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause
  CPU stuck (due to event windows don't open up) and make the CPU unavailable to
  host or other VMs.
  
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the
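[Editor's note: a sketch of how userspace might consume the documented
contract; handle_memory_fault() is a hypothetical helper and the field naming
follows the prose's "kvm_run.memory_fault", not necessarily the final uAPI.]

/* run is the mmap'ed struct kvm_run for this vCPU. */
int ret = ioctl(vcpu_fd, KVM_RUN, 0);

if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
	/* The exit_reason is only trustworthy for these two errnos. */
	handle_memory_fault(run->memory_fault.gpa, run->memory_fault.size);
}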
Re: [PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
On 10/27/23 20:21, Sean Christopherson wrote:

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance. KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running. To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox
Co-developed-by: Vlastimil Babka
Signed-off-by: Vlastimil Babka
Signed-off-by: Sean Christopherson

I think this could even be "From: Vlastimil", but no biggie.

Paolo

---
  include/linux/pagemap.h | 19 ++++++++++++-
  mm/compaction.c         | 43 +++++++++++++++++++++-----------
  mm/migrate.c            |  2 ++
  3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
  	/* writeback related tags are not used */
  	AS_NO_WRITEBACK_TAGS = 5,
  	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 7,	/* Call ->release_folio(), even if no private data */
+	AS_UNMOVABLE	= 8,	/* The mapping cannot be moved, ever */
  };
  
  /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct address_space *mapping)
  	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
  }
  
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+	/*
+	 * It's expected unmovable mappings are also unevictable. Compaction
+	 * migrate scanner (isolate_migratepages_block()) relies on this to
+	 * reduce page locking.
+	 */
+	set_bit(AS_UNEVICTABLE, &mapping->flags);
+	set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+	return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
  {
  	return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  
  	/* Time to isolate some pages for migration */
  	for (; low_pfn < end_pfn; low_pfn++) {
+		bool is_dirty, is_unevictable;
  
  		if (skip_on_failure && low_pfn >= next_skip_pfn) {
  			/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  		if (!folio_test_lru(folio))
  			goto isolate_fail_put;
  
+		is_unevictable = folio_test_unevictable(folio);
+
  		/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
  			goto isolate_fail_put;
  
  		/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
  			goto isolate_fail_put;
  
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
  
  			/*
  			 * Only folios without mappings or that have
-			 * a ->migrate_folio callback are possible to
-			 * migrate without blocking. However, we may
-			 * be racing with truncation, which can free
-			 * the mapping. Truncation holds the folio lock
-			 * until after the folio is removed from the page
-			 * cache so holding it ourselves is sufficient.
+			 * a ->migrate_folio callback are possible to migrate
+			 * without blocking.
+			 *
+			 * Folios from unmovable mappings are not migratable.
+			 *
+			 * However, we can be racing with truncation, which can
+			 * free the mapping that we need to check. Truncation
+			 * holds the folio loc
Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
On 10/27/23 20:21, Sean Christopherson wrote:

Export anon_inode_getfile_secure() so that it can be used by KVM to
create and manage file-based guest memory without need a fullblow

without introducing a full-blown

Otherwise,

Reviewed-by: Paolo Bonzini

Paolo

filesystem. The "standard" anon_inode_getfd() doesn't work for KVM's use
case as KVM needs a unique inode for each file, e.g. to be able to
independently manage the size and lifecycle of a given file.
Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
On 10/27/23 20:22, Sean Christopherson wrote:

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to
reduce the probability of exiting to userspace with a stale
run->exit_reason that *appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of
using e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also
likely utilize the same functionality to let userspace "intercept" and
handle memory faults when the userspace mapping is missing, i.e. when
fast gup() fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a non-zero,
negative return code, e.g. -EFAULT or -EHWPOISON. And because KVM already
returns -EFAULT in many paths, there's a relatively high probability that
KVM could return -EFAULT without setting run->exit_reason, in which case
reporting KVM_EXIT_UNKNOWN is much better than reporting whatever exit
reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson
---
  arch/x86/kvm/x86.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
  {
  	int r;
  
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
  	vcpu->arch.l1tf_flush_l1d = true;
  
  	for (;;) {

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
On 10/27/23 20:22, Sean Christopherson wrote:

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Reviewed-by: Paolo Bonzini
Re: [PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory
On 10/27/23 20:22, Sean Christopherson wrote:

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex
(understatement), difficult to debug, don't support running as nested
guests, and require hardware that isn't universally accessible. I.e.
relying on SEV-SNP or TDX for maintaining guest private memory isn't a
realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson

Reviewed-by: Paolo Bonzini

with one nit:

+---------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types. The 1-setting of bit @n

s/support/supported/

Paolo
Re: [PATCH v13 00/35] KVM: guest_memfd() and per-page attributes
On 10/27/23 20:21, Sean Christopherson wrote:

Non-KVM people, please take a gander at two small-ish patches buried in
the middle of this series:

  fs: Export anon_inode_getfile_secure() for use by KVM
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Our plan/hope is to take this through the KVM tree for 6.8, reviews (and
acks!) would be much appreciated. Note, adding AS_UNMOVABLE isn't
strictly required as it's "just" an optimization, but we'd prefer to have
it in place straightaway.

Reporting what I wrote in the other thread, for wider distribution:

I'm going to wait a couple days more for reviews to come in, post a v14
myself, and apply the series to kvm/next as soon as Linus merges the 6.7
changes. The series will be based on the 6.7 tags/for-linus, and when
6.7-rc1 comes out, I'll do this to straighten the history:

  git checkout kvm/next
  git tag -s -f kvm-gmem HEAD
  git reset --hard v6.7-rc1
  git merge tags/kvm-gmem
  # fix conflict with Christian Brauner's VFS series
  git commit
  git push kvm

6.8 is not going to be out for four months, and I'm pretty sure that
anything that would be discovered within "a few weeks" can also be
applied on top, and the heaviness of a 35-patch series will outweigh any
imperfections by a long margin.

(Full disclosure: this is _also_ because I want to apply this series to
the RHEL kernel, and Red Hat has a high level of disdain for non-upstream
patches. But it's mostly because I want all dependencies to be able to
move on and be developed on top of stock kvm/next).

Paolo
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 9:53 AM David Matlack wrote:
>
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > From: Chao Peng
> >
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
>
> What is the reason to separate range_add() from begin()?

Nevermind, I see how it's needed in kvm_mmu_unmap_gfn_range().
Re: [PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry
On Mon, Oct 30, 2023 at 10:01 AM Paolo Bonzini wrote:
>
> On Mon, Oct 30, 2023 at 5:53 PM David Matlack wrote:
> >
> > On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > > From: Chao Peng
> > >
> > > Currently in mmu_notifier invalidate path, hva range is recorded and
> > > then checked against by mmu_notifier_retry_hva() in the page fault
> > > handling path. However, for the to be introduced private memory, a page
> >
> > Is there a missing word here?
>
> No but there could be missing hyphens ("for the to-be-introduced
> private memory"); possibly a "soon" could help parsing and that is
> what you were talking about?

Ah that explains it :)

> > > 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > > +		kvm->mmu_invalidate_range_start = INVALID_GPA;
> > > +		kvm->mmu_invalidate_range_end = INVALID_GPA;
> >
> > I don't think this is incorrect, but I was a little surprised to see this
> > here rather than in end() when mmu_invalidate_in_progress decrements to
> > 0.
>
> I think that would be incorrect on the very first start?

Good point. KVM could initialize start/end before registering notifiers,
but that's extra code.
Re: [PATCH v7 1/5] powerpc/code-patching: introduce patch_instructions()
Hi Aneesh,

On 30/10/23 6:32 pm, Aneesh Kumar K.V wrote:

Hari Bathini writes:

patch_instruction() entails setting up pte, patching the instruction,
clearing the pte and flushing the tlb. If multiple instructions need
to be patched, every instruction would have to go through the above
drill unnecessarily. Instead, introduce patch_instructions() function
that sets up the pte, clears the pte and flushes the tlb only once
per page range of instructions to be patched. Duplicate most of the
patch_instruction() code instead of merging with it, to avoid the
performance degradation observed on ppc32, for patch_instruction(),
with the code path merged. Also, setup poking_init() always as BPF
expects poking_init() to be setup even when STRICT_KERNEL_RWX is off.

Signed-off-by: Hari Bathini
Acked-by: Song Liu

A lot of this is a duplicate of patch_instruction(). Can we consolidate
things between them?

True. The code was consolidated till v5 but had to duplicate most of it
to avoid the performance degradation reported on ppc32:

https://lore.kernel.org/all/6cceb564-8b52-4d98-9118-92a914f48...@csgroup.eu/

---

Changes in v7:
* Fixed crash observed with !STRICT_RWX.

  arch/powerpc/include/asm/code-patching.h |   1 +
  arch/powerpc/lib/code-patching.c         | 141 ++++++++++++++++++++++-
  2 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 3f881548fb61..0e29ccf903d0 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -74,6 +74,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
  int patch_branch(u32 *addr, unsigned long target, int flags);
  int patch_instruction(u32 *addr, ppc_inst_t instr);
  int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
+int patch_instructions(u32 *addr, u32 *code, size_t len, bool repeat_instr);
  
  static inline unsigned long patch_site_addr(s32 *site)
  {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index b00112d7ad46..e1c1fd9246d8 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -204,9 +204,6 @@ void __init poking_init(void)
  {
  	int ret;
  
-	if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
-		return;
-
  	if (mm_patch_enabled())
  		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
  					"powerpc/text_poke_mm:online",
@@ -378,6 +375,144 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)
  }
  NOKPROBE_SYMBOL(patch_instruction);
  
+static int __patch_instructions(u32 *patch_addr, u32 *code, size_t len, bool repeat_instr)
+{
+	unsigned long start = (unsigned long)patch_addr;
+
+	/* Repeat instruction */
+	if (repeat_instr) {
+		ppc_inst_t instr = ppc_inst_read(code);
+
+		if (ppc_inst_prefixed(instr)) {
+			u64 val = ppc_inst_as_ulong(instr);
+
+			memset64((u64 *)patch_addr, val, len / 8);
+		} else {
+			u32 val = ppc_inst_val(instr);
+
+			memset32(patch_addr, val, len / 4);
+		}
+	} else {
+		memcpy(patch_addr, code, len);
+	}
+
+	smp_wmb();	/* smp write barrier */
+	flush_icache_range(start, start + len);
+	return 0;
+}
+
+/*
+ * A page is mapped and instructions that fit the page are patched.
+ * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
+ */
+static int __do_patch_instructions_mm(u32 *addr, u32 *code, size_t len, bool repeat_instr)
+{
+	struct mm_struct *patching_mm, *orig_mm;
+	unsigned long pfn = get_patch_pfn(addr);
+	unsigned long text_poke_addr;
+	spinlock_t *ptl;
+	u32 *patch_addr;
+	pte_t *pte;
+	int err;
+
+	patching_mm = __this_cpu_read(cpu_patching_context.mm);
+	text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
+	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+
+	pte = get_locked_pte(patching_mm, text_poke_addr, &ptl);
+	if (!pte)
+		return -ENOMEM;
+
+	__set_pte_at(patching_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
+
+	/* order PTE update before use, also serves as the hwsync */
+	asm volatile("ptesync" ::: "memory");
+
+	/* order context switch after arbitrary prior code */
+	isync();
+
+	orig_mm = start_using_temp_mm(patching_mm);
+
+	err = __patch_instructions(patch_addr, code, len, repeat_instr);
+
+	/* context synchronisation performed by __patch_instructions */
+	stop_using_temp_mm(patching_mm, orig_mm);
+
+	pte_clear(patching_mm, text_poke_addr, pte);
+	/*
+	 * ptesync to order PTE update before TLB invalidation done
+	 * by radix__local_flush_tlb_page_psize (in _tlbiel_va)
+	 */
+	local_flush_tlb_page_psize(patchin
[PATCH] scsi: ibmvfc: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings [1]
and as such we should prefer more robust and less ambiguous string
interfaces.

We expect these fields to be NUL-terminated as the property names from
which they are derived are also NUL-terminated.

Moreover, NUL-padding is not required as our destination buffers are
already zero-allocated and any future NUL-byte assignments are redundant
(like the ones that strncpy() does).

ibmvfc_probe() ->
|	struct ibmvfc_host *vhost;
|	struct Scsi_Host *shost;
	...
|	shost = scsi_host_alloc(&driver_template, sizeof(*vhost));
	...
	**side note: is this a bug? Looks like a typo to me          ^**
	...
|	vhost = shost_priv(shost);
	...

where shost_priv() is:
|	static inline void *shost_priv(struct Scsi_Host *shost)
|	{
|		return (void *)shost->hostdata;
|	}

... and scsi_host_alloc() ->
|	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);

And for login_info->..., NUL-padding is also not required as it is
explicitly memset to 0:
|	memset(login_info, 0, sizeof(*login_info));

Considering the above, a suitable replacement is `strscpy` [2] due to the
fact that it guarantees NUL-termination on the destination buffer without
unnecessary NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-harden...@vger.kernel.org
Signed-off-by: Justin Stitt
---
Note: build-tested only.

Found with: $ rg "strncpy\("
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index ce9eb00e2ca0..e73a39b1c832 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1455,7 +1455,7 @@ static void ibmvfc_gather_partition_info(struct ibmvfc_host *vhost)
 
 	name = of_get_property(rootdn, "ibm,partition-name", NULL);
 	if (name)
-		strncpy(vhost->partition_name, name, sizeof(vhost->partition_name));
+		strscpy(vhost->partition_name, name, sizeof(vhost->partition_name));
 	num = of_get_property(rootdn, "ibm,partition-no", NULL);
 	if (num)
 		vhost->partition_number = *num;
@@ -1498,13 +1498,15 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
 	login_info->async.va = cpu_to_be64(vhost->async_crq.msg_token);
 	login_info->async.len = cpu_to_be32(async_crq->size *
 					    sizeof(*async_crq->msgs.async));
-	strncpy(login_info->partition_name, vhost->partition_name, IBMVFC_MAX_NAME);
-	strncpy(login_info->device_name,
-		dev_name(&vhost->host->shost_gendev), IBMVFC_MAX_NAME);
+	strscpy(login_info->partition_name, vhost->partition_name,
+		sizeof(login_info->partition_name));
+
+	strscpy(login_info->device_name,
+		dev_name(&vhost->host->shost_gendev), sizeof(login_info->device_name));
 
 	location = of_get_property(of_node, "ibm,loc-code", NULL);
 	location = location ? location : dev_name(vhost->dev);
-	strncpy(login_info->drc_name, location, IBMVFC_MAX_NAME);
+	strscpy(login_info->drc_name, location, sizeof(login_info->drc_name));
 }
 
 /**

---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231030-strncpy-drivers-scsi-ibmvscsi-ibmvfc-c-ccfce3255d58

Best regards,
--
Justin Stitt
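
To see the behavioral difference the changelog describes, here is a small,
self-contained userspace sketch. strscpy() itself is kernel-only (declared
in <linux/string.h>), so strscpy_demo() below is a hypothetical stand-in
with the same contract:

	#include <stdio.h>
	#include <string.h>

	/* Stand-in for the kernel's strscpy(): copies at most size-1 bytes,
	 * always NUL-terminates, and returns -E2BIG (-7 here) on truncation. */
	static long strscpy_demo(char *dst, const char *src, size_t size)
	{
		size_t len = strnlen(src, size);

		if (!size)
			return -7; /* -E2BIG */
		if (len >= size) {
			memcpy(dst, src, size - 1);
			dst[size - 1] = '\0';
			return -7; /* truncated: -E2BIG */
		}
		memcpy(dst, src, len + 1);
		return (long)len;
	}

	int main(void)
	{
		char a[8], b[8];

		/* strncpy() leaves dst WITHOUT a terminating NUL when src is
		 * at least as long as the buffer. */
		strncpy(a, "partition-name", sizeof(a));
		printf("strncpy terminated? %s\n",
		       memchr(a, '\0', sizeof(a)) ? "yes" : "no");

		/* strscpy()-style copy: always terminated, truncation reported. */
		long ret = strscpy_demo(b, "partition-name", sizeof(b));
		printf("strscpy-style: \"%s\" ret=%ld\n", b, ret);
		return 0;
	}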
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> >
> > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > +		size = sizeof(struct kvm_userspace_memory_region);
>
> This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> access of the commit message becomes a kernel stack read.

Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
though, as '0' is a valid file descriptor and a valid file offset.

> Probably worth adding a check on valid flags here.

Definitely needed. There's a very real bug here. But rather than duplicate
flags checking or plumb @ioctl all the way to __kvm_set_memory_region(), now
that we have the fancy guard(mutex) and there are no internal calls to
kvm_set_memory_region(), what if we:

  1. Acquire/release slots_lock in __kvm_set_memory_region()
  2. Call kvm_set_memory_region() from x86 code for the internal memslots
  3. Disallow *any* flags for internal memslots
  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
     only for KVM_SET_USER_MEMORY_REGION2

E.g. this over ~5 patches

---
 arch/x86/kvm/x86.c       |  2 +-
 include/linux/kvm_host.h |  4 +--
 virt/kvm/kvm_main.c      | 65 +---
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..dd3e2017366c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12478,7 +12478,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 		m.guest_phys_addr = gpa;
 		m.userspace_addr = hva;
 		m.memory_size = size;
-		r = __kvm_set_memory_region(kvm, &m);
+		r = kvm_set_memory_region(kvm, &m);
 		if (r < 0)
 			return ERR_PTR_USR(r);
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 687589ce9f63..fbb98efe8200 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1170,7 +1170,7 @@ static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_
  * -- just change its flags
  *
  * Since flags can be changed by some of these operations, the following
- * differentiation is the best we can do for __kvm_set_memory_region():
+ * differentiation is the best we can do for __kvm_set_memory_region().
  */
 enum kvm_mr_change {
 	KVM_MR_CREATE,
@@ -1181,8 +1181,6 @@ enum kvm_mr_change {
 
 int kvm_set_memory_region(struct kvm *kvm,
 			  const struct kvm_userspace_memory_region2 *mem);
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23633984142f..39ceee2f67f2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,28 +1608,6 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(struct kvm *kvm,
-				     const struct kvm_userspace_memory_region2 *mem)
-{
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
-
-	if (kvm_arch_has_private_mem(kvm))
-		valid_flags |= KVM_MEM_PRIVATE;
-
-	/* Dirty logging private memory is not currently supported. */
-	if (mem->flags & KVM_MEM_PRIVATE)
-		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-
-#ifdef __KVM_HAVE_READONLY_MEM
-	valid_flags |= KVM_MEM_READONLY;
-#endif
-
-	if (mem->flags & ~valid_flags)
-		return -EINVAL;
-
-	return 0;
-}
-
 static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 {
 	struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
@@ -2014,11 +1992,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * space.
  *
  * Discontiguous memory is allowed, mostly for framebuffers.
- *
- * Must be called holding kvm->slots_lock for write.
  */
-int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region2 *mem)
+static int __kvm_set_memory_region(struct kvm *kvm,
+				   const struct kvm_userspace_memory_region2 *mem)
 {
 	struct kvm_memory_slot *old, *new;
 	struct kvm_memslots *slots;
@@ -2028,9 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(kvm, mem);
-	if (r)
-		return r;
+	guard(mutex)(&kvm->slots_lock);
 
 	as_id = mem->slot >> 16;
 	id = (u16)mem->slot;
@@ -2139,27 +2113,42 @@ int __kvm_set_memory_region(str
[linux-next:master] BUILD REGRESSION c503e3eec382ac708ee7adf874add37b77c5d312
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: c503e3eec382ac708ee7adf874add37b77c5d312  Add linux-next specific files for 20231030

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202310052201.anvbpgpr-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310121157.arky475m-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310171905.azfrkoid-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310261059.usl6vstf-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302025.pokm9ves-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302043.as36ufed-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302206.pkr5ebdi-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310302338.mppqar10-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202310310055.rdwlonpr-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/iio/imu/bosch,bma400.yaml
aarch64-linux-ld: drivers/cxl/core/pci.c:921:(.text+0xbbc): undefined reference to `pci_print_aer'
arch/loongarch/kvm/mmu.c:411:6: warning: no previous prototype for 'kvm_unmap_gfn_range' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:420:48: error: invalid use of undefined type 'struct kvm_gfn_range'
arch/loongarch/kvm/mmu.c:422:1: error: control reaches end of non-void function [-Werror=return-type]
arch/loongarch/kvm/mmu.c:424:6: warning: no previous prototype for 'kvm_set_spte_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:456:6: warning: no previous prototype for 'kvm_age_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:468:6: warning: no previous prototype for 'kvm_test_age_gfn' [-Wmissing-prototypes]
arch/loongarch/kvm/mmu.c:787:24: error: 'struct kvm' has no member named 'mmu_invalidate_seq'; did you mean 'mn_invalidate_lock'?
arch/loongarch/kvm/mmu.c:810:13: error: implicit declaration of function 'mmu_invalidate_retry_hva' [-Werror=implicit-function-declaration]
arch/riscv/include/asm/mmio.h:67:(.text+0xd66): undefined reference to `pci_print_aer'
csky-linux-ld: pci.c:(.text+0x6e8): undefined reference to `pci_print_aer'
drivers/cxl/core/pci.c:921: undefined reference to `pci_print_aer'
drivers/cxl/core/pci.c:921:(.text+0xbc0): undefined reference to `pci_print_aer'
drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn35/dcn35_clk_mgr.c:1105 dcn35_clk_mgr_construct() warn: variable dereferenced before check 'ctx->dc_bios' (see line 1032)
drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn35/dcn35_clk_mgr.c:1105 dcn35_clk_mgr_construct() warn: variable dereferenced before check 'ctx->dc_bios->integrated_info' (see line 1032)
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:286:45: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/smu_v13_0_6_ppt.c:286:52: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu14/smu_v14_0.c:72:45: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu14/smu_v14_0.c:72:52: warning: '%s' directive output may be truncated writing up to 29 bytes into a region of size 23 [-Wformat-truncation=]
drivers/hwtracing/coresight/coresight-etm4x-core.c:1183:24: warning: result of comparison of constant 256 with expression of type 'u8' (aka 'unsigned char') is always false [-Wtautological-constant-out-of-range-compare]
include/linux/atomic/atomic-arch-fallback.h:384:27: error: implicit declaration of function 'arch_cmpxchg_local'; did you mean 'raw_cmpxchg_local'? [-Werror=implicit-function-declaration]
include/linux/compiler.h:212:29: error: pasting "__addressable_" and "(" does not give a valid preprocessing token
include/linux/export.h:47:9: error: pasting "__export_symbol_" and "(" does not give a valid preprocessing token
include/linux/stddef.h:8:15: error: expected declaration specifiers or '...' before '(' token
include/linux/stddef.h:8:16: error: expected identifier or '(' before 'void'
include/linux/stddef.h:8:23: error: expected ')' before numeric constant
include/linux/stddef.h:8:24: error: pasting ")" and "218" does not give a valid preprocessing token
kernel/bpf/task_iter.c:917:29: error: use of undeclared identifier 'CSS_TASK_ITER_THREADED'
kernel/bpf/task_iter.c:917:7: error: use of undeclared iden
[PATCH] scsi: ibmvscsi: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings [1]
and as such we should prefer more robust and less ambiguous string
interfaces.

We expect partition_name to be NUL-terminated based on its usage with
format strings:
|	dev_info(hostdata->dev, "host srp version: %s, "
|		 "host partition %s (%d), OS %d, max io %u\n",
|		 hostdata->madapter_info.srp_version,
|		 hostdata->madapter_info.partition_name,
|		 be32_to_cpu(hostdata->madapter_info.partition_number),
|		 be32_to_cpu(hostdata->madapter_info.os_type),
|		 be32_to_cpu(hostdata->madapter_info.port_max_txu[0]));
...
|	len = snprintf(buf, PAGE_SIZE, "%s\n",
|		       hostdata->madapter_info.partition_name);

Moreover, NUL-padding is not required as madapter_info is explicitly
memset to 0:
|	memset(&hostdata->madapter_info, 0x00,
|	       sizeof(hostdata->madapter_info));

Considering the above, a suitable replacement is `strscpy` [2] due to the
fact that it guarantees NUL-termination on the destination buffer without
unnecessary NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-harden...@vger.kernel.org
Signed-off-by: Justin Stitt
---
Note: build-tested only.

Found with: $ rg "strncpy\("
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 59599299615d..71f3e9563520 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -266,7 +266,7 @@ static void set_adapter_info(struct ibmvscsi_host_data *hostdata)
 	dev_info(hostdata->dev, "SRP_VERSION: %s\n", SRP_VERSION);
 	strcpy(hostdata->madapter_info.srp_version, SRP_VERSION);
 
-	strncpy(hostdata->madapter_info.partition_name, partition_name,
+	strscpy(hostdata->madapter_info.partition_name, partition_name,
 		sizeof(hostdata->madapter_info.partition_name));
 
 	hostdata->madapter_info.partition_number =

---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231030-strncpy-drivers-scsi-ibmvscsi-ibmvscsi-c-3b5019ce50ad

Best regards,
--
Justin Stitt
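
One hedged follow-up thought, not part of the patch itself: unlike
strncpy(), strscpy() reports truncation through its return value, so a
caller that cares could check it. A sketch of what that might look like
(the dev_warn() message is illustrative only):

	ssize_t ret;

	ret = strscpy(hostdata->madapter_info.partition_name, partition_name,
		      sizeof(hostdata->madapter_info.partition_name));
	if (ret == -E2BIG)
		dev_warn(hostdata->dev, "partition name was truncated\n");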
Re: [PATCH v8 00/30] Add support for QMC HDLC, framer infrastructure and PEF2256 framer
On Wed, Oct 25, 2023 at 12:32:15PM -0700, Jakub Kicinski wrote: > On Wed, 25 Oct 2023 17:00:51 +0200 Herve Codina wrote: > > > Which way will those patches go? Via some FSL SoC tree? > > > > This series seems mature now. > > What is the plan next in order to have it applied ? > > > > Don't hesitate to tell me if you prefer split series. > > FWIW we are happy to take the drivers/net/ parts if there is no hard > dependency. But there's no point taking that unless the SoC bits > also go in for 6.7. > > Li Yang, what are your expectations WRT merging this series? I think it is too late for SoC stuff for 6.7. I picked up binding patches 6, 7, and 8 because 6 and 7 are the same as an additionalProperties fix I have in my tree. As 8 depends on them, I just picked it up too. Rob
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Mon, Oct 30, 2023, Sean Christopherson wrote: > On Mon, Oct 30, 2023, Chao Gao wrote: > > On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote: > > >From: Chao Peng > > > > > >In confidential computing usages, whether a page is private or shared is > > >necessary information for KVM to perform operations like page fault > > >handling, page zapping etc. There are other potential use cases for > > >per-page memory attributes, e.g. to make memory read-only (or no-exec, > > >or exec-only, etc.) without having to modify memslots. > > > > > >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow > > >userspace to operate on the per-page memory attributes. > > > - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to > > >a guest memory range. > > > > > - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported > > >memory attributes. > > > > This ioctl() is already removed. So, the changelog is out-of-date and needs > > an update. > > Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below. > > > >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES > > >+:Architectures: x86 > > >+:Type: vm ioctl > > >+:Parameters: struct kvm_memory_attributes(in) > > > >^ add one space here? > > Ah, yeah, that does appear to be the standard. > > > > > > >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm, > > >+struct kvm_gfn_range *range) > > >+{ > > >+ /* > > >+ * Unconditionally add the range to the invalidation set, regardless of > > >+ * whether or not the arch callback actually needs to zap SPTEs. E.g. > > >+ * if KVM supports RWX attributes in the future and the attributes are > > >+ * going from R=>RW, zapping isn't strictly necessary. Unconditionally > > >+ * adding the range allows KVM to require that MMU invalidations add at > > >+ * least one range between begin() and end(), e.g. allows KVM to detect > > >+ * bugs where the add() is missed. Rexlaing the rule *might* be safe, > > > > Relaxing > > > > >@@ -4640,6 +4850,17 @@ static int > > >kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > > > case KVM_CAP_BINARY_STATS_FD: > > > case KVM_CAP_SYSTEM_EVENT_DATA: > > > return 1; > > >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES > > >+ case KVM_CAP_MEMORY_ATTRIBUTES: > > >+ u64 attrs = kvm_supported_mem_attributes(kvm); > > >+ > > >+ r = -EFAULT; > > >+ if (copy_to_user(argp, &attrs, sizeof(attrs))) > > >+ goto out; > > >+ r = 0; > > >+ break; > > > > This cannot work, e.g., no @argp in this function and is fixed by a later > > commit: > > > > fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for > > guest-specific backing memory") > > I'll post a fixup patch for all of these, thanks much! Heh, that was an -ENOCOFFEE. Fixup patches for a changelog goof and an ephemeral bug are going to be hard to post. Paolo, do you want to take care of all of these fixups and typos, or would you prefer that I start a v14 branch and then hand it off to you at some point?
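
For reference, a sketch of how userspace would drive the surviving ioctl,
assuming the uAPI as it stands in this series (the struct layout and the
KVM_MEMORY_ATTRIBUTE_PRIVATE value may still change before merge):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Convert a page-aligned GPA range to private. */
	static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size)
	{
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = size,	/* must be page-aligned */
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
			.flags      = 0,	/* reserved, must be zero */
		};

		return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}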
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On Mon, Oct 30, 2023, Paolo Bonzini wrote: > On 10/27/23 20:21, Sean Christopherson wrote: > > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t > > __kvm_handle_hva_range(struct kvm *kvm, > > * the second or later invocation of the handler). > > */ > > gfn_range.arg = range->arg; > > + > > + /* > > +* HVA-based notifications aren't relevant to private > > +* mappings as they don't have a userspace mapping. > > It's confusing who "they" is. Maybe > >* HVA-based notifications provide a userspace address, >* and as such are only relevant for shared mappings. Works for me.
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > >
> > > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > +		size = sizeof(struct kvm_userspace_memory_region);
> >
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> > access of the commit message becomes a kernel stack read.
>
> Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.
>
> > Probably worth adding a check on valid flags here.
>
> Definitely needed. There's a very real bug here. But rather than duplicate
> flags checking or plumb @ioctl all the way to __kvm_set_memory_region(), now
> that we have the fancy guard(mutex) and there are no internal calls to
> kvm_set_memory_region(), what if we:
>
>  1. Acquire/release slots_lock in __kvm_set_memory_region()

Gah, this won't work with x86's usage, which acquires slots_lock quite far
away from __kvm_set_memory_region() in several places. The rest of the idea
still works, kvm_vm_ioctl_set_memory_region() will just need to take
slots_lock manually.

>  2. Call kvm_set_memory_region() from x86 code for the internal memslots
>  3. Disallow *any* flags for internal memslots
>  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
>     only for KVM_SET_USER_MEMORY_REGION2
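
A hedged sketch of the adjusted shape being described, assuming the scoped
guard() helper from <linux/cleanup.h>; the rest of the handler is elided
and this is not the final patch:

	static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
				struct kvm_userspace_memory_region2 *mem)
	{
		if (mem->slot >= KVM_USER_MEM_SLOTS)
			return -EINVAL;

		/* slots_lock is dropped automatically on every return path. */
		guard(mutex)(&kvm->slots_lock);

		return __kvm_set_memory_region(kvm, mem);
	}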
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/30/23 21:25, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > >
> > > +	if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > +		size = sizeof(struct kvm_userspace_memory_region);
> >
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the
> > out-of-bounds access of the commit message becomes a kernel stack read.
>
> Ouch. There's some irony. Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.

Either is okay, because unless the flags check is screwed up it should not
matter. The memset is actually unnecessary, though it may be a good idea
anyway to keep it, aka belt-and-suspenders.

> > Probably worth adding a check on valid flags here.
>
> Definitely needed. There's a very real bug here. But rather than duplicate
> flags checking or plumb @ioctl all the way to __kvm_set_memory_region(),
> now that we have the fancy guard(mutex) and there are no internal calls to
> kvm_set_memory_region(), what if we:
>
>  1. Acquire/release slots_lock in __kvm_set_memory_region()
>  2. Call kvm_set_memory_region() from x86 code for the internal memslots
>  3. Disallow *any* flags for internal memslots
>  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()

I dislike this step: there is a clear point where all paths meet
(ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
I think that's the place where flags should be checked. (I don't mind the
restriction on internal memslots; it's just that to me it's not a
particularly natural way to structure the checks.)

On the other hand, the place to protect from out-of-bounds accesses is the
place where you stop caring about struct kvm_userspace_memory_region vs.
kvm_userspace_memory_region2 (and your code gets it right, by dropping
"ioctl" as soon as possible).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 87f45aa91ced..fe5a2af14fff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 	return true;
 }
 
+/*
+ * Flags that do not access any of the extra space of struct
+ * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
+ * only allows these.
+ */
+#define KVM_SET_USER_MEMORY_REGION_FLAGS \
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
 {
@@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
 		struct kvm_userspace_memory_region2 mem;
 		unsigned long size;
 
-		if (ioctl == KVM_SET_USER_MEMORY_REGION)
+		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
+			/*
+			 * Fields beyond struct kvm_userspace_memory_region shouldn't
+			 * be accessed, but avoid leaking kernel memory in case of a bug.
+			 */
+			memset(&mem, 0, sizeof(mem));
 			size = sizeof(struct kvm_userspace_memory_region);
-		else
+		} else {
 			size = sizeof(struct kvm_userspace_memory_region2);
+		}
 
 		/* Ensure the common parts of the two structs are identical. */
 		SANITY_CHECK_MEM_REGION_FIELD(slot);
@@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
 		if (copy_from_user(&mem, argp, size))
 			goto out;
 
+		r = -EINVAL;
+		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
+		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
+			goto out;
+
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}

That's a kind of patch that you can't really get wrong (though I have the
brown paper bag ready).

Maintenance-wise it's fine, since flags are being added at a pace of
roughly one every five years, and anyway it's also future proof: I placed
the #define near check_memory_region_flags so that in five years we
remember to keep it up to date. But worst case, the new flags will only be
allowed by KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security
issues waiting to bite us.

In sum, this is exactly the only kind of fix that should be in the v13->v14
delta.

Paolo
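
To make the userspace-visible effect concrete, a hypothetical caller of the
two ioctls (field values are illustrative; host_mem and host_mem2 are
placeholder host mappings):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Legacy ioctl: only flags that don't touch the new fields pass. */
	struct kvm_userspace_memory_region old_r = {
		.slot            = 0,
		.flags           = KVM_MEM_LOG_DIRTY_PAGES,	/* allowed */
		.guest_phys_addr = 0x0,
		.memory_size     = 0x200000,
		.userspace_addr  = (uint64_t)host_mem,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &old_r);

	/* KVM_MEM_PRIVATE implies reading fields beyond the legacy struct,
	 * so with the fix above it is rejected on the legacy ioctl and only
	 * accepted via KVM_SET_USER_MEMORY_REGION2. */
	struct kvm_userspace_memory_region2 new_r = {
		.slot            = 1,
		.flags           = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0x200000,
		.memory_size     = 0x200000,
		.userspace_addr  = (uint64_t)host_mem2,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &new_r);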
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On Tue, Oct 31, 2023, Paolo Bonzini wrote:
> On 10/30/23 21:25, Sean Christopherson wrote:
> > > Probably worth adding a check on valid flags here.
> >
> > Definitely needed. There's a very real bug here. But rather than
> > duplicate flags checking or plumb @ioctl all the way to
> > __kvm_set_memory_region(), now that we have the fancy guard(mutex) and
> > there are no internal calls to kvm_set_memory_region(), what if we:
> >
> >    1. Acquire/release slots_lock in __kvm_set_memory_region()
> >    2. Call kvm_set_memory_region() from x86 code for the internal memslots
> >    3. Disallow *any* flags for internal memslots
> >    4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>
> I dislike this step: there is a clear point where all paths meet
> (ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
> I think that's the place where flags should be checked. (I don't mind
> the restriction on internal memslots; it's just that to me it's not a
> particularly natural way to structure the checks.)

Yeah, I just don't like the discrepancy it causes where some flags are
explicitly checked and allowed, while others are allowed and then later
disallowed.

> On the other hand, the place to protect from out-of-bounds accesses is
> the place where you stop caring about struct kvm_userspace_memory_region
> vs. kvm_userspace_memory_region2 (and your code gets it right, by
> dropping "ioctl" as soon as possible).
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 87f45aa91ced..fe5a2af14fff 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
>  	return true;
>  }
>  
> +/*
> + * Flags that do not access any of the extra space of struct
> + * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
> + * only allows these.
> + */
> +#define KVM_SET_USER_MEMORY_REGION_FLAGS \

Can we name this KVM_SET_USER_MEMORY_REGION_LEGACY_FLAGS, or something
equally horrific? As is, this sounds way too much like a generic "allowed
flags for any memory region".

Or maybe invert the macro? I.e. something to make it more obvious that it's
effectively a versioning check, not a generic "what's supported?" check.

	#define KVM_SET_USER_MEMORY_FLAGS_V2_ONLY \
		(~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))

> +	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
> +
>  static int check_memory_region_flags(struct kvm *kvm,
>  				     const struct kvm_userspace_memory_region2 *mem)
>  {
> @@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
>  		struct kvm_userspace_memory_region2 mem;
>  		unsigned long size;
>  
> -		if (ioctl == KVM_SET_USER_MEMORY_REGION)
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION) {
> +			/*
> +			 * Fields beyond struct kvm_userspace_memory_region shouldn't
> +			 * be accessed, but avoid leaking kernel memory in case of a bug.
> +			 */
> +			memset(&mem, 0, sizeof(mem));
>  			size = sizeof(struct kvm_userspace_memory_region);
> -		else
> +		} else {
>  			size = sizeof(struct kvm_userspace_memory_region2);
> +		}
>  
>  		/* Ensure the common parts of the two structs are identical. */
>  		SANITY_CHECK_MEM_REGION_FIELD(slot);
> @@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
>  		if (copy_from_user(&mem, argp, size))
>  			goto out;
>  
> +		r = -EINVAL;
> +		if (ioctl == KVM_SET_USER_MEMORY_REGION &&
> +		    (mem.flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
> +			goto out;
> +
>  		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>  		break;
>  	}
>
> That's a kind of patch that you can't really get wrong (though I have
> the brown paper bag ready).
>
> Maintenance-wise it's fine, since flags are being added at a pace of
> roughly one every five years,

Heh, true.

> and anyway it's also future proof: I placed the #define near
> check_memory_region_flags so that in five years we remember to keep it up
> to date. But worst case, the new flags will only be allowed by
> KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
> waiting to bite us.
>
> In sum, this is exactly the only kind of fix that should be in the
> v13->v14 delta.

Boiling the ocean can be fun too ;-)
Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2
On 10/28/2023 2:21 AM, Sean Christopherson wrote:

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail. The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes
detection of bad flags a bit more robust, e.g. if the new fd field is
guarded only by a flag and not a new ioctl(), then a userspace bug (setting
a "bad" flag) would generate out-of-bounds access instead of an -EINVAL
error.

Cc: Jarkko Sakkinen
Reviewed-by: Paolo Bonzini
Reviewed-by: Xiaoyao Li
Signed-off-by: Sean Christopherson
---
 Documentation/virt/kvm/api.rst | 21 +++
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/kvm_host.h       |  4 ++--
 include/uapi/linux/kvm.h       | 13 
 virt/kvm/kvm_main.c            | 38 +++---
 5 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..ace984acc125 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */

missing __u64 pad[16];

+  };
+
+See KVM_SET_USER_MEMORY_REGION.
+
 5. The kvm_run structure
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..6409914428ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 	}
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		struct kvm_userspace_memory_region m;
+		struct kvm_userspace_memory_region2 m;
 
 		m.slot = id | (i << 16);
 		m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-			  const struct kvm_userspace_memory_region *mem);
+			  const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-			    const struct kvm_userspace_memory_region *mem);
+			    const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO, 0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+					 struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e7
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 10/28/2023 2:21 AM, Sean Christopherson wrote:

...

+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_PRIVATE in
+flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6087,9 +6096,24 @@ applied.
 	__u32 slot;
 	__u32 flags;
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u64 guest_memfd_offset;

missing a tab

+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
   };
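
Putting the two new ioctls together, a hypothetical userspace flow (names
taken from this series' proposed uAPI; the sizes, addresses and
shared_backing mapping below are placeholders, and the uAPI may still
change before merge):

	/* Create the guest-private backing file... */
	struct kvm_create_guest_memfd gmem = {
		.size  = 0x400000,	/* must be page-aligned */
		.flags = 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* ...and bind it to a memslot via the v2 ioctl. */
	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_PRIVATE,
		.guest_phys_addr    = 0x100000,
		.memory_size        = gmem.size,
		.userspace_addr     = (uint64_t)shared_backing, /* shared pages */
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0, /* page-aligned; range must fit the file */
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);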
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
>+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>+{
>+	loff_t size = args->size;
>+	u64 flags = args->flags;
>+	u64 valid_flags = 0;
>+
>+	if (flags & ~valid_flags)
>+		return -EINVAL;
>+
>+	if (size < 0 || !PAGE_ALIGNED(size))
>+		return -EINVAL;

is size == 0 a valid case?

>+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>+		  unsigned int fd, loff_t offset)
>+{
>+	loff_t size = slot->npages << PAGE_SHIFT;
>+	unsigned long start, end;
>+	struct kvm_gmem *gmem;
>+	struct inode *inode;
>+	struct file *file;
>+
>+	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>+
>+	file = fget(fd);
>+	if (!file)
>+		return -EBADF;
>+
>+	if (file->f_op != &kvm_gmem_fops)
>+		goto err;
>+
>+	gmem = file->private_data;
>+	if (gmem->kvm != kvm)
>+		goto err;
>+
>+	inode = file_inode(file);
>+
>+	if (offset < 0 || !PAGE_ALIGNED(offset))
>+		return -EINVAL;

goto err; or hoist the check above fget().

>+
>+	if (offset + size > i_size_read(inode))
>+		goto err;
>+
>+	filemap_invalidate_lock(inode->i_mapping);
>+
>+	start = offset >> PAGE_SHIFT;
>+	end = start + slot->npages;
>+
>+	if (!xa_empty(&gmem->bindings) &&
>+	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
>+		filemap_invalidate_unlock(inode->i_mapping);
>+		goto err;
>+	}
>+
>+	/*
>+	 * No synchronize_rcu() needed, any in-flight readers are guaranteed
>+	 * to see either a NULL file or this new file, no need for them to go
>+	 * away.
>+	 */
>+	rcu_assign_pointer(slot->gmem.file, file);
>+	slot->gmem.pgoff = start;
>+
>+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
>+	filemap_invalidate_unlock(inode->i_mapping);
>+
>+	/*
>+	 * Drop the reference to the file, even on success.  The file pins
>+	 * KVM, not the other way 'round.  Active bindings are invalidated if
>+	 * the file is closed before memslots are destroyed.
>+	 */
>+	fput(file);
>+	return 0;
>+
>+err:
>+	fput(file);
>+	return -EINVAL;

The cleanup, i.e., filemap_invalidate_unlock() and fput(), is common. So, I
think it may be slightly better to consolidate the common part, e.g.,

	int ret = -EINVAL;

	...

	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
	ret = 0;
unlock:
	filemap_invalidate_unlock(inode->i_mapping);
err:
	/*
	 * Drop the reference to the file, even on success.  The file pins
	 * KVM, not the other way 'round.  Active bindings are invalidated if
	 * the file is closed before memslots are destroyed.
	 */
	fput(file);
	return ret;