On 6/29/26 14:47, Xiang Mei wrote:
> With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> which an unprivileged user can surround with attacker-controlled data by
> spraying vmap allocations adjacent to a target stack (for example via
> XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> guarded vmalloc allocation is followed by a single unmapped guard page.
>
> A single guard page is not enough to contain the x86_64 ENTER
> instruction used as a one-instruction stack pivot. ENTER imm16, imm8
> builds a stack frame and lowers RSP by:
>
> imm16 + 8 * (L + 1), L = imm8 & 0x1f
>
> imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is
> in [0, 31], so the maximum displacement of a single ENTER is:
>
> 0xffff + 8 * 0x20 = 0x100ff bytes
This needs some more discussion of why this _specific_ instruction is so
important and why a good old 'add'. Peter asked about this on v1 and it
didn't make it into v2.
I think it boils down to ENTER doing a bunch of useful stuff setting up
a new frame in a single instruction. That single instruction is easier
to conjure up from another exploit or bad control flow than actually
setting up a stack frame.
But, really, if ENTER is so evil and nobody uses it, shouldn't we just
have an MSR bit somewhere to tell the CPU to #UD for it rather than
playing these stack games?
> That is more than enough to step off the current stack, across the
> one-page guard, and into the adjacent sprayed pages. When those pages
> contain a return sled feeding a ROP chain, reaching any ENTER gadget
> (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
> control-flow hijack into full ROP execution without any register control
> at the hijack site, making it a one-gadget-style primitive that
> significantly eases exploitation. The pivot happens after the control
> transfer, so it is not constrained by CFI (kCFI/FineIBT).
This all sounds super theoretical.
I don't think we should mess with any of this without there being some
sign that this is an actual, practical juicy exploit target.
> Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
> change for current architectures) and can be overridden per arch via
> asm/vmalloc.h, and set it to 0x11 on x86_64. This is deliberately scoped
> to x86_64: the 0x100ff bound is a property of the ENTER opcode, and ENTER
> is also a one-byte opcode (0xc8) that appears as abundant unintended
> gadgets. Other architectures (e.g. arm64) have no equivalent
> single-instruction, immediate-controlled pivot reachable as an unaligned
> unintended gadget, so they keep the one-page guard and pay no cost.
To even be considered, this series needs to be refactored properly.
Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
> The override is gated on CONFIG_X86_64 rather than applying to all of x86:
> VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are not in
> the vmalloc area and the technique does not apply there. 32-bit x86 also
> has a far smaller vmalloc window, where widening every guarded area by 16
> pages would needlessly pressure the address space.
Shouldn't you condition it on HAVE_ARCH_VMAP_STACK, not X86_64 directly?
> The guard pages are never populated, so there is no extra physical
> memory and no additional page-table population beyond the larger virtual
> span; the cost is virtual address space and vmap_area bookkeeping, which
> is negligible against the 64-bit vmalloc window. get_vm_area_size() is
> adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
> callers is unchanged.
Let's be thorough here, though, please. You're arguing that there's no
real cost to this. It's going to make the vmalloc() address space more
sparse and put pressure on the intermediate paging structure caches.
Whether that pressure matters is debatable.
But I do think you owe at least some rudimentary performance checks on this.
BTW, this is LLM-wordy. If you send another version of this, please work
on making it more concicse.
> On x86 this widens the guard for all guarded vmap areas, not only thread
> stacks. ret2enter targets the stack specifically, so a narrower
> alternative is to apply the wider guard only on the thread-stack
> allocation path via a dedicated VM_ flag; we kept the change in the
> common path as defense in depth for any vmalloc-adjacent pivot target,
> but are happy to scope it to stacks if maintainers prefer.
The simplest code thing for now is to just make it apply to all
vmalloc() allocations. That also theoretically has the largest impact,
but it's probably the best patch to start with.
> While widening the guard, also mark percpu vmap areas VM_NO_GUARD.
> pcpu_get_vm_areas() and pcpu_page_first_chunk() size each area exactly and
> reserve no guard, so get_vm_area_size() would subtract a guard that was
> never added and underflow if an area were smaller than the guard. This is
> a latent correctness fix only: on x86_64 percpu areas are megabyte-scale,
> far larger than the guard.
Honestly, I think this is just a sign that the code needs refactoring
rather than hacks.
If you go forward with this, I think vm_struct just needs a
area->guard_nr_pages. Then the internal users of the structure just set
area->size and the guard size. They don't have to fiddle with VM_NO_GUARD.
> +/*
> + * The x86 ENTER instruction can be used as a one-instruction stack pivot:
> + * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
> + * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
> + * [0, 31], so a single ENTER can lower RSP by at most
> + * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
> + * stack lives in the vmalloc area, where an unprivileged user can spray
> + * adjacent allocations; a single-page guard is too small to contain such a
> + * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
> + * span exceeding 0x100ff, so the pivot faults in the guard instead of
> + * landing in attacker-controlled memory.
> + *
> + * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so 32-bit
> + * kernel stacks are not in the vmalloc area and the technique does not
> apply.
> + * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
> + * widening would needlessly pressure the address space.
> + */
> +#ifdef CONFIG_X86_64
> +#define VMAP_GUARD_PAGES 0x11
> +#endif
That comment is way too big. What is it protecting? What does the number
come from? We don't need to see the gory details in the comment.
/*
* Protect against control flow hijacks to gadgets using the ENTER
* instruction. Those can jump a bit over 64k on the stack so make the
* guard 64k+4k.
*/
#ifdef CONFIG_VMAP_STACK
#define VMAP_GUARD_PAGES 0x11
#endif
Right? What else do you really need?
> #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>
> #ifdef CONFIG_X86_64
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 3b02c0c6b371..b8546e519deb 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -49,6 +49,18 @@ struct iov_iter; /* in uio.h */
> #define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT) /* 128 pages */
> #endif
>
> +/*
> + * Number of unmapped guard pages appended to each guarded vmalloc
> + * allocation. The default is a single page; an architecture may override
> + * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
> + * contain a worst-case single-instruction stack pivot into an adjacent,
> + * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
> + */
Heh, are you getting paid by the word here? These are way too verbose.
> +#ifndef VMAP_GUARD_PAGES
> +#define VMAP_GUARD_PAGES 1
> +#endif
> +#define VMAP_GUARD_SIZE (VMAP_GUARD_PAGES * PAGE_SIZE)
This could also be quite trivially expressed in Kconfig:
config VMAP_GUARD_PAGES
int
default 1
default ARCH_VMAP_GUARD_PAGES if ARCH_VMAP_GUARD_PAGES
> struct vm_struct {
> union {
> struct vm_struct *next; /* Early registration of vm_areas. */
> @@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long
> end, pgprot_t prot,
> static inline size_t get_vm_area_size(const struct vm_struct *area)
> {
> if (!(area->flags & VM_NO_GUARD))
> - /* return actual size without guard page */
> - return area->size - PAGE_SIZE;
> + /* return actual size without guard region */
> + return area->size - VMAP_GUARD_SIZE;
> else
> return area->size;
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b0676b8054ed..9f7262228be1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3243,7 +3243,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
> pcpu_fc_cpu_to_node_fn_t
> }
>
> /* allocate vm area, map the pages and copy static data */
> - vm.flags = VM_ALLOC;
> + vm.flags = VM_ALLOC | VM_NO_GUARD;
> vm.size = num_possible_cpus() * ai->unit_size;
> vm_area_register_early(&vm, PAGE_SIZE);
Yeah, I'd much rather see:
vm.size = num_possible_cpus() * ai->unit_size;
vm.guard = 0;
(or whatever we name the structure member) in cases like this.
So, yeah, this is a cute PoC hack. But it's gluing about 10 different
things into one patch instead of doing proper refactoring. Plus, I'm not
really even sure it's worth it in the first place.