On Mon, Jun 29, 2026 at 3:29 PM Dave Hansen <[email protected]> wrote:
>
> On 6/29/26 14:47, Xiang Mei wrote:
> > With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> > which an unprivileged user can surround with attacker-controlled data by
> > spraying vmap allocations adjacent to a target stack (for example via
> > XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> > guarded vmalloc allocation is followed by a single unmapped guard page.
> >
> > A single guard page is not enough to contain the x86_64 ENTER
> > instruction used as a one-instruction stack pivot. ENTER imm16, imm8
> > builds a stack frame and lowers RSP by:
> >
> > imm16 + 8 * (L + 1), L = imm8 & 0x1f
> >
> > imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is
> > in [0, 31], so the maximum displacement of a single ENTER is:
> >
> > 0xffff + 8 * 0x20 = 0x100ff bytes
>
> This needs some more discussion of why this _specific_ instruction is so
> important and why a good old 'add'. Peter asked about this on v1 and it
> didn't make it into v2.
>
Thanks, I see it; we can also add more background introduction to the
commit message so readers can understand why ENTER is so special.
> I think it boils down to ENTER doing a bunch of useful stuff setting up
> a new frame in a single instruction. That single instruction is easier
> to conjure up from another exploit or bad control flow than actually
> setting up a stack frame.
>
> But, really, if ENTER is so evil and nobody uses it, shouldn't we just
> have an MSR bit somewhere to tell the CPU to #UD for it rather than
> playing these stack games?
>
I totally agree, and we'd like to see such a control. If I understand
correctly, that's a hardware/architectural change, so for current CPUs
we still need a software mitigation to harden the kernel.
> > That is more than enough to step off the current stack, across the
> > one-page guard, and into the adjacent sprayed pages. When those pages
> > contain a return sled feeding a ROP chain, reaching any ENTER gadget
> > (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
> > control-flow hijack into full ROP execution without any register control
> > at the hijack site, making it a one-gadget-style primitive that
> > significantly eases exploitation. The pivot happens after the control
> > transfer, so it is not constrained by CFI (kCFI/FineIBT).
>
> This all sounds super theoretical.
>
> I don't think we should mess with any of this without there being some
> sign that this is an actual, practical juicy exploit target.
>
Yes, I am sorry to reuse some incorrect comments I copied from v1.
I'll remove the CFI-related content since we assume we already have
control flow hijacking.
Thanks for pointing it out.
> > Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
> > change for current architectures) and can be overridden per arch via
> > asm/vmalloc.h, and set it to 0x11 on x86_64. This is deliberately scoped
> > to x86_64: the 0x100ff bound is a property of the ENTER opcode, and ENTER
> > is also a one-byte opcode (0xc8) that appears as abundant unintended
> > gadgets. Other architectures (e.g. arm64) have no equivalent
> > single-instruction, immediate-controlled pivot reachable as an unaligned
> > unintended gadget, so they keep the one-page guard and pay no cost.
>
> To even be considered, this series needs to be refactored properly.
> Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
>
Good suggestion, I will do it in v3:
1/3 - introduce VMAP_GUARD_PAGES
2/3 - mark percpu vmap areas VM_NO_GUARD
3/3 - set VMAP_GUARD_PAGES to 0x11 on x86_64
> > The override is gated on CONFIG_X86_64 rather than applying to all of x86:
> > VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are not in
> > the vmalloc area and the technique does not apply there. 32-bit x86 also
> > has a far smaller vmalloc window, where widening every guarded area by 16
> > pages would needlessly pressure the address space.
>
> Shouldn't you condition it on HAVE_ARCH_VMAP_STACK, not X86_64 directly?
>
Yup, that's better; we'll do the following in v3:
```
#if defined(CONFIG_X86_64) && defined(CONFIG_VMAP_STACK)
#define VMAP_GUARD_PAGES 0x11
#endif
```
> > The guard pages are never populated, so there is no extra physical
> > memory and no additional page-table population beyond the larger virtual
> > span; the cost is virtual address space and vmap_area bookkeeping, which
> > is negligible against the 64-bit vmalloc window. get_vm_area_size() is
> > adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
> > callers is unchanged.
>
> Let's be thorough here, though, please. You're arguing that there's no
> real cost to this. It's going to make the vmalloc() address space more
> sparse and put pressure on the intermediate paging structure caches.
> Whether that pressure matters is debatable.
>
> But I do think you owe at least some rudimentary performance checks on this.
>
I'll do some cost tests before sending v3 and the correct statement.
> BTW, this is LLM-wordy. If you send another version of this, please work
> on making it more concicse.
>
The v3 comments would be concise.
> > On x86 this widens the guard for all guarded vmap areas, not only thread
> > stacks. ret2enter targets the stack specifically, so a narrower
> > alternative is to apply the wider guard only on the thread-stack
> > allocation path via a dedicated VM_ flag; we kept the change in the
> > common path as defense in depth for any vmalloc-adjacent pivot target,
> > but are happy to scope it to stacks if maintainers prefer.
>
> The simplest code thing for now is to just make it apply to all
> vmalloc() allocations. That also theoretically has the largest impact,
> but it's probably the best patch to start with.
>
> > While widening the guard, also mark percpu vmap areas VM_NO_GUARD.
> > pcpu_get_vm_areas() and pcpu_page_first_chunk() size each area exactly and
> > reserve no guard, so get_vm_area_size() would subtract a guard that was
> > never added and underflow if an area were smaller than the guard. This is
> > a latent correctness fix only: on x86_64 percpu areas are megabyte-scale,
> > far larger than the guard.
>
> Honestly, I think this is just a sign that the code needs refactoring
> rather than hacks.
>
> If you go forward with this, I think vm_struct just needs a
> area->guard_nr_pages. Then the internal users of the structure just set
> area->size and the guard size. They don't have to fiddle with VM_NO_GUARD.
>
Agree, the problem exists before our patch (get_vm_area_size() is
already one page less).
> > +/*
> > + * The x86 ENTER instruction can be used as a one-instruction stack pivot:
> > + * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
> > + * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
> > + * [0, 31], so a single ENTER can lower RSP by at most
> > + * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
> > + * stack lives in the vmalloc area, where an unprivileged user can spray
> > + * adjacent allocations; a single-page guard is too small to contain such a
> > + * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
> > + * span exceeding 0x100ff, so the pivot faults in the guard instead of
> > + * landing in attacker-controlled memory.
> > + *
> > + * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so
> > 32-bit
> > + * kernel stacks are not in the vmalloc area and the technique does not
> > apply.
> > + * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
> > + * widening would needlessly pressure the address space.
> > + */
> > +#ifdef CONFIG_X86_64
> > +#define VMAP_GUARD_PAGES 0x11
> > +#endif
>
> That comment is way too big. What is it protecting? What does the number
> come from? We don't need to see the gory details in the comment.
>
> /*
> * Protect against control flow hijacks to gadgets using the ENTER
> * instruction. Those can jump a bit over 64k on the stack so make the
> * guard 64k+4k.
> */
> #ifdef CONFIG_VMAP_STACK
> #define VMAP_GUARD_PAGES 0x11
> #endif
>
> Right? What else do you really need?
>
Got it, simple comments and thanks for the example.
> > #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> >
> > #ifdef CONFIG_X86_64
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 3b02c0c6b371..b8546e519deb 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -49,6 +49,18 @@ struct iov_iter; /* in uio.h */
> > #define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT) /* 128 pages */
> > #endif
> >
> > +/*
> > + * Number of unmapped guard pages appended to each guarded vmalloc
> > + * allocation. The default is a single page; an architecture may override
> > + * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
> > + * contain a worst-case single-instruction stack pivot into an adjacent,
> > + * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
> > + */
>
> Heh, are you getting paid by the word here? These are way too verbose.
>
I'll clean them in v3.
> > +#ifndef VMAP_GUARD_PAGES
> > +#define VMAP_GUARD_PAGES 1
> > +#endif
> > +#define VMAP_GUARD_SIZE (VMAP_GUARD_PAGES * PAGE_SIZE)
>
> This could also be quite trivially expressed in Kconfig:
>
> config VMAP_GUARD_PAGES
> int
> default 1
> default ARCH_VMAP_GUARD_PAGES if ARCH_VMAP_GUARD_PAGES
>
Good suggestion, it'll be in v3.
> > struct vm_struct {
> > union {
> > struct vm_struct *next; /* Early registration of vm_areas.
> > */
> > @@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long
> > end, pgprot_t prot,
> > static inline size_t get_vm_area_size(const struct vm_struct *area)
> > {
> > if (!(area->flags & VM_NO_GUARD))
> > - /* return actual size without guard page */
> > - return area->size - PAGE_SIZE;
> > + /* return actual size without guard region */
> > + return area->size - VMAP_GUARD_SIZE;
> > else
> > return area->size;
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index b0676b8054ed..9f7262228be1 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -3243,7 +3243,7 @@ int __init pcpu_page_first_chunk(size_t
> > reserved_size, pcpu_fc_cpu_to_node_fn_t
> > }
> >
> > /* allocate vm area, map the pages and copy static data */
> > - vm.flags = VM_ALLOC;
> > + vm.flags = VM_ALLOC | VM_NO_GUARD;
> > vm.size = num_possible_cpus() * ai->unit_size;
> > vm_area_register_early(&vm, PAGE_SIZE);
>
> Yeah, I'd much rather see:
>
> vm.size = num_possible_cpus() * ai->unit_size;
> vm.guard = 0;
>
> (or whatever we name the structure member) in cases like this.
>
> So, yeah, this is a cute PoC hack. But it's gluing about 10 different
> things into one patch instead of doing proper refactoring. Plus, I'm not
> really even sure it's worth it in the first place.
Yeah, I'll separate them into a series of patches. I believe it's a
worthy patch since it's a huge benefit to have a generic
one-gadget-style gadget. We have tested this one-gadget-style stack
pivoting on real CVEs and got the exploitation working without
additional changes.
Thanks for your review. I'll do the benchmarking and deliver v3.
Xiang