[PATCH RFC v2 01/29] mm: asi: Make some utility functions noinstr compatible
Some existing utility functions would need to be called from a noinstr
context in the later patches. So mark these as either noinstr or
__always_inline.

An earlier version of this by Junaid had a macro that was intended to
tell the compiler "either inline this function, or call it in the
noinstr section", which basically boiled down to:

  #define inline_or_noinstr noinline __section(".noinstr.text")

Unfortunately Thomas pointed out this will prevent the function from
being inlined at call sites in .text.

So far I haven't been able[1] to find a formulation that lets us:

1. avoid calls from .noinstr.text -> .text,
2. while also letting the compiler freely decide what to inline.

1 is a functional requirement so here I'm just giving up on 2. Existing
callsites of this code are just forced inline. For the incoming code
that needs to call it from noinstr, they will be out-of-line calls.

[1] https://lore.kernel.org/lkml/ca+i-1c1z35m8wa_4awmq7--c1ogjnolgtkn4+td5gkg7qqa...@mail.gmail.com/

Checkpatch-args: --ignore=COMMIT_LOG_LONG_LINE
Signed-off-by: Brendan Jackman
---
 arch/x86/include/asm/processor.h     |  2 +-
 arch/x86/include/asm/special_insns.h |  8 ++++----
 arch/x86/include/asm/tlbflush.h      |  3 +++
 arch/x86/mm/tlb.c                    | 13 +++++++++----
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 4a686f0e5dbf6d906ed38276148b186e920927b3..1a1b7ea5d7d32a47d783d9d62cd2a53672addd6f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -220,7 +220,7 @@ void print_cpu_msr(struct cpuinfo_x86 *);
 /*
  * Friendlier CR3 helpers.
  */
-static inline unsigned long read_cr3_pa(void)
+static __always_inline unsigned long read_cr3_pa(void)
 {
 	return __read_cr3() & CR3_ADDR_MASK;
 }

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d52e5c8f513e188015a45e9eeaeb2..6e103358966f6f1333aa07be97aec5f8af794120 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -42,14 +42,14 @@ static __always_inline void native_write_cr2(unsigned long val)
 	asm volatile("mov %0,%%cr2": : "r" (val) : "memory");
 }
 
-static inline unsigned long __native_read_cr3(void)
+static __always_inline unsigned long __native_read_cr3(void)
 {
 	unsigned long val;
 	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
 	return val;
 }
 
-static inline void native_write_cr3(unsigned long val)
+static __always_inline void native_write_cr3(unsigned long val)
 {
 	asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
 }
@@ -153,12 +153,12 @@ static __always_inline void write_cr2(unsigned long x)
  * Careful! CR3 contains more than just an address.  You probably want
  * read_cr3_pa() instead.
  */
-static inline unsigned long __read_cr3(void)
+static __always_inline unsigned long __read_cr3(void)
 {
 	return __native_read_cr3();
 }
 
-static inline void write_cr3(unsigned long x)
+static __always_inline void write_cr3(unsigned long x)
 {
 	native_write_cr3(x);
 }

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b800a0a138bcbf548dde9d72993105..c884174a44e119a3c027c44ada6c5cdba14d1282 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -423,4 +423,7 @@ static inline void __native_tlb_flush_global(unsigned long cr4)
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
 	native_write_cr4(cr4);
 }
+
+unsigned long build_cr3_noinstr(pgd_t *pgd, u16 asid, unsigned long lam);
+
 #endif /* _ASM_X86_TLBFLUSH_H */

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8a5b9fa4bd492356898ec8870938..f0428e5e1f1947903ee87c4c6444844ee11b45c3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -108,7 +108,7 @@
 /*
  * Given @asid, compute kPCID
  */
-static inline u16 kern_pcid(u16 asid)
+static __always_inline u16 kern_pcid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 
@@ -153,9 +153,9 @@ static inline u16 user_pcid(u16 asid)
 	return ret;
 }
 
-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
+static __always_inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
 {
-	unsigned long cr3 = __sme_pa(pgd) | lam;
+	unsigned long cr3 = __sme_pa_nodebug(pgd) | lam;
 
 	if (static_cpu_has(X86_FEATURE_PCID)) {
 		cr3 |= kern_pcid(asid);
@@ -166,6 +166,11 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
 	return cr3;
 }
 
+noinstr unsigned long build_cr3_noinstr(pgd_t *pgd, u16 asid, unsigned long lam)
+{
+	return build_cr3(pgd, asid, lam);
+}
+
 static inline unsigned long build_cr3_nofl
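To spell out the pattern this patch settles on (a minimal sketch, not
code from the tree; helper() and SOME_MASK are hypothetical names):

	/* Forced inline: existing .text callers pay no call overhead. */
	static __always_inline unsigned long helper(unsigned long x)
	{
		return x & SOME_MASK;
	}

	/*
	 * Out-of-line entry point living in .noinstr.text, safe to call
	 * from other noinstr code without creating a .noinstr.text ->
	 * .text call.
	 */
	noinstr unsigned long helper_noinstr(unsigned long x)
	{
		return helper(x);
	}

This mirrors the build_cr3()/build_cr3_noinstr() pair above.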
[PATCH RFC v2 03/29] mm: asi: Introduce ASI core API
Introduce core API for Address Space Isolation (ASI). Kernel address
space isolation provides the ability to run some kernel code with a
restricted kernel address space.

There can be multiple classes of such restricted kernel address spaces
(e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index.
The ASI class can register some hooks to be called when
entering/exiting the restricted address space.

Currently, there is a fixed maximum number of ASI classes supported. In
addition, each process can have at most one restricted address space
from each ASI class. Neither of these is an inherent limitation; they
are merely simplifying assumptions for the time being.

To keep things simpler for the time being, we disallow context switches
within the restricted address space. In the future, we would be able to
relax this limitation for the case of context switches to different
threads within the same process (or to the idle thread and back).

Note that this doesn't really support protecting sibling VM guests
within the same VMM process from one another. From first principles it
seems unlikely that anyone who cares about VM isolation would do that,
but there could be a use-case to think about. In that case something
like the OTHER_MM logic might be needed, but specific to intra-process
guest separation.

[0]: https://lore.kernel.org/kvm/1562855138-19507-1-git-send-email-alexandre.char...@oracle.com

Notes about RFC-quality implementation details:

- Ignoring checkpatch.pl AVOID_BUG.
- The dynamic registration of classes might be pointless complexity.
  This was kept from RFCv1 without much thought.
- The other-mm logic is also perhaps overly complex, suggestions are
  welcome for how best to tackle this (or we could just forget about it
  for the moment, and rely on asi_exit() happening in process switch).
- The taint flag definitions would probably be clearer with an enum or
  something.

Checkpatch-args: --ignore=AVOID_BUG,COMMIT_LOG_LONG_LINE,EXPORT_SYMBOL
Co-developed-by: Ofir Weisse
Signed-off-by: Ofir Weisse
Co-developed-by: Junaid Shahid
Signed-off-by: Junaid Shahid
Signed-off-by: Brendan Jackman
---
 arch/x86/include/asm/asi.h       | 208 +++++++++++++++++++
 arch/x86/include/asm/processor.h |   8 +
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/asi.c                | 350 +++++++++++++++++++++++++++++++
 arch/x86/mm/init.c               |   3 +-
 arch/x86/mm/tlb.c                |   1 +
 include/asm-generic/asi.h        |  67 +++++++
 include/linux/mm_types.h         |   7 +
 kernel/fork.c                    |   3 +
 kernel/sched/core.c              |   9 +
 mm/init-mm.c                     |   4 +
 11 files changed, 660 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 0000000000000000000000000000000000000000..7cc635b6653a3970ba9dbfdc9c828a470e27bd44
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,208 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_ASI_H
+#define _ASM_X86_ASI_H
+
+#include
+
+#include
+
+#include
+#include
+#include
+
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+
+/*
+ * Overview of API usage by ASI clients:
+ *
+ * Setup: First call asi_init() to create a domain. At present only one domain
+ * can be created per mm per class, but it's safe to asi_init() this domain
+ * multiple times. For each asi_init() call you must call asi_destroy() AFTER
+ * you are certain all CPUs have exited the restricted address space (by
+ * calling asi_exit()).
+ *
+ * Runtime usage:
+ *
+ * 1. Call asi_enter() to switch to the restricted address space. This can't be
+ *    from an interrupt or exception handler and preemption must be disabled.
+ *
+ * 2. Execute untrusted code.
+ *
+ * 3. Call asi_relax() to inform the ASI subsystem that untrusted code execution
+ *    is finished. This doesn't cause any address space change. This can't be
+ *    from an interrupt or exception handler and preemption must be disabled.
+ *
+ * 4. Either:
+ *
+ *    a. Go back to 1.
+ *
+ *    b. Call asi_exit() before returning to userspace. This immediately
+ *       switches to the unrestricted address space.
+ *
+ * The region between 1 and 3 is called the "ASI critical section". During the
+ * critical section, it is a bug to access any sensitive data, and you mustn't
+ * sleep.
+ *
+ * The restriction on sleeping is not really a fundamental property of ASI.
+ * However for performance reasons it's important that the critical section is
+ * absolutely as short as possible. So the ability to do sleepy things like
+ * taking mutexes oughtn't to confer any convenience on API users.
+ *
+ * Similarly to the issue of sleeping, the need to asi_exit in case 4b is not a
+ * fundamental property of the system but a limitation of the current
+ * implementation. With further work it is possible to context switch f
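To make the flow above concrete, here is how a hypothetical client
could string these calls together (a sketch only: the exact asi_init()
signature, ASI_CLASS_FOO and run_untrusted_thing() are assumptions, and
error handling is elided):

	struct asi *asi;
	int err;

	/* Setup: one domain per mm per class (class name is made up). */
	err = asi_init(current->mm, ASI_CLASS_FOO, &asi);
	if (err)
		return err;

	preempt_disable();
	asi_enter(asi);         /* 1. enter the restricted address space */
	run_untrusted_thing();  /* 2. hypothetical untrusted work */
	asi_relax();            /* 3. end of the ASI critical section */
	preempt_enable();

	asi_exit();             /* 4b. before returning to userspace */

	/* Teardown, once no CPU can still be in the restricted space: */
	asi_destroy(asi);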
[PATCH RFC v2 05/29] mm: asi: ASI support in interrupts/exceptions
Add support for potentially switching address spaces from within interrupts/exceptions/NMIs etc. An interrupt does not automatically switch to the unrestricted address space. It can switch if needed to access some memory not available in the restricted address space, using the normal asi_exit call. On return from the outermost interrupt, if the target address space was the restricted address space (e.g. we were in the critical code path between ASI Enter and VM Enter), the restricted address space will be automatically restored. Otherwise, execution will continue in the unrestricted address space until the next explicit ASI Enter. In order to keep track of when to restore the restricted address space, an interrupt/exception nesting depth counter is maintained per-task. An alternative implementation without needing this counter is also possible, but the counter unlocks an additional nice-to-have benefit by allowing detection of whether or not we are currently executing inside an exception context, which would be useful in a later patch. Note that for KVM on SVM, this is not actually necessary as NMIs are in fact maskable via CLGI. It's not clear to me if VMX has something equivalent but we will need this infrastructure in place for userspace support anyway. RFC: Once userspace ASI is implemented, this idtentry integration looks a bit heavy-handed. For example, we don't need this logic for INT 80 emulation, so having it in DEFINE_IDTENTRY_RAW is confusing. It could lead to a bug if the order of interrupter counter modifications and ASI transition logic gets flipped around somehow. checkpatch.pl SPACING is false positive. AVOID_BUG ignored for RFC. Checkpatch-args: --ignore=SPACING,AVOID_BUG Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/x86/include/asm/asi.h | 68 ++-- arch/x86/include/asm/idtentry.h | 50 - arch/x86/include/asm/processor.h | 5 +++ arch/x86/kernel/process.c| 2 ++ arch/x86/kernel/traps.c | 22 + arch/x86/mm/asi.c| 7 - include/asm-generic/asi.h| 10 ++ 7 files changed, 153 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index b9671ef2dd3278adceed18507fd260e21954d574..9a9a139518289fc65f26a4d1cd311aa52cc5357f 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -157,6 +157,11 @@ void asi_relax(void); /* Immediately exit the restricted address space if in it */ void asi_exit(void); +static inline void asi_init_thread_state(struct thread_struct *thread) +{ + thread->asi_state.intr_nest_depth = 0; +} + /* The target is the domain we'll enter when returning to process context. */ static __always_inline struct asi *asi_get_target(struct task_struct *p) { @@ -197,9 +202,10 @@ static __always_inline bool asi_is_relaxed(void) /* * Is the current task in the critical section? * - * This is just the inverse of !asi_is_relaxed(). We have both functions in order to - * help write intuitive client code. In particular, asi_is_tense returns false - * when ASI is disabled, which is judged to make user code more obvious. + * This is just the inverse of !asi_is_relaxed(). We have both functions in + * order to help write intuitive client code. In particular, asi_is_tense + * returns false when ASI is disabled, which is judged to make user code more + * obvious. */ static __always_inline bool asi_is_tense(void) { @@ -211,6 +217,62 @@ static __always_inline pgd_t *asi_pgd(struct asi *asi) return asi ? 
asi->pgd : NULL; } +static __always_inline void asi_intr_enter(void) +{ + if (static_asi_enabled() && asi_is_tense()) { + current->thread.asi_state.intr_nest_depth++; + barrier(); + } +} + +void __asi_enter(void); + +static __always_inline void asi_intr_exit(void) +{ + if (static_asi_enabled() && asi_is_tense()) { + /* +* If an access to sensitive memory got reordered after the +* decrement, the #PF handler for that access would see a value +* of 0 for the counter and re-__asi_enter before returning to +* the faulting access, triggering an infinite PF loop. +*/ + barrier(); + + if (--current->thread.asi_state.intr_nest_depth == 0) { + /* +* If the decrement got reordered after __asi_enter, an +* interrupt that came between __asi_enter and the +* decrement would always see a nonzero value for the +* counter so it wouldn't call __asi_enter again and we +* would return to process context in the wrong address +* space. +*/ +
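To make the counter's behaviour concrete, consider this worked timeline
(illustrative, not code from the patch):

	/*
	 * asi_enter()                    depth == 0, restricted CR3
	 *   #IRQ  -> asi_intr_enter()    depth 0 -> 1 (may asi_exit() later)
	 *     #NMI -> asi_intr_enter()   depth 1 -> 2
	 *     #NMI <- asi_intr_exit()    depth 2 -> 1, no address space change
	 *   #IRQ  <- asi_intr_exit()     depth 1 -> 0, __asi_enter() restores
	 *                                the restricted address space
	 */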
[PATCH RFC v2 00/29] Address Space Isolation (ASI)
advantage of this? (I think yes). How does this influence the design of
the bugs.c kernel arguments?

- Suggestions on how to map file pages into processes that can read
  them, while minimizing TLB management pain.

Finally, a more extensive branch can be found at [3]. It has some tests
and some of the lower-hanging fruit for optimising performance of KVM
guests.

[0] RFC v1: https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a4...@google.com/
[1] LPC session: https://lpc.events/event/18/contributions/1761/
[2] Junaid's RFC: https://lore.kernel.org/all/20220223052223.1202152-1-juna...@google.com/
[3] GitHub branch: https://github.com/googleprodkernel/linux-kvm/tree/asi-rfcv2-preview

Signed-off-by: Brendan Jackman
Cc: Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
    Andy Lutomirski, Peter Zijlstra, Sean Christopherson, Paolo Bonzini,
    Alexandre Chartre, Liran Alon, Jan Setje-Eilers, Catalin Marinas,
    Will Deacon, Mark Rutland, Andrew Morton, Mel Gorman,
    Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Michal Hocko,
    Khalid Aziz, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Valentin Schneider, Paul Turner, Reiji Watanabe,
    Junaid Shahid, Ofir Weisse, Yosry Ahmed, Patrick Bellasi, KP Singh,
    Alexandra Sandulescu, Matteo Rizzo, Jann Horn, k...@vger.kernel.org,
    Brendan Jackman, Dennis Zhou

---
Changes in v2:
- Added support for sandboxing userspace processes.
- Link to v1: https://lore.kernel.org/r/20240712-asi-rfc-24-v1-0-144b319a4...@google.com

---
Brendan Jackman (21):
      mm: asi: Make some utility functions noinstr compatible
      x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
      mm: asi: Introduce ASI core API
      mm: asi: Add infrastructure for boot-time enablement
      mm: asi: ASI support in interrupts/exceptions
      mm: asi: Avoid warning from NMI userspace accesses in ASI context
      mm: Add __PAGEFLAG_FALSE
      mm: asi: Map non-user buddy allocations as nonsensitive
      [TEMP WORKAROUND] mm: asi: Workaround missing partial-unmap support
      mm: asi: Map kernel text and static data as nonsensitive
      mm: asi: Map vmalloc/vmap data as nonsensitive
      mm: asi: Stabilize CR3 in switch_mm_irqs_off()
      mm: asi: Make TLB flushing correct under ASI
      KVM: x86: asi: Restricted address space for VM execution
      mm: asi: exit ASI before accessing CR3 from C code where appropriate
      mm: asi: Add infrastructure for mapping userspace addresses
      mm: asi: Restricted execution for bare-metal processes
      x86: Create library for flushing L1D for L1TF
      mm: asi: Add some mitigations on address space transitions
      x86/pti: Disable PTI when ASI is on
      mm: asi: Stop ignoring asi=on cmdline flag

Junaid Shahid (4):
      mm: asi: Make __get_current_cr3_fast() ASI-aware
      mm: asi: ASI page table allocation functions
      mm: asi: Functions to map/unmap a memory range into ASI page tables
      mm: asi: Add basic infrastructure for global non-sensitive mappings

Ofir Weisse (1):
      mm: asi: asi_exit() on PF, skip handling if address is accessible

Reiji Watanabe (1):
      mm: asi: Map dynamic percpu memory as nonsensitive

Yosry Ahmed (2):
      mm: asi: Use separate PCIDs for restricted address spaces
      mm: asi: exit ASI before suspend-like operations

 arch/alpha/include/asm/Kbuild            |   1 +
 arch/arc/include/asm/Kbuild              |   1 +
 arch/arm/include/asm/Kbuild              |   1 +
 arch/arm64/include/asm/Kbuild            |   1 +
 arch/csky/include/asm/Kbuild             |   1 +
 arch/hexagon/include/asm/Kbuild          |   1 +
 arch/loongarch/include/asm/Kbuild        |   3 +
 arch/m68k/include/asm/Kbuild             |   1 +
 arch/microblaze/include/asm/Kbuild       |   1 +
 arch/mips/include/asm/Kbuild             |   1 +
 arch/nios2/include/asm/Kbuild            |   1 +
 arch/openrisc/include/asm/Kbuild         |   1 +
 arch/parisc/include/asm/Kbuild           |   1 +
 arch/powerpc/include/asm/Kbuild          |   1 +
 arch/riscv/include/asm/Kbuild            |   1 +
 arch/s390/include/asm/Kbuild             |   1 +
 arch/sh/include/asm/Kbuild               |   1 +
 arch/sparc/include/asm/Kbuild            |   1 +
 arch/um/include/asm/Kbuild               |   2 +-
 arch/x86/Kconfig                         |  27 +
 arch/x86/boot/compressed/ident_map_64.c  |  10 +
 arch/x86/boot/compressed/pgtable_64.c    |  11 +
 arch/x86/include/asm/asi.h               | 306 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/a
[PATCH RFC v2 06/29] mm: asi: Use separate PCIDs for restricted address spaces
From: Yosry Ahmed Each restricted address space is assigned a separate PCID. Since currently only one ASI instance per-class exists for a given process, the PCID is just derived from the class index. This commit only sets the appropriate PCID when switching CR3, but does not actually use the NOFLUSH bit. That will be done by later patches. Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid Signed-off-by: Yosry Ahmed Signed-off-by: Brendan Jackman --- arch/x86/include/asm/asi.h | 4 +-- arch/x86/include/asm/processor-flags.h | 24 + arch/x86/include/asm/tlbflush.h| 3 +++ arch/x86/mm/asi.c | 10 +++ arch/x86/mm/tlb.c | 49 +++--- include/asm-generic/asi.h | 2 ++ 6 files changed, 81 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 9a9a139518289fc65f26a4d1cd311aa52cc5357f..a55e73f1b2bc84c41b9ab25f642a4d5f1aa6ba90 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -4,13 +4,13 @@ #include -#include - #include #include #include #include +#include + #ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION /* diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h index e5f204b9b33dfaa92ed1b05faa6b604e50d5f2f3..42c5acb67c2d2a6b03deb548fe3dd088baa88842 100644 --- a/arch/x86/include/asm/processor-flags.h +++ b/arch/x86/include/asm/processor-flags.h @@ -55,4 +55,28 @@ # define X86_CR3_PTI_PCID_USER_BIT 11 #endif +/* + * An ASI identifier is included in the higher bits of PCID to use a different + * PCID for each restricted address space, different from the PCID of the + * unrestricted address space (see asi_pcid()). We use the bits directly after + * the bit used by PTI (if any). + */ +#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION + +#define X86_CR3_ASI_PCID_BITS 2 + +/* Use the highest available PCID bits after the PTI bit (if any) */ +#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION +#define X86_CR3_ASI_PCID_END_BIT (X86_CR3_PTI_PCID_USER_BIT - 1) +#else +#define X86_CR3_ASI_PCID_END_BIT (X86_CR3_PCID_BITS - 1) +#endif + +#define X86_CR3_ASI_PCID_BITS_SHIFT (X86_CR3_ASI_PCID_END_BIT - X86_CR3_ASI_PCID_BITS + 1) +#define X86_CR3_ASI_PCID_MASK (((1UL << X86_CR3_ASI_PCID_BITS) - 1) << X86_CR3_ASI_PCID_BITS_SHIFT) + +#else +#define X86_CR3_ASI_PCID_BITS 0 +#endif + #endif /* _ASM_X86_PROCESSOR_FLAGS_H */ diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index c884174a44e119a3c027c44ada6c5cdba14d1282..f167feb5ebdfc7faba26b8b18ac65888cd9b0494 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -425,5 +425,8 @@ static inline void __native_tlb_flush_global(unsigned long cr4) } unsigned long build_cr3_noinstr(pgd_t *pgd, u16 asid, unsigned long lam); +unsigned long build_cr3_pcid_noinstr(pgd_t *pgd, u16 pcid, unsigned long lam, bool noflush); + +u16 asi_pcid(struct asi *asi, u16 asid); #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 054315d566c082c0925a00ce3a0877624c8b9957..8d060c633be68b508847e2c1c111761df1da92af 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -238,6 +238,7 @@ static __always_inline void maybe_flush_data(struct asi *next_asi) noinstr void __asi_enter(void) { u64 asi_cr3; + u16 pcid; struct asi *target = asi_get_target(current); /* @@ -266,9 +267,8 @@ noinstr void __asi_enter(void) this_cpu_write(curr_asi, target); maybe_flush_control(target); - asi_cr3 = build_cr3_noinstr(target->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid), - tlbstate_lam_cr3_mask()); + pcid 
= asi_pcid(target, this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + asi_cr3 = build_cr3_pcid_noinstr(target->pgd, pcid, tlbstate_lam_cr3_mask(), false); write_cr3(asi_cr3); maybe_flush_data(target); @@ -335,8 +335,8 @@ noinstr void asi_exit(void) unrestricted_cr3 = build_cr3_noinstr(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid), - tlbstate_lam_cr3_mask()); + this_cpu_read(cpu_tlbstate.loaded_mm_asid), +tlbstate_lam_cr3_mask()); /* Tainting first makes reentrancy easier to reason about. */ this_cpu_or(asi_taints, ASI_TAINT_KERNEL_DATA); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 7c2309996d1d5a7cac23bd122f7b56a869d67d6a..2601beed83aef182d88800c09d70e4c5e95e7ed0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -13,6 +13,7 @@ #include #include +#include
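For intuition, the composition asi_pcid() is expected to perform looks
roughly like this (a sketch under assumed details: the +1 class offset
and the class_id field are not confirmed by this excerpt; the real
asi_pcid() in the tlb.c hunk is the authority):

	static inline u16 sketch_asi_pcid(struct asi *asi, u16 asid)
	{
		/*
		 * Restricted-space PCID = kernel PCID for this ASID, plus a
		 * nonzero ASI class tag in the X86_CR3_ASI_PCID_MASK bits, so
		 * it never collides with the unrestricted PCID.
		 */
		return kern_pcid(asid) |
		       ((asi->class_id + 1) << X86_CR3_ASI_PCID_BITS_SHIFT);
	}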
[PATCH RFC v2 07/29] mm: asi: Make __get_current_cr3_fast() ASI-aware
From: Junaid Shahid When ASI is active, __get_current_cr3_fast() adjusts the returned CR3 value accordingly to reflect the actual ASI CR3. Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/x86/mm/tlb.c | 37 +++-- 1 file changed, 31 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 2601beed83aef182d88800c09d70e4c5e95e7ed0..b2a13fdab0c6454c1d9d4e3338801f3402da4191 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include "mm_internal.h" @@ -197,8 +198,8 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid, return build_cr3(pgd, asid, lam) | CR3_NOFLUSH; } -noinstr unsigned long build_cr3_pcid_noinstr(pgd_t *pgd, u16 pcid, -unsigned long lam, bool noflush) +static __always_inline unsigned long build_cr3_pcid(pgd_t *pgd, u16 pcid, + unsigned long lam, bool noflush) { u64 noflush_bit = 0; @@ -210,6 +211,12 @@ noinstr unsigned long build_cr3_pcid_noinstr(pgd_t *pgd, u16 pcid, return __build_cr3(pgd, pcid, lam) | noflush_bit; } +noinstr unsigned long build_cr3_pcid_noinstr(pgd_t *pgd, u16 pcid, +unsigned long lam, bool noflush) +{ + return build_cr3_pcid(pgd, pcid, lam, noflush); +} + /* * We get here when we do something requiring a TLB invalidation * but could not go invalidate all of the contexts. We do the @@ -1133,14 +1140,32 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) */ noinstr unsigned long __get_current_cr3_fast(void) { - unsigned long cr3 = - build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd, - this_cpu_read(cpu_tlbstate.loaded_mm_asid), - tlbstate_lam_cr3_mask()); + unsigned long cr3; + pgd_t *pgd; + u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + struct asi *asi = asi_get_current(); + u16 pcid; + + if (asi) { + pgd = asi_pgd(asi); + pcid = asi_pcid(asi, asid); + } else { + pgd = this_cpu_read(cpu_tlbstate.loaded_mm)->pgd; + pcid = kern_pcid(asid); + } + + cr3 = build_cr3_pcid(pgd, pcid, tlbstate_lam_cr3_mask(), false); /* For now, be very restrictive about when this can be called. */ VM_WARN_ON(in_nmi() || preemptible()); + /* +* Outside of the ASI critical section, an ASI-restricted CR3 is +* unstable because an interrupt (including an inner interrupt, if we're +* already in one) could cause a persistent asi_exit. +*/ + VM_WARN_ON_ONCE(asi && asi_in_critical_section()); + VM_BUG_ON(cr3 != __read_cr3()); return cr3; } -- 2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 10/29] mm: asi: asi_exit() on PF, skip handling if address is accessible
From: Ofir Weisse On a page-fault - do asi_exit(). Then check if now after the exit the address is accessible. We do this by refactoring spurious_kernel_fault() into two parts: 1. Verify that the error code value is something that could arise from a lazy TLB update. 2. Walk the page table and verify permissions, which is now called is_address_accessible(). We also define PTE_PRESENT() and PMD_PRESENT() which are suitable for checking userspace pages. For the sake of spurious faults, pte_present() and pmd_present() are only good for kernelspace pages. This is because these macros might return true even if the present bit is 0 (only relevant for userspace). checkpatch.pl VSPRINTF_SPECIFIER_PX - it's in a WARN that only fires in a debug build of the kernel when we hit a disastrous bug, seems OK to leak addresses. RFC note: A separate refactoring/prep commit should be split out of this patch. Checkpatch-args: --ignore=VSPRINTF_SPECIFIER_PX Signed-off-by: Ofir Weisse Signed-off-by: Brendan Jackman --- arch/x86/mm/fault.c | 118 +--- 1 file changed, 103 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e6c469b323ccb748de22adc7d9f0a16dd195edad..ee8f5417174e2956391d538f41e2475553ca4972 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -948,7 +948,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address); } -static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte) +static __always_inline int kernel_protection_ok(unsigned long error_code, pte_t *pte) { if ((error_code & X86_PF_WRITE) && !pte_write(*pte)) return 0; @@ -959,6 +959,8 @@ static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte) return 1; } +static int kernel_access_ok(unsigned long error_code, unsigned long address, pgd_t *pgd); + /* * Handle a spurious fault caused by a stale TLB entry. * @@ -984,11 +986,6 @@ static noinline int spurious_kernel_fault(unsigned long error_code, unsigned long address) { pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; - pte_t *pte; - int ret; /* * Only writes to RO or instruction fetches from NX may cause @@ -1004,6 +1001,50 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address) return 0; pgd = init_mm.pgd + pgd_index(address); + return kernel_access_ok(error_code, address, pgd); +} +NOKPROBE_SYMBOL(spurious_kernel_fault); + +/* + * For kernel addresses, pte_present and pmd_present are sufficient for + * is_address_accessible. For user addresses these functions will return true + * even though the pte is not actually accessible by hardware (i.e _PAGE_PRESENT + * is not set). This happens in cases where the pages are physically present in + * memory, but they are not made accessible to hardware as they need software + * handling first: + * + * - ptes/pmds with _PAGE_PROTNONE need autonuma balancing (see pte_protnone(), + * change_prot_numa(), and do_numa_page()). + * + * - pmds with _PAGE_PSE & !_PAGE_PRESENT are undergoing splitting (see + * split_huge_page()). + * + * Here, we care about whether the hardware can actually access the page right + * now. + * + * These issues aren't currently present for PUD but we also have a custom + * PUD_PRESENT for a layer of future-proofing. 
+ */ +#define PUD_PRESENT(pud) (pud_flags(pud) & _PAGE_PRESENT) +#define PMD_PRESENT(pmd) (pmd_flags(pmd) & _PAGE_PRESENT) +#define PTE_PRESENT(pte) (pte_flags(pte) & _PAGE_PRESENT) + +/* + * Check if an access by the kernel would cause a page fault. The access is + * described by a page fault error code (whether it was a write/instruction + * fetch) and address. This doesn't check for types of faults that are not + * expected to affect the kernel, e.g. PKU. The address can be user or kernel + * space, if user then we assume the access would happen via the uaccess API. + */ +static noinstr int +kernel_access_ok(unsigned long error_code, unsigned long address, pgd_t *pgd) +{ + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + int ret; + if (!pgd_present(*pgd)) return 0; @@ -1012,27 +1053,27 @@ spurious_kernel_fault(unsigned long error_code, unsigned long address) return 0; if (p4d_leaf(*p4d)) - return spurious_kernel_fault_check(error_code, (pte_t *) p4d); + return kernel_protection_ok(error_code, (pte_t *) p4d); pud = pud_offset(p4d, address); - if (!pud_present(*pud)) + if (!PUD_PRESENT(*pud)) return 0; if (pud_leaf(*pud)) - return spurious_kernel_fault_check(error_code, (pte_t *) pud); + return ker
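The fault-entry hunk that consumes kernel_access_ok() is not visible in
this excerpt; its intended shape is roughly the following sketch
(hypothetical placement and naming):

	/* In the kernel-address fault path, before normal handling: */
	if (asi_get_current()) {
		asi_exit();
		/*
		 * If the unrestricted tables could satisfy the access, the
		 * fault was caused only by the restricted mapping: return
		 * and retry the faulting access.
		 */
		if (kernel_access_ok(error_code, address,
				     init_mm.pgd + pgd_index(address)))
			return;
	}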
[PATCH RFC v2 08/29] mm: asi: Avoid warning from NMI userspace accesses in ASI context
nmi_uaccess_okay() emits a warning if current CR3 != mm->pgd. Limit the
warning to only when ASI is not active.

Co-developed-by: Junaid Shahid
Signed-off-by: Junaid Shahid
Co-developed-by: Yosry Ahmed
Signed-off-by: Yosry Ahmed
Signed-off-by: Brendan Jackman
---
 arch/x86/mm/tlb.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b2a13fdab0c6454c1d9d4e3338801f3402da4191..c41e083c5b5281684be79ad0391c1a5fc7b0c493 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1340,6 +1340,22 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	put_cpu();
 }
 
+static inline bool cr3_matches_current_mm(void)
+{
+	struct asi *asi = asi_get_current();
+	pgd_t *pgd_asi = asi_pgd(asi);
+	pgd_t *pgd_cr3;
+
+	/*
+	 * Prevent read_cr3_pa -> [NMI, asi_exit] -> asi_get_current,
+	 * otherwise we might find CR3 pointing to the ASI PGD but not
+	 * find a current ASI domain.
+	 */
+	barrier();
+	pgd_cr3 = __va(read_cr3_pa());
+	return pgd_cr3 == current->mm->pgd || pgd_cr3 == pgd_asi;
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
@@ -1355,10 +1371,10 @@ bool nmi_uaccess_okay(void)
 	VM_WARN_ON_ONCE(!loaded_mm);
 
 	/*
-	 * The condition we want to check is
-	 * current_mm->pgd == __va(read_cr3_pa()). This may be slow, though,
-	 * if we're running in a VM with shadow paging, and nmi_uaccess_okay()
-	 * is supposed to be reasonably fast.
+	 * The condition we want to check is that CR3 points to either
+	 * current_mm->pgd or an appropriate ASI PGD. Reading CR3 may be slow,
+	 * though, if we're running in a VM with shadow paging, and
+	 * nmi_uaccess_okay() is supposed to be reasonably fast.
 	 *
 	 * Instead, we check the almost equivalent but somewhat conservative
 	 * condition below, and we rely on the fact that switch_mm_irqs_off()
@@ -1367,7 +1383,7 @@ bool nmi_uaccess_okay(void)
 	if (loaded_mm != current_mm)
 		return false;
 
-	VM_WARN_ON_ONCE(current_mm->pgd != __va(read_cr3_pa()));
+	VM_WARN_ON_ONCE(!cr3_matches_current_mm());
 
 	return true;
 }
-- 
2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 09/29] mm: asi: ASI page table allocation functions
From: Junaid Shahid This adds custom allocation and free functions for ASI page tables. The alloc functions support allocating memory using different GFP reclaim flags, in order to be able to support non-sensitive allocations from both standard and atomic contexts. They also install the page tables locklessly, which makes it slightly simpler to handle non-sensitive allocations from interrupts/exceptions. checkpatch.pl MACRO_ARG_UNUSED,SPACING is false positive. COMPLEX_MACRO - I dunno, suggestions welcome. Checkpatch-args: --ignore=MACRO_ARG_UNUSED,SPACING,COMPLEX_MACRO Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/x86/mm/asi.c | 59 +++ 1 file changed, 59 insertions(+) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 8d060c633be68b508847e2c1c111761df1da92af..b15d043acedc9f459f17e86564a15061650afc3a 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -73,6 +73,65 @@ const char *asi_class_name(enum asi_class_id class_id) return asi_class_names[class_id]; } +#ifndef mm_inc_nr_p4ds +#define mm_inc_nr_p4ds(mm) do {} while (false) +#endif + +#ifndef mm_dec_nr_p4ds +#define mm_dec_nr_p4ds(mm) do {} while (false) +#endif + +#define pte_offset pte_offset_kernel + +/* + * asi_p4d_alloc, asi_pud_alloc, asi_pmd_alloc, asi_pte_alloc. + * + * These are like the normal xxx_alloc functions, but: + * + * - They use atomic operations instead of taking a spinlock; this allows them + *to be used from interrupts. This is necessary because we use the page + *allocator from interrupts and the page allocator ultimately calls this + *code. + * - They support customizing the allocation flags. + * + * On the other hand, they do not use the normal page allocation infrastructure, + * that means that PTE pages do not have the PageTable type nor the PagePgtable + * flag and we don't increment the meminfo stat (NR_PAGETABLE) as they do. + */ +static_assert(!IS_ENABLED(CONFIG_PARAVIRT)); +#define DEFINE_ASI_PGTBL_ALLOC(base, level)\ +__maybe_unused \ +static level##_t * asi_##level##_alloc(struct asi *asi, \ + base##_t *base, ulong addr, \ + gfp_t flags) \ +{ \ + if (unlikely(base##_none(*base))) { \ + ulong pgtbl = get_zeroed_page(flags); \ + phys_addr_t pgtbl_pa; \ + \ + if (!pgtbl) \ + return NULL;\ + \ + pgtbl_pa = __pa(pgtbl); \ + \ + if (cmpxchg((ulong *)base, 0, \ + pgtbl_pa | _PAGE_TABLE) != 0) { \ + free_page(pgtbl); \ + goto out; \ + } \ + \ + mm_inc_nr_##level##s(asi->mm); \ + } \ +out: \ + VM_BUG_ON(base##_leaf(*base)); \ + return level##_offset(base, addr); \ +} + +DEFINE_ASI_PGTBL_ALLOC(pgd, p4d) +DEFINE_ASI_PGTBL_ALLOC(p4d, pud) +DEFINE_ASI_PGTBL_ALLOC(pud, pmd) +DEFINE_ASI_PGTBL_ALLOC(pmd, pte) + void __init asi_check_boottime_disable(void) { bool enabled = IS_ENABLED(CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION_DEFAULT_ON); -- 2.47.1.613.gc27f4b7a9f-goog
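To see how the generated allocators compose, here is a hypothetical
walk that installs all intermediate levels for one address (a sketch;
real callers arrive with the later asi_map() patch, and the signatures
follow the DEFINE_ASI_PGTBL_ALLOC() macro above):

	static pte_t *sketch_asi_walk(struct asi *asi, unsigned long addr,
				      gfp_t gfp)
	{
		pgd_t *pgd = pgd_offset_pgd(asi_pgd(asi), addr);
		p4d_t *p4d = asi_p4d_alloc(asi, pgd, addr, gfp);
		pud_t *pud = p4d ? asi_pud_alloc(asi, p4d, addr, gfp) : NULL;
		pmd_t *pmd = pud ? asi_pmd_alloc(asi, pud, addr, gfp) : NULL;

		/*
		 * Each level is installed locklessly via cmpxchg(), so this
		 * is safe even if two CPUs race on the same range.
		 */
		return pmd ? asi_pte_alloc(asi, pmd, addr, gfp) : NULL;
	}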
[PATCH RFC v2 02/29] x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
Currently a nop config. Keeping as a separate commit for easy review of the boring bits. Later commits will use and enable this new config. This config is only added for non-UML x86_64 as other architectures do not yet have pending implementations. It also has somewhat artificial dependencies on !PARAVIRT and !KASAN which are explained in the Kconfig file. Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/alpha/include/asm/Kbuild | 1 + arch/arc/include/asm/Kbuild| 1 + arch/arm/include/asm/Kbuild| 1 + arch/arm64/include/asm/Kbuild | 1 + arch/csky/include/asm/Kbuild | 1 + arch/hexagon/include/asm/Kbuild| 1 + arch/loongarch/include/asm/Kbuild | 3 +++ arch/m68k/include/asm/Kbuild | 1 + arch/microblaze/include/asm/Kbuild | 1 + arch/mips/include/asm/Kbuild | 1 + arch/nios2/include/asm/Kbuild | 1 + arch/openrisc/include/asm/Kbuild | 1 + arch/parisc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/Kbuild| 1 + arch/riscv/include/asm/Kbuild | 1 + arch/s390/include/asm/Kbuild | 1 + arch/sh/include/asm/Kbuild | 1 + arch/sparc/include/asm/Kbuild | 1 + arch/um/include/asm/Kbuild | 2 +- arch/x86/Kconfig | 14 ++ arch/xtensa/include/asm/Kbuild | 1 + include/asm-generic/asi.h | 5 + 22 files changed, 41 insertions(+), 1 deletion(-) diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild index 396caece6d6d99c7a428f439322a0a18452e1a42..ca72ce3baca13a32913ac9e01a8f86ef42180b1c 100644 --- a/arch/alpha/include/asm/Kbuild +++ b/arch/alpha/include/asm/Kbuild @@ -5,3 +5,4 @@ generic-y += agp.h generic-y += asm-offsets.h generic-y += kvm_para.h generic-y += mcs_spinlock.h +generic-y += asi.h diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild index 49285a3ce2398cc7442bc44172de76367dc33dda..68604480864bbcb58d896da6bdf71591006ab2f6 100644 --- a/arch/arc/include/asm/Kbuild +++ b/arch/arc/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += parport.h generic-y += user.h +generic-y += asi.h diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild index 03657ff8fbe3d202563184b8902aa181e7474a5e..1e2c3d8dbbd99bdf95dbc6b47c2c78092c68b808 100644 --- a/arch/arm/include/asm/Kbuild +++ b/arch/arm/include/asm/Kbuild @@ -6,3 +6,4 @@ generic-y += parport.h generated-y += mach-types.h generated-y += unistd-nr.h +generic-y += asi.h diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild index 4e350df9a02dd8de387b912740af69035da93e34..15f896b80b5657b789ecf3529b1f18d16d80 100644 --- a/arch/arm64/include/asm/Kbuild +++ b/arch/arm64/include/asm/Kbuild @@ -14,6 +14,7 @@ generic-y += qrwlock.h generic-y += qspinlock.h generic-y += parport.h generic-y += user.h +generic-y += asi.h generated-y += cpucap-defs.h generated-y += sysreg-defs.h diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild index 9a9bc65b57a9d73dadc9d597700d7229f8554ddf..4f497118fb172d1f2bf0f9e472479f24227f42f4 100644 --- a/arch/csky/include/asm/Kbuild +++ b/arch/csky/include/asm/Kbuild @@ -11,3 +11,4 @@ generic-y += qspinlock.h generic-y += parport.h generic-y += user.h generic-y += vmlinux.lds.h +generic-y += asi.h diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild index 8c1a78c8f5271ebd47f1baad7b85e87220d1bbe8..b26f186bc03c2e135f8d125a4805b95a41513655 100644 --- a/arch/hexagon/include/asm/Kbuild +++ b/arch/hexagon/include/asm/Kbuild @@ -5,3 +5,4 @@ generic-y += extable.h generic-y += iomap.h generic-y += kvm_para.h generic-y += mcs_spinlock.h +generic-y += 
asi.h diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild index 5b5a6c90e6e20771b1074a6262230861cc51bcb4..dd3d0c6891369a9dfa35ccfb8b81c8697c2a3e90 100644 --- a/arch/loongarch/include/asm/Kbuild +++ b/arch/loongarch/include/asm/Kbuild @@ -11,3 +11,6 @@ generic-y += ioctl.h generic-y += mmzone.h generic-y += statfs.h generic-y += param.h +generic-y += asi.h +generic-y += posix_types.h +generic-y += resource.h diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild index 0dbf9c5c6faeb30eeb38bea52ab7fade99bbd44a..faf0f135df4ab946ef115f3a2fc363f370fc7491 100644 --- a/arch/m68k/include/asm/Kbuild +++ b/arch/m68k/include/asm/Kbuild @@ -4,3 +4,4 @@ generic-y += extable.h generic-y += kvm_para.h generic-y += mcs_spinlock.h generic-y += spinlock.h +generic-y += asi.h diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild index a055f5dbe00a31616592c3a848b49bbf9ead5d17..012e4bf83c13497dc296b66cd5e0fd519274306b 100644 --- a/arch/microblaze/include/asm/Kbuild +++ b/arch/microblaze/include/asm/Kbuild @@ -8,3 +8,4 @@ generic-y += parport.h generic-y += syscalls.h generic-y += tlb.h generic-y += user.h +generic-y
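The point of all the generic-y lines is that include/asm-generic/asi.h
gives every other architecture a trivial asi.h, which later patches
fill with no-op stubs so common code can call ASI hooks unconditionally.
The shape is roughly (a sketch, not the literal header):

	#ifndef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION

	/* Inert stubs: ASI calls compile away on unsupported arches. */
	static inline void asi_init_mm_state(struct mm_struct *mm) { }
	static inline void asi_exit(void) { }

	#endif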
[PATCH RFC v2 04/29] mm: asi: Add infrastructure for boot-time enablement
Add a boot time parameter to control the newly added X86_FEATURE_ASI.
"asi=on" or "asi=off" can be used in the kernel command line to enable
or disable ASI at boot time. If not specified, ASI enablement depends
on CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON, which is off by default.
asi_check_boottime_disable() is modeled after
pti_check_boottime_disable().

The boot parameter is currently ignored until ASI is fully functional.
Once we have a set of ASI features checked in that we have actually
tested, we will stop ignoring the flag. But for now let's just add the
infrastructure so we can implement the usage code.

Ignoring checkpatch.pl CONFIG_DESCRIPTION because the _DEFAULT_ON
Kconfig is trivial to explain.

Checkpatch-args: --ignore CONFIG_DESCRIPTION
Co-developed-by: Junaid Shahid
Signed-off-by: Junaid Shahid
Co-developed-by: Yosry Ahmed
Signed-off-by: Yosry Ahmed
Signed-off-by: Brendan Jackman
---
 arch/x86/Kconfig                         |  9 +
 arch/x86/include/asm/asi.h               | 19 --
 arch/x86/include/asm/cpufeatures.h       |  1 +
 arch/x86/include/asm/disabled-features.h |  8 -
 arch/x86/mm/asi.c                        | 61 +++-
 arch/x86/mm/init.c                       |  4 ++-
 include/asm-generic/asi.h                |  4 +++
 7 files changed, 92 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5a50582eb210e9d1309856a737d32b76fa1bfc85..1fcb52cb8cd5084ac3cef04af61b7d1653162bdb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2533,6 +2533,15 @@ config MITIGATION_ADDRESS_SPACE_ISOLATION
 	  there are likely to be unhandled cases, in particular concerning TLB
 	  flushes.
 
+
+config ADDRESS_SPACE_ISOLATION_DEFAULT_ON
+	bool "Enable address space isolation by default"
+	default n
+	depends on MITIGATION_ADDRESS_SPACE_ISOLATION
+	help
+	  If selected, ASI is enabled by default at boot if neither asi=on
+	  nor asi=off is specified.
+
 config MITIGATION_RETPOLINE
 	bool "Avoid speculative indirect branches in kernel"
 	select OBJTOOL if HAVE_OBJTOOL

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 7cc635b6653a3970ba9dbfdc9c828a470e27bd44..b9671ef2dd3278adceed18507fd260e21954d574 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -8,6 +8,7 @@
 #include
 #include
+#include
 #include
 
 #ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
@@ -66,6 +67,8 @@
  * the N ASI classes.
  */
 
+#define static_asi_enabled() cpu_feature_enabled(X86_FEATURE_ASI)
+
 /*
  * ASI uses a per-CPU tainting model to track what mitigation actions are
  * required on domain transitions. Taints exist along two dimensions:
@@ -131,6 +134,8 @@ struct asi {
 
 DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi);
 
+void asi_check_boottime_disable(void);
+
 void asi_init_mm_state(struct mm_struct *mm);
 
 int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_policy);
@@ -155,7 +160,9 @@ void asi_exit(void);
 /* The target is the domain we'll enter when returning to process context. */
 static __always_inline struct asi *asi_get_target(struct task_struct *p)
 {
-	return p->thread.asi_state.target;
+	return static_asi_enabled()
+	       ? p->thread.asi_state.target
+	       : NULL;
 }
 
 static __always_inline void asi_set_target(struct task_struct *p,
@@ -166,7 +173,9 @@ static __always_inline void asi_set_target(struct task_struct *p,
 
 static __always_inline struct asi *asi_get_current(void)
 {
-	return this_cpu_read(curr_asi);
+	return static_asi_enabled()
+	       ? this_cpu_read(curr_asi)
+	       : NULL;
 }
 
 /* Are we currently in a restricted address space? */
@@ -175,7 +184,11 @@ static __always_inline bool asi_is_restricted(void)
 	return (bool)asi_get_current();
 }
 
-/* If we exit/have exited, can we stay that way until the next asi_enter? */
+/*
+ * If we exit/have exited, can we stay that way until the next asi_enter?
+ *
+ * When ASI is disabled, this returns true.
+ */
 static __always_inline bool asi_is_relaxed(void)
 {
 	return !asi_get_target(current);

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 913fd3a7bac6506141de65f33b9ee61c615c7d7d..d6a808d10c3b8900d190ea01c66fc248863f05e2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -474,6 +474,7 @@
 #define X86_FEATURE_CLEAR_BHB_HW	(21*32+ 3) /* BHI_DIS_S HW control enabled */
 #define X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT (21*32+ 4) /* Clear branch history at vmexit using SW loop */
 #define X86_FEATURE_FAST_CPPC		(21*32 + 5) /* AMD Fast CPPC */
+#define X86_FEATURE_ASI			(21*32+ 6) /* Kernel Address Space Isolation */
 
 /*
  * BUG word(s)
diff --git a/arch/x86/include/asm/disable
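The asi.c hunk is cut off above; modeled on
pti_check_boottime_disable(), the check would look roughly like this (a
sketch under that assumption, not the literal patch):

	void __init asi_check_boottime_disable(void)
	{
		bool enabled = IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON);
		char arg[4];
		int ret;

		ret = cmdline_find_option(boot_command_line, "asi",
					  arg, sizeof(arg));
		if (ret == 3 && !strncmp(arg, "off", 3))
			enabled = false;
		else if (ret == 2 && !strncmp(arg, "on", 2))
			enabled = true;

		/* Per the commit message, the result is ignored for now. */
		if (enabled)
			pr_info("ASI enablement ignored due to incomplete implementation.\n");
	}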
[PATCH RFC v2 14/29] mm: asi: Map non-user buddy allocations as nonsensitive
This is just simplest possible page_alloc patch I could come up with to demonstrate ASI working in a "denylist" mode: we map the direct map into the restricted address space, except pages allocated with GFP_USER. Pages must be asi_unmap()'d before they can be re-allocated. This requires a TLB flush, which can't generally be done from the free path (requires IRQs on), so pages that need unmapping are freed via a workqueue. This solution is not ideal: - If the async queue gets long, we'll run out of allocatable memory. - We don't batch the TLB flushing or worker wakeups at all. - We drop FPI flags and skip the pcplists. Internally at Google we've so far found with extra complexity we're able to make this solution work for the workloads we've tested so far. It seems likely this won't keep working forever. So instead for the [PATCH] version I hope to come up with an implementation that instead just makes the allocator more deeply aware of sensitivity, most likely this will look a bit like an extra "dimension" like movability etc. This was discussed at LSF/MM/BPF [1], I plan to research this right after RFCv2. However, once that research is done we might want to consider merging a sub-optimal solution to unblock iteration and development. [1] https://youtu.be/WD9-ey8LeiI The main thing in here that is "real" and may warrant discussion is __GFP_SENSITIVE (or at least, some sort of allocator switch to determine sensitivity, in an "allowlist" model we would probably have the opposite, and in future iterations we might want additional options for different "types" of sensitivity). I think we need this as an extension to the allocation API; the main alternative would be to infer from context of the allocation whether the data should be treated as sensitive; however I think we will have contexts where both sensitive and nonsensitive data needs to be allocatable. If there are concerns about __GFP flags specifically, rather than just the general problem of expanding the allocator API, we could always just provide an API like __alloc_pages_sensitive or something, implemented with ALLOC_ flags internally. Checkpatch-args: --ignore=SPACING,MACRO_ARG_UNUSED,COMPLEX_MACRO Signed-off-by: Brendan Jackman --- arch/x86/mm/asi.c | 33 +- include/linux/gfp.h| 5 ++ include/linux/gfp_types.h | 15 - include/linux/page-flags.h | 11 include/trace/events/mmflags.h | 12 +++- mm/mm_init.c | 1 + mm/page_alloc.c| 134 - tools/perf/builtin-kmem.c | 1 + 8 files changed, 205 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 17391ec8b22e3c0903cd5ca29cbb03fcc4cbacce..b951f2100b8bdea5738ded16166255deb29faf57 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -5,6 +5,8 @@ #include #include +#include + #include #include #include @@ -104,10 +106,17 @@ const char *asi_class_name(enum asi_class_id class_id) *allocator from interrupts and the page allocator ultimately calls this *code. * - They support customizing the allocation flags. + * - They avoid infinite recursion when the page allocator calls back to + *asi_map * * On the other hand, they do not use the normal page allocation infrastructure, * that means that PTE pages do not have the PageTable type nor the PagePgtable * flag and we don't increment the meminfo stat (NR_PAGETABLE) as they do. + * + * As an optimisation we attempt to map the pagetables in + * ASI_GLOBAL_NONSENSITIVE, but this can fail, and for simplicity we don't do + * anything about that. This means it's invalid to access ASI pagetables from a + * critical section. 
*/ static_assert(!IS_ENABLED(CONFIG_PARAVIRT)); #define DEFINE_ASI_PGTBL_ALLOC(base, level)\ @@ -116,8 +125,11 @@ static level##_t * asi_##level##_alloc(struct asi *asi, \ gfp_t flags) \ { \ if (unlikely(base##_none(*base))) { \ - ulong pgtbl = get_zeroed_page(flags); \ + /* Stop asi_map calls causing recursive allocation */ \ + gfp_t pgtbl_gfp = flags | __GFP_SENSITIVE; \ + ulong pgtbl = get_zeroed_page(pgtbl_gfp); \ phys_addr_t pgtbl_pa; \ + int err;\ \ if (!pgtbl) \ return NULL;
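At allocation sites the proposed switch is then a one-liner, e.g.
(illustrative usage, not a hunk from this patch):

	/*
	 * Data the guest/userspace must never be able to leak: keep it
	 * out of the restricted address space.
	 */
	struct page *secret = alloc_pages(GFP_KERNEL | __GFP_SENSITIVE, 0);

	/*
	 * Data that must stay usable inside the ASI critical section
	 * (the default for non-user allocations in this patch):
	 */
	struct page *meta = alloc_pages(GFP_KERNEL, 0);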
[PATCH TEMP WORKAROUND RFC v2 15/29] mm: asi: Workaround missing partial-unmap support
This is a hack, no need to review it carefully. asi_unmap() doesn't currently work unless it corresponds exactly to an asi_map() of the exact same region. This is mostly harmless (it's only a functional problem if you wanna touch those pages from the ASI critical section) but it's messy. For now, working around the only practical case that appears by moving the asi_map call up the call stack in the page allocator, to the place where we know the actual size the mapping is supposed to end up at. This just removes the main case where that happens. Later, a proper solution for partial unmaps will be needed. Signed-off-by: Brendan Jackman --- mm/page_alloc.c | 40 ++-- 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3e98fdfbadddb1f7d71e9e050b63255b2008d167..f96e95032450be90b6567f67915b0b941fc431d8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4604,22 +4604,20 @@ void __init page_alloc_init_asi(void) } } -static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask) +static int asi_map_alloced_pages(struct page *page, size_t size, gfp_t gfp_mask) { if (!static_asi_enabled()) return 0; if (!(gfp_mask & __GFP_SENSITIVE)) { - int err = asi_map_gfp( - ASI_GLOBAL_NONSENSITIVE, page_to_virt(page), - PAGE_SIZE * (1 << order), gfp_mask); + int err = asi_map_gfp(ASI_GLOBAL_NONSENSITIVE, page_to_virt(page), size, gfp_mask); uint i; if (err) return err; - for (i = 0; i < (1 << order); i++) + for (i = 0; i < (size >> PAGE_SHIFT); i++) __SetPageGlobalNonSensitive(page + i); } @@ -4629,7 +4627,7 @@ static int asi_map_alloced_pages(struct page *page, uint order, gfp_t gfp_mask) #else /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */ static inline -int asi_map_alloced_pages(struct page *pages, uint order, gfp_t gfp_mask) +int asi_map_alloced_pages(struct page *pages, size_t size, gfp_t gfp_mask) { return 0; } @@ -4896,7 +4894,7 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype); kmsan_alloc_page(page, order, alloc_gfp); - if (page && unlikely(asi_map_alloced_pages(page, order, gfp))) { + if (page && unlikely(asi_map_alloced_pages(page, PAGE_SIZE << order, gfp))) { __free_pages(page, order); page = NULL; } @@ -5118,12 +5116,13 @@ void page_frag_free(void *addr) } EXPORT_SYMBOL(page_frag_free); -static void *make_alloc_exact(unsigned long addr, unsigned int order, - size_t size) +static void *finish_exact_alloc(unsigned long addr, unsigned int order, + size_t size, gfp_t gfp_mask) { if (addr) { unsigned long nr = DIV_ROUND_UP(size, PAGE_SIZE); struct page *page = virt_to_page((void *)addr); + struct page *first = page; struct page *last = page + nr; split_page_owner(page, order, 0); @@ -5132,9 +5131,22 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, while (page < --last) set_page_refcounted(last); - last = page + (1UL << order); + last = page + (1 << order); for (page += nr; page < last; page++) __free_pages_ok(page, 0, FPI_TO_TAIL); + + /* +* ASI doesn't support partially undoing calls to asi_map, so +* we can only safely free sub-allocations if they were made +* with __GFP_SENSITIVE in the first place. Users of this need +* to map with forced __GFP_SENSITIVE and then here we'll make a +* second asi_map_alloced_pages() call to do any mapping that's +* necessary, but with the exact size. 
+*/ + if (unlikely(asi_map_alloced_pages(first, nr << PAGE_SHIFT, gfp_mask))) { + free_pages_exact(first, size); + return NULL; + } } return (void *)addr; } @@ -5162,8 +5174,8 @@ void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM))) gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM); - addr = get_free_pages_noprof(gfp_mask, order); - return make_alloc_exact(addr, order, size); + addr = get_free_pages_noprof(gfp_mask | __GFP_SENSITIVE, order); + return finish_exact_alloc(addr, order, size, gfp_mask); } EXPORT_SYMBOL(alloc_pages_exact_nopr
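Stated as code, the limitation being dodged is (illustrative):

	/*
	 * Not currently supported: unmapping a strict subrange of a
	 * previous asi_map() call.
	 *
	 *	asi_map(asi, p, 4 * PAGE_SIZE);
	 *	asi_unmap(asi, p + PAGE_SIZE, PAGE_SIZE);	// broken today
	 *
	 * Hence finish_exact_alloc() forces the sub-allocation to be
	 * __GFP_SENSITIVE and then asi_map()s exactly the final size, so
	 * any later asi_unmap() mirrors an identical asi_map().
	 */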
[PATCH RFC v2 11/29] mm: asi: Functions to map/unmap a memory range into ASI page tables
From: Junaid Shahid Two functions, asi_map() and asi_map_gfp(), are added to allow mapping memory into ASI page tables. The mapping will be identical to the one for the same virtual address in the unrestricted page tables. This is necessary to allow switching between the page tables at any arbitrary point in the kernel. Another function, asi_unmap() is added to allow unmapping memory mapped via asi_map* RFC Notes: Don't read too much into the implementation of this, lots of it should probably be rewritten. It also needs to gain support for partial unmappings. Checkpatch-args: --ignore=MACRO_ARG_UNUSED Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman Signed-off-by: Kevin Cheng --- arch/x86/include/asm/asi.h | 5 + arch/x86/mm/asi.c | 236 - arch/x86/mm/tlb.c | 5 + include/asm-generic/asi.h | 11 +++ include/linux/pgtable.h| 3 + mm/internal.h | 2 + mm/vmalloc.c | 32 +++--- 7 files changed, 280 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index a55e73f1b2bc84c41b9ab25f642a4d5f1aa6ba90..33f18be0e268b3a6725196619cbb8d847c21e197 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -157,6 +157,11 @@ void asi_relax(void); /* Immediately exit the restricted address space if in it */ void asi_exit(void); +int asi_map_gfp(struct asi *asi, void *addr, size_t len, gfp_t gfp_flags); +int asi_map(struct asi *asi, void *addr, size_t len); +void asi_unmap(struct asi *asi, void *addr, size_t len); +void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len); + static inline void asi_init_thread_state(struct thread_struct *thread) { thread->asi_state.intr_nest_depth = 0; diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index b15d043acedc9f459f17e86564a15061650afc3a..f2d8fbc0366c289891903e1c2ac6c59b9476d95f 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -11,6 +11,9 @@ #include #include #include +#include + +#include "../../../mm/internal.h" static struct asi_taint_policy *taint_policies[ASI_MAX_NUM_CLASSES]; @@ -100,7 +103,6 @@ const char *asi_class_name(enum asi_class_id class_id) */ static_assert(!IS_ENABLED(CONFIG_PARAVIRT)); #define DEFINE_ASI_PGTBL_ALLOC(base, level)\ -__maybe_unused \ static level##_t * asi_##level##_alloc(struct asi *asi, \ base##_t *base, ulong addr, \ gfp_t flags) \ @@ -455,3 +457,235 @@ void asi_handle_switch_mm(void) this_cpu_or(asi_taints, new_taints); this_cpu_and(asi_taints, ~(ASI_TAINTS_GUEST_MASK | ASI_TAINTS_USER_MASK)); } + +static bool is_page_within_range(unsigned long addr, unsigned long page_size, +unsigned long range_start, unsigned long range_end) +{ + unsigned long page_start = ALIGN_DOWN(addr, page_size); + unsigned long page_end = page_start + page_size; + + return page_start >= range_start && page_end <= range_end; +} + +static bool follow_physaddr( + pgd_t *pgd_table, unsigned long virt, + phys_addr_t *phys, unsigned long *page_size, ulong *flags) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + /* RFC: This should be rewritten with lookup_address_in_*. 
*/ + + *page_size = PGDIR_SIZE; + pgd = pgd_offset_pgd(pgd_table, virt); + if (!pgd_present(*pgd)) + return false; + if (pgd_leaf(*pgd)) { + *phys = PFN_PHYS(pgd_pfn(*pgd)) | (virt & ~PGDIR_MASK); + *flags = pgd_flags(*pgd); + return true; + } + + *page_size = P4D_SIZE; + p4d = p4d_offset(pgd, virt); + if (!p4d_present(*p4d)) + return false; + if (p4d_leaf(*p4d)) { + *phys = PFN_PHYS(p4d_pfn(*p4d)) | (virt & ~P4D_MASK); + *flags = p4d_flags(*p4d); + return true; + } + + *page_size = PUD_SIZE; + pud = pud_offset(p4d, virt); + if (!pud_present(*pud)) + return false; + if (pud_leaf(*pud)) { + *phys = PFN_PHYS(pud_pfn(*pud)) | (virt & ~PUD_MASK); + *flags = pud_flags(*pud); + return true; + } + + *page_size = PMD_SIZE; + pmd = pmd_offset(pud, virt); + if (!pmd_present(*pmd)) + return false; + if (pmd_leaf(*pmd)) { + *phys = PFN_PHYS(pmd_pfn(*pmd)) | (virt & ~PMD_MASK); + *flags = pmd_flags(*pmd); + return true; + } + + *page_size = PAGE_SIZE; + pte = pte_offset_map(pmd, virt); + if (!pte) +
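Basic intended usage of the new API, for reference (a sketch; assumes
an asi handle obtained from asi_init() and page-aligned buffers):

	unsigned long p = __get_free_page(GFP_KERNEL);
	int err;

	/*
	 * Mirror the direct-map entry for this page into the restricted
	 * page tables, identical to the unrestricted mapping.
	 */
	err = asi_map(asi, (void *)p, PAGE_SIZE);
	if (!err) {
		/* ... usable inside the ASI critical section here ... */
		asi_unmap(asi, (void *)p, PAGE_SIZE); /* also flushes TLB */
	}
	free_page(p);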
[PATCH RFC v2 16/29] mm: asi: Map kernel text and static data as nonsensitive
Basically we need to map the kernel code and all its static variables.
Per-CPU variables need to be treated specially as described in the
comments. The cpu_entry_area is similar - this needs to be nonsensitive
so that the CPU can access the GDT etc when handling a page fault.

Under 5-level paging, most of the kernel memory comes under a single
PGD entry (see Documentation/x86/x86_64/mm.rst. Basically, the mapping
for this big region is the same as under 4-level, just wrapped in an
outer PGD entry). For that region, the "clone" logic is moved down one
step of the paging hierarchy.

Note that the p4d_alloc in asi_clone_p4d won't actually be used in
practice; the relevant PGD entry will always have been populated by
prior asi_map calls so this code would "work" if we just wrote
p4d_offset (but asi_clone_p4d would be broken if viewed in isolation).

The vmemmap area is not under this single PGD, it has its own 2-PGD
area, so we still use asi_clone_pgd for that one.

Signed-off-by: Brendan Jackman
---
 arch/x86/mm/asi.c                 | 105 ++++++++++++++++++++++++++++-
 include/asm-generic/vmlinux.lds.h |  11 ++++
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index b951f2100b8bdea5738ded16166255deb29faf57..bc2cf0475a0e7344a66d81453f55034b2fc77eef 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -7,7 +7,6 @@
 #include
 #include
-#include
 #include
 #include
 #include
@@ -186,8 +185,68 @@ void __init asi_check_boottime_disable(void)
 	pr_info("ASI enablement ignored due to incomplete implementation.\n");
 }
 
+/*
+ * Map data by sharing sub-PGD pagetables with the unrestricted mapping. This is
+ * more efficient than asi_map, but only works when you know the whole top-level
+ * page needs to be mapped in the restricted tables. Note that the size of the
+ * mappings this creates differs between 4 and 5-level paging.
+ */
+static void asi_clone_pgd(pgd_t *dst_table, pgd_t *src_table, size_t addr)
+{
+	pgd_t *src = pgd_offset_pgd(src_table, addr);
+	pgd_t *dst = pgd_offset_pgd(dst_table, addr);
+
+	if (!pgd_val(*dst))
+		set_pgd(dst, *src);
+	else
+		WARN_ON_ONCE(pgd_val(*dst) != pgd_val(*src));
+}
+
+/*
+ * For 4-level paging this is exactly the same as asi_clone_pgd. For 5-level
+ * paging it clones one level lower. So this always creates a mapping of the
+ * same size.
+ */
+static void asi_clone_p4d(pgd_t *dst_table, pgd_t *src_table, size_t addr)
+{
+	pgd_t *src_pgd = pgd_offset_pgd(src_table, addr);
+	pgd_t *dst_pgd = pgd_offset_pgd(dst_table, addr);
+	p4d_t *src_p4d = p4d_alloc(&init_mm, src_pgd, addr);
+	p4d_t *dst_p4d = p4d_alloc(&init_mm, dst_pgd, addr);
+
+	if (!p4d_val(*dst_p4d))
+		set_p4d(dst_p4d, *src_p4d);
+	else
+		WARN_ON_ONCE(p4d_val(*dst_p4d) != p4d_val(*src_p4d));
+}
+
+/*
+ * percpu_addr is where the linker put the percpu variable. asi_map_percpu finds
+ * the place where the percpu allocator copied the data during boot.
+ *
+ * This is necessary even when the page allocator defaults to
+ * global-nonsensitive, because the percpu allocator uses the memblock allocator
+ * for early allocations.
+ */ +static int asi_map_percpu(struct asi *asi, void *percpu_addr, size_t len) +{ + int cpu, err; + void *ptr; + + for_each_possible_cpu(cpu) { + ptr = per_cpu_ptr(percpu_addr, cpu); + err = asi_map(asi, ptr, len); + if (err) + return err; + } + + return 0; +} + static int __init asi_global_init(void) { + int err; + if (!boot_cpu_has(X86_FEATURE_ASI)) return 0; @@ -207,6 +266,46 @@ static int __init asi_global_init(void) VMALLOC_START, VMALLOC_END, "ASI Global Non-sensitive vmalloc"); + /* Map all kernel text and static data */ + err = asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)__START_KERNEL, + (size_t)_end - __START_KERNEL); + if (WARN_ON(err)) + return err; + err = asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)FIXADDR_START, + FIXADDR_SIZE); + if (WARN_ON(err)) + return err; + /* Map all static percpu data */ + err = asi_map_percpu( + ASI_GLOBAL_NONSENSITIVE, + __per_cpu_start, __per_cpu_end - __per_cpu_start); + if (WARN_ON(err)) + return err; + + /* +* The next areas are mapped using shared sub-P4D paging structures +* (asi_clone_p4d instead of asi_map), since we know the whole P4D will +* be mapped. +*/ + asi_clone_p4d(asi_global_nonsensitive_pgd, init_mm.pgd, + CPU_ENTRY_AREA_BASE); +#ifdef CONFIG_X86_ESPFIX64 + a
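To make the clone-vs-map split concrete, here is a hedged sketch of the decision asi_global_init() is applying above. The rule of thumb is illustrative only; the series does not codify it as a helper.

	/* Sketch only: not a helper the series defines. */
	static int sketch_map_nonsensitive(void *start, size_t len)
	{
		/*
		 * Share a whole pagetable sub-tree when an entire aligned
		 * top-level region is known to be nonsensitive; otherwise
		 * build page-granular mappings with asi_map().
		 */
		if (IS_ALIGNED((unsigned long)start, P4D_SIZE) && len == P4D_SIZE) {
			asi_clone_p4d(asi_global_nonsensitive_pgd, init_mm.pgd,
				      (size_t)start);
			return 0;
		}
		return asi_map(ASI_GLOBAL_NONSENSITIVE, start, len);
	}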
[PATCH RFC v2 17/29] mm: asi: Map vmalloc/vmap data as nonsensitive
We add new VM flags for sensitive and global-nonsensitive, parallel to the corresponding GFP flags. __get_vm_area_node and friends will default to creating global-nonsensitive VM areas, and vmap then calls asi_map as necessary. __vmalloc_node_range has additional logic to check and set defaults for the sensitivity of the underlying page allocation. It does this via an initial __set_asi_flags call - note that it then calls __get_vm_area_node which also calls __set_asi_flags. This second call is a NOP. By default, we mark the underlying page allocation as sensitive, even if the VM area is global-nonsensitive. This is just an optimization to avoid unnecessary asi_map etc, since presumably most code has no reason to access vmalloc'd data through the direct map. There are some details of the GFP-flag/VM-flag interaction that are not really obvious, for example: what should happen when callers of __vmalloc explicitly set GFP sensitivity flags? (That function has no VM flags argument). For the moment let's just not block on that and focus on adding the infrastructure, though. At the moment, the high-level vmalloc APIs don't actually provide a way to configure sensitivity; this commit just adds the infrastructure. We'll have to decide how to expose this to allocation sites as we implement more denylist logic. vmap does already allow configuring vm flags. Signed-off-by: Brendan Jackman --- mm/vmalloc.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 8d260f2174fe664b54dcda054cb9759ae282bf03..00745edf0b2c5f4c769a46bdcf0872223de5299d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3210,6 +3210,7 @@ struct vm_struct *remove_vm_area(const void *addr) { struct vmap_area *va; struct vm_struct *vm; + unsigned long vm_addr; might_sleep(); @@ -3221,6 +3222,7 @@ struct vm_struct *remove_vm_area(const void *addr) if (!va || !va->vm) return NULL; vm = va->vm; + vm_addr = (unsigned long) READ_ONCE(vm->addr); debug_check_no_locks_freed(vm->addr, get_vm_area_size(vm)); debug_check_no_obj_freed(vm->addr, get_vm_area_size(vm)); @@ -3352,6 +3354,7 @@ void vfree(const void *addr) addr); return; } + asi_unmap(ASI_GLOBAL_NONSENSITIVE, vm->addr, get_vm_area_size(vm)); if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS)) vm_reset_perms(vm); @@ -3397,6 +3400,7 @@ void vunmap(const void *addr) addr); return; } + asi_unmap(ASI_GLOBAL_NONSENSITIVE, vm->addr, get_vm_area_size(vm)); kfree(vm); } EXPORT_SYMBOL(vunmap); @@ -3445,16 +3449,21 @@ void *vmap(struct page **pages, unsigned int count, addr = (unsigned long)area->addr; if (vmap_pages_range(addr, addr + size, pgprot_nx(prot), - pages, PAGE_SHIFT) < 0) { - vunmap(area->addr); - return NULL; - } + pages, PAGE_SHIFT) < 0) + goto err; + + if (asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr, + get_vm_area_size(area))) + goto err; /* The necessary asi_unmap() is in vunmap. */ if (flags & VM_MAP_PUT_PAGES) { area->pages = pages; area->nr_pages = count; } return area->addr; +err: + vunmap(area->addr); + return NULL; } EXPORT_SYMBOL(vmap); @@ -3711,6 +3720,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, goto fail; } + if (asi_map(ASI_GLOBAL_NONSENSITIVE, area->addr, + get_vm_area_size(area))) + goto fail; /* The necessary asi_unmap() is in vfree. */ + return area->addr; fail: -- 2.47.1.613.gc27f4b7a9f-goog
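Since vmap() already accepts VM flags, per-mapping sensitivity could eventually be expressed there. A sketch, assuming a VM_SENSITIVE flag (a name that only appears in a later patch's WIP note, not defined in this commit) that would mean "skip the asi_map() path":

	/* Hypothetical caller opting out of the global-nonsensitive default. */
	static void *sketch_vmap_sensitive(struct page **pages, unsigned int count)
	{
		return vmap(pages, count, VM_MAP | VM_SENSITIVE, PAGE_KERNEL);
	}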
[PATCH RFC v2 12/29] mm: asi: Add basic infrastructure for global non-sensitive mappings
From: Junaid Shahid A pseudo-PGD is added to store global non-sensitive ASI mappings. Actual ASI PGDs copy entries from this pseudo-PGD during asi_init(). Memory can be mapped as globally non-sensitive by calling asi_map() with ASI_GLOBAL_NONSENSITIVE. Page tables allocated for global non-sensitive mappings are never freed. These page tables are shared between all domains and init_mm, so they don't need special synchronization. RFC note: A refactoring/prep commit should be split out of this patch. Signed-off-by: Junaid Shahid Signed-off-by: Brendan Jackman --- arch/x86/include/asm/asi.h | 3 +++ arch/x86/mm/asi.c | 37 + arch/x86/mm/init_64.c | 25 - arch/x86/mm/mm_internal.h | 3 +++ include/asm-generic/asi.h | 2 ++ 5 files changed, 61 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 33f18be0e268b3a6725196619cbb8d847c21e197..555edb5f292e4d6baba782f51d014aa48dc850b6 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -120,6 +120,9 @@ struct asi_taint_policy { asi_taints_t set; }; +extern struct asi __asi_global_nonsensitive; +#define ASI_GLOBAL_NONSENSITIVE(&__asi_global_nonsensitive) + /* * An ASI domain (struct asi) represents a restricted address space. The * unrestricted address space (and user address space under PTI) are not diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index f2d8fbc0366c289891903e1c2ac6c59b9476d95f..17391ec8b22e3c0903cd5ca29cbb03fcc4cbacce 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -13,6 +13,7 @@ #include #include +#include "mm_internal.h" #include "../../../mm/internal.h" static struct asi_taint_policy *taint_policies[ASI_MAX_NUM_CLASSES]; @@ -26,6 +27,13 @@ const char *asi_class_names[] = { DEFINE_PER_CPU_ALIGNED(struct asi *, curr_asi); EXPORT_SYMBOL(curr_asi); +static __aligned(PAGE_SIZE) pgd_t asi_global_nonsensitive_pgd[PTRS_PER_PGD]; + +struct asi __asi_global_nonsensitive = { + .pgd = asi_global_nonsensitive_pgd, + .mm = &init_mm, +}; + static inline bool asi_class_id_valid(enum asi_class_id class_id) { return class_id >= 0 && class_id < ASI_MAX_NUM_CLASSES; @@ -156,6 +164,31 @@ void __init asi_check_boottime_disable(void) pr_info("ASI enablement ignored due to incomplete implementation.\n"); } +static int __init asi_global_init(void) +{ + if (!boot_cpu_has(X86_FEATURE_ASI)) + return 0; + + /* +* Lower-level pagetables for global nonsensitive mappings are shared, +* but the PGD has to be copied into each domain during asi_init. To +* avoid needing to synchronize new mappings into pre-existing domains +* we just pre-allocate all of the relevant level N-1 entries so that +* the global nonsensitive PGD already has pointers that can be copied +* when new domains get asi_init()ed. 
+*/ + preallocate_sub_pgd_pages(asi_global_nonsensitive_pgd, + PAGE_OFFSET, + PAGE_OFFSET + PFN_PHYS(max_pfn) - 1, + "ASI Global Non-sensitive direct map"); + preallocate_sub_pgd_pages(asi_global_nonsensitive_pgd, + VMALLOC_START, VMALLOC_END, + "ASI Global Non-sensitive vmalloc"); + + return 0; +} +subsys_initcall(asi_global_init) + static void __asi_destroy(struct asi *asi) { WARN_ON_ONCE(asi->ref_count <= 0); @@ -170,6 +203,7 @@ int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_ { struct asi *asi; int err = 0; + uint i; *out_asi = NULL; @@ -203,6 +237,9 @@ int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_ asi->mm = mm; asi->class_id = class_id; + for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++) + set_pgd(asi->pgd + i, asi_global_nonsensitive_pgd[i]); + exit_unlock: if (err) __asi_destroy(asi); diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index ff253648706fa9cd49169a54882014a72ad540cf..9d358a05c4e18ac6d5e115de111758ea6cdd37f2 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1288,18 +1288,15 @@ static void __init register_page_bootmem_info(void) #endif } -/* - * Pre-allocates page-table pages for the vmalloc area in the kernel page-table. - * Only the level which needs to be synchronized between all page-tables is - * allocated because the synchronization can be expensive. - */ -static void __init preallocate_vmalloc_pages(void) +/* Initialize empty pagetables at the level below PGD. */ +void __init preallocate_sub_pgd_pages(pgd_t *pgd_table, ulong start, +
[PATCH RFC v2 13/29] mm: Add __PAGEFLAG_FALSE
__PAGEFLAG_FALSE is a non-atomic equivalent of PAGEFLAG_FALSE. Checkpatch-args: --ignore=COMPLEX_MACRO Signed-off-by: Brendan Jackman --- include/linux/page-flags.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index cc839e4365c18223e68c35efd0f67e7650708e8b..7ee9a0edc6d21708fc93dfa8913dc1ae9478dee3 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -484,6 +484,10 @@ static inline int Page##uname(const struct page *page) { return 0; } FOLIO_SET_FLAG_NOOP(lname) \ static inline void SetPage##uname(struct page *page) { } +#define __SETPAGEFLAG_NOOP(uname, lname) \ +static inline void __folio_set_##lname(struct folio *folio) { } \ +static inline void __SetPage##uname(struct page *page) { } + #define CLEARPAGEFLAG_NOOP(uname, lname) \ FOLIO_CLEAR_FLAG_NOOP(lname) \ static inline void ClearPage##uname(struct page *page) { } @@ -506,6 +510,9 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } #define TESTSCFLAG_FALSE(uname, lname) \ TESTSETFLAG_FALSE(uname, lname) TESTCLEARFLAG_FALSE(uname, lname) +#define __PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \ + __SETPAGEFLAG_NOOP(uname, lname) __CLEARPAGEFLAG_NOOP(uname, lname) + __PAGEFLAG(Locked, locked, PF_NO_TAIL) FOLIO_FLAG(waiters, FOLIO_HEAD_PAGE) FOLIO_FLAG(referenced, FOLIO_HEAD_PAGE) -- 2.47.1.613.gc27f4b7a9f-goog
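For context, the FALSE/NOOP macro families exist so that a flag can be compiled out while its accessors keep compiling. A sketch of the intended usage pattern - PageFoo/CONFIG_FOO are placeholders, not names from this series:

	#ifdef CONFIG_FOO
	__PAGEFLAG(Foo, foo, PF_ANY)
	#else
	__PAGEFLAG_FALSE(Foo, foo)
	#endif

	/* Either way, callers compile: */
	__SetPageFoo(page);
	if (PageFoo(page))
		__ClearPageFoo(page);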
[PATCH RFC v2 20/29] mm: asi: Make TLB flushing correct under ASI
This is the absolute minimum change for TLB flushing to be correct under ASI. There are two arguably orthogonal changes in here but they feel small enough for a single commit. .:: CR3 stabilization As noted in the comment ASI can destabilize CR3, but we can stabilize it again by calling asi_exit; this makes it safe to read CR3 and write it back. This is enough to be correct - we don't have to worry about invalidating the other ASI address space (i.e. we don't need to invalidate the restricted address space if we are currently unrestricted / vice versa) because we currently never set the noflush bit in CR3 for ASI transitions. Even without using CR3's noflush bit there are trivial optimizations still on the table here: where invpcid_flush_single_context is available (i.e. with the INVPCID_SINGLE feature) we can use that in lieu of the CR3 read/write, and avoid the extremely costly asi_exit. .:: Invalidating kernel mappings Before ASI, with KPTI off we always either disable PCID or use global mappings for kernel memory. However ASI disables global kernel mappings regardless of those factors. So we need to invalidate other address spaces to trigger a flush when we switch into them. Note that there is currently a pointless write of cpu_tlbstate.invalidate_other in the case of KPTI and !PCID. We've added another case of that (ASI, !KPTI and !PCID). I think that's preferable to expanding the conditional in flush_tlb_one_kernel. Signed-off-by: Brendan Jackman --- arch/x86/mm/tlb.c | 27 --- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index ce5598f96ea7a84dc0e8623022ab5bfbba401b48..07b1657bee8e4cf17452ea57c838823e76f482c0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -231,7 +231,7 @@ static void clear_asid_other(void) * This is only expected to be set if we have disabled * kernel _PAGE_GLOBAL pages. */ - if (!static_cpu_has(X86_FEATURE_PTI)) { + if (!static_cpu_has(X86_FEATURE_PTI) && !static_asi_enabled()) { WARN_ON_ONCE(1); return; } @@ -1040,7 +1040,6 @@ static void put_flush_tlb_info(void) noinstr u16 asi_pcid(struct asi *asi, u16 asid) { return kern_pcid(asid) | ((asi->class_id + 1) << X86_CR3_ASI_PCID_BITS_SHIFT); - // return kern_pcid(asid) | ((asi->index + 1) << X86_CR3_ASI_PCID_BITS_SHIFT); } void asi_flush_tlb_range(struct asi *asi, void *addr, size_t len) @@ -1192,15 +1191,19 @@ void flush_tlb_one_kernel(unsigned long addr) * use PCID if we also use global PTEs for the kernel mapping, and * INVLPG flushes global translations across all address spaces. * -* If PTI is on, then the kernel is mapped with non-global PTEs, and -* __flush_tlb_one_user() will flush the given address for the current -* kernel address space and for its usermode counterpart, but it does -* not flush it for other address spaces. +* If PTI or ASI is on, then the kernel is mapped with non-global PTEs, +* and __flush_tlb_one_user() will flush the given address for the +* current kernel address space and, if PTI is on, for its usermode +* counterpart, but it does not flush it for other address spaces. */ flush_tlb_one_user(addr); - if (!static_cpu_has(X86_FEATURE_PTI)) + /* Nothing more to do if PTI and ASI are completely off. */ + if (!static_cpu_has(X86_FEATURE_PTI) && !static_asi_enabled()) { + VM_WARN_ON_ONCE(static_cpu_has(X86_FEATURE_PCID) && + !(__default_kernel_pte_mask & _PAGE_GLOBAL)); return; + } /* * See above.
We need to propagate the flush to all other address @@ -1289,6 +1292,16 @@ STATIC_NOPV void native_flush_tlb_local(void) invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid)); + /* +* Restricted ASI CR3 is unstable outside of a critical section, so we +* can't flush via a CR3 read/write. asi_exit() stabilizes it. +* We don't expect any flushes in a critical section. +*/ + if (WARN_ON(asi_in_critical_section())) + native_flush_tlb_global(); + else + asi_exit(); + /* If current->mm == NULL then the read_cr3() "borrows" an mm */ native_write_cr3(__native_read_cr3()); } -- 2.47.1.613.gc27f4b7a9f-goog
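To sketch the INVPCID_SINGLE optimization mentioned in the commit message (this is not part of the patch; invpcid_flush_single_context() and kern_pcid() are existing helpers visible from tlb.c):

	/* Sketch of an alternative native_flush_tlb_local() tail. */
	static void sketch_flush_local(void)
	{
		if (static_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
			/* Flush the kernel PCID without touching CR3/ASI state. */
			invpcid_flush_single_context(
				kern_pcid(this_cpu_read(cpu_tlbstate.loaded_mm_asid)));
		} else {
			asi_exit();
			native_write_cr3(__native_read_cr3());
		}
	}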
[PATCH RFC v2 18/29] mm: asi: Map dynamic percpu memory as nonsensitive
From: Reiji Watanabe Currently, all dynamic percpu memory is implicitly (and unintentionally) treated as sensitive memory. Unconditionally map pages for dynamically allocated percpu memory as global nonsensitive memory, other than pages that are allocated for pcpu_{first,reserved}_chunk during early boot via the memblock allocator (these will be taken care of by the following patch). We don't support sensitive percpu memory allocation yet. Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid Signed-off-by: Reiji Watanabe Signed-off-by: Brendan Jackman WIP: Drop VM_SENSITIVE checks from percpu code --- mm/percpu-vm.c | 50 -- mm/percpu.c| 4 ++-- 2 files changed, 46 insertions(+), 8 deletions(-) diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c index cd69caf6aa8d8eded2395eb4bc4051b78ec6aa33..2935d7fbac41548819a94dcc60566bd18cde819a 100644 --- a/mm/percpu-vm.c +++ b/mm/percpu-vm.c @@ -132,11 +132,20 @@ static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk, pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end)); } -static void __pcpu_unmap_pages(unsigned long addr, int nr_pages) +static void ___pcpu_unmap_pages(unsigned long addr, int nr_pages) { vunmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT)); } +static void __pcpu_unmap_pages(unsigned long addr, int nr_pages, + unsigned long vm_flags) +{ + unsigned long size = nr_pages << PAGE_SHIFT; + + asi_unmap(ASI_GLOBAL_NONSENSITIVE, (void *)addr, size); + ___pcpu_unmap_pages(addr, nr_pages); +} + /** * pcpu_unmap_pages - unmap pages out of a pcpu_chunk * @chunk: chunk of interest @@ -153,6 +162,8 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages) static void pcpu_unmap_pages(struct pcpu_chunk *chunk, struct page **pages, int page_start, int page_end) { + struct vm_struct **vms = (struct vm_struct **)chunk->data; + unsigned long vm_flags = vms ? vms[0]->flags : VM_ALLOC; unsigned int cpu; int i; @@ -165,7 +176,7 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk, pages[pcpu_page_idx(cpu, i)] = page; } __pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start), - page_end - page_start); + page_end - page_start, vm_flags); } } @@ -190,13 +201,38 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk, pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end)); } -static int __pcpu_map_pages(unsigned long addr, struct page **pages, - int nr_pages) +/* + * __pcpu_map_pages() should not be called during the percpu initialization, + * as asi_map() depends on the page allocator (which isn't available yet + * during percpu initialization). Instead, ___pcpu_map_pages() can be used + * during the percpu initialization. But, any pages that are mapped with + * ___pcpu_map_pages() will be treated as sensitive memory, unless + * they are explicitly mapped with asi_map() later. + */ +static int ___pcpu_map_pages(unsigned long addr, struct page **pages, +int nr_pages) { return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT), PAGE_KERNEL, pages, PAGE_SHIFT); } +static int __pcpu_map_pages(unsigned long addr, struct page **pages, + int nr_pages, unsigned long vm_flags) +{ + unsigned long size = nr_pages << PAGE_SHIFT; + int err; + + err = ___pcpu_map_pages(addr, pages, nr_pages); + if (err) + return err; + + /* +* If this fails, pcpu_map_pages()->__pcpu_unmap_pages() will call +* asi_unmap() and clean up any partial mappings.
+*/ + return asi_map(ASI_GLOBAL_NONSENSITIVE, (void *)addr, size); +} + /** * pcpu_map_pages - map pages into a pcpu_chunk * @chunk: chunk of interest @@ -214,13 +250,15 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages, static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages, int page_start, int page_end) { + struct vm_struct **vms = (struct vm_struct **)chunk->data; + unsigned long vm_flags = vms ? vms[0]->flags : VM_ALLOC; unsigned int cpu, tcpu; int i, err; for_each_possible_cpu(cpu) { err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start), &pages[pcpu_page_idx(cpu, page_start)], - page_end - page_start); + page_end - page_start, vm_flags); if (err < 0) goto err; @@ -232,7 +270,7 @@
[PATCH RFC v2 22/29] mm: asi: exit ASI before accessing CR3 from C code where appropriate
Because asi_exit()s can be triggered by NMIs, CR3 is unstable when in the ASI restricted address space. (Exception: code in the ASI critical section can treat it as stable, because if an asi_exit() occurs during an exception it will be undone before the critical section resumes). Code that accesses CR3 needs to become aware of this. Most importantly: if code reads CR3 and then writes a derived value back, if a concurrent asi_exit() occurred in between then the address space switch would be undone, which would totally break ASI. So, make sure that an asi_exit() is performed before accessing CR3. Exceptions are made for cases that need to access the current CR3 value, restricted or not, without exiting ASI. (An alternative approach would be to enter an ASI critical section when a stable CR3 is needed. This would be worth exploring if the ASI exits introduced by this patch turned out to cause performance issues). Add calls to asi_exit() to __native_read_cr3() and native_write_cr3(), and introduce 'raw' variants that do not perform an ASI exit. Introduce similar variants for wrappers: __read_cr3(), read_cr3_pa(), and write_cr3(). Add a forward declaration of asi_exit(), because the one in asm-generic/asi.h is only declared when !CONFIG_ADDRESS_SPACE_ISOLATION, and arch/x86/asm/asi.h cannot be included either as it would cause a circular dependency. The 'raw' variants are used in the following cases: - In __show_regs() where the actual values of registers are dumped for debugging. - In dump_pagetable() and show_fault_oops() where the active page tables during a page fault are dumped for debugging. - In switch_mm_verify_cr3() and cr3_matches_current_mm() where the actual value of CR3 is needed for a debug check, and the code explicitly checks for ASI-restricted CR3. - In exc_page_fault() for ASI faults. The code is ASI-aware and explicitly performs an ASI exit before reading CR3. - In load_new_mm_cr3() where a new CR3 is loaded during context switching. At this point, it is guaranteed that ASI already exited. Calling asi_exit() at that point, where loaded_mm == LOADED_MM_SWITCHING, will cause VM_BUG_ON in asi_exit() to fire mistakenly even though loaded_mm is never accessed. - In __get_current_cr3_fast(), as it is called from an ASI critical section and the value is only used for debug checking. In nested_vmx_check_vmentry_hw(), do an explicit asi_exit() before calling __get_current_cr3_fast(), since in that case we are not in a critical section and do need a stable CR3 value. - In __asi_enter() and asi_exit() for obvious reasons. - In vmx_set_constant_host_state() when CR3 is initialized in the VMCS with the most likely value. Preemption is enabled, so once ASI supports context switching exiting ASI will not be reliable as rescheduling may cause re-entering ASI. It doesn't matter if the wrong value of CR3 is used in this context: before entering the guest, ASI is either explicitly entered or exited, and CR3 is updated again in the VMCS if needed. - In efi_5level_switch(), as it is called from startup_64_mixed_mode() during boot before ASI can be entered. startup_64_mixed_mode() is under arch/x86/boot/compressed/* and it cannot call asi_exit() anyway (see below). Finally, code in arch/x86/boot/compressed/ident_map_64.c and arch/x86/boot/compressed/pgtable_64.c extensively accesses CR3 during boot. This code under arch/x86/boot/compressed/* cannot call asi_exit() due to restriction on its compilation (it cannot use functions defined in .c files outside the directory).
Instead of changing all CR3 accesses to use 'raw' variants, undefine CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION in these files. This will make the asi_exit() calls in CR3 helpers use the noop variant defined in include/asm-generic/asi.h. This is fine because the code is executed early in boot where asi_exit() would be a noop anyway. With this change, the number of existing *_cr3() calls is 44, and the number of *_cr3_raw() calls is 22. The choice was made to have the existing functions exit ASI by default and to add new variants that do not exit ASI, because most call sites that use the new *_cr3_raw() variants are either ASI-aware code or debugging code. In contrast, code that uses the existing variants is usually in important code paths (e.g. TLB flushes) and is ignorant of ASI. Hence, new code is most likely to be correct and less risky by using the variants that exit ASI by default. Signed-off-by: Yosry Ahmed Signed-off-by: Brendan Jackman --- arch/x86/Kconfig| 2 +- arch/x86/boot/compressed/ident_map_64.c | 10 arch/x86/boot/compressed/pgtable_64.c | 11 + arch/x86/include/asm/processor.h| 5 arch/x86/include/asm/special_insns.h| 41 +++-- arch/x86/kernel/process_32.c| 2 +- arch/x86/kernel/process_64.c| 2 +- arch/x86/kvm/vmx/nested.c
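The header diff itself is truncated above; the shape of the split, reconstructed as a sketch from the description (not copied from the patch):

	/* Raw variant: reads whatever CR3 currently holds, restricted or not. */
	static __always_inline unsigned long __native_read_cr3_raw(void)
	{
		unsigned long val;

		asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
		return val;
	}

	/* Default variant: exit ASI first so the value read is stable. */
	static __always_inline unsigned long __native_read_cr3(void)
	{
		asi_exit();
		return __native_read_cr3_raw();
	}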
[PATCH RFC v2 23/29] mm: asi: exit ASI before suspend-like operations
From: Yosry Ahmed During suspend-like operations (suspend, hibernate, kexec w/ preserve_context), the processor state (including CR3) is usually saved and restored later. In the kexec case, this only happens when KEXEC_PRESERVE_CONTEXT is used to jump back to the original kernel. In relocate_kernel(), some registers including CR3 are stored in VA_CONTROL_PAGE. If preserve_context is set (passed into relocate_kernel() in RCX), after running the new kernel the code under 'virtual_mapped' restores these registers. This is similar to what happens in suspend and hibernate. Note that even when KEXEC_PRESERVE_CONTEXT is not set, relocate_kernel() still accesses CR3. It mainly reads and writes it to flush the TLB. This could be problematic and cause improper ASI enters (see below), but it is assumed to be safe because the kernel will essentially reboot in this case anyway. Saving and restoring CR3 in this fashion can cause a problem if the suspend/hibernate/kexec is performed within an ASI domain. A restricted CR3 will be saved, and later restored after ASI has potentially already exited (e.g. from an NMI after CR3 is stored). This will cause an _improper_ ASI enter, where code starts executing in a restricted address space, yet ASI metadata (especially curr_asi) says otherwise. Exit ASI early in all these paths by registering a syscore_suspend() callback. syscore_suspend() is called in all the above paths (for kexec, only with KEXEC_PRESERVE_CONTEXT) after IRQs are finally disabled before the operation. This is not currently strictly required but is convenient because when ASI gains the ability to persist across context switching, there will be additional synchronization requirements that this simplifies. Note: If the CR3 accesses in relocate_kernel() when KEXEC_PRESERVE_CONTEXT is not set are concerning, they could be handled by registering a syscore_shutdown() callback to exit ASI. syscore_shutdown() is called in the kexec path where KEXEC_PRESERVE_CONTEXT is not set starting commit 7bb943806ff6 ("kexec: do syscore_shutdown() in kernel_kexec"). Signed-off-by: Yosry Ahmed Signed-off-by: Brendan Jackman --- arch/x86/mm/asi.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index a9f9bfbf85eb47d16ef8d0bfbc7713f07052d3ed..c5073af1a82ded1c6fc467cd7a5d29a39d676bb4 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -6,6 +6,7 @@ #include #include +#include #include #include @@ -243,6 +244,32 @@ static int asi_map_percpu(struct asi *asi, void *percpu_addr, size_t len) return 0; } +#ifdef CONFIG_PM_SLEEP +static int asi_suspend(void) +{ + /* +* Must be called after IRQs are disabled and rescheduling is no longer +* possible (so that we cannot re-enter ASI before suspending). +*/ + lockdep_assert_irqs_disabled(); + + /* +* Suspend operations sometimes save CR3 as part of the saved state, +* which is restored later (e.g. do_suspend_lowlevel() in the suspend +* path, swsusp_arch_suspend() in the hibernate path, relocate_kernel() +* in the kexec path). Saving a restricted CR3 and restoring it later +* could lead to improperly entering ASI. Exit ASI before such +* operations.
+*/ + asi_exit(); + return 0; +} + +static struct syscore_ops asi_syscore_ops = { + .suspend = asi_suspend, +}; +#endif /* CONFIG_PM_SLEEP */ + static int __init asi_global_init(void) { int err; @@ -306,6 +333,10 @@ static int __init asi_global_init(void) asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd, VMEMMAP_START + (1UL << PGDIR_SHIFT)); +#ifdef CONFIG_PM_SLEEP + register_syscore_ops(&asi_syscore_ops); +#endif + return 0; } subsys_initcall(asi_global_init) -- 2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 21/29] KVM: x86: asi: Restricted address space for VM execution
An ASI restricted address space is added for KVM. This protects the userspace from attack by the guest, and the guest from attack by other processes. It doesn't attempt to prevent the guest from attack by the current process. This change incorporates an extra asi_exit at the end of vcpu_run. We expect later iterations of ASI to drop that call as we gain the ability to context switch within the ASI domain. Signed-off-by: Brendan Jackman --- arch/x86/include/asm/kvm_host.h | 3 ++ arch/x86/kvm/svm/svm.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 38 arch/x86/kvm/x86.c | 77 - 4 files changed, 105 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6d9f763a7bb9d5db422ea5625b2c28420bd14f26..00cda452dd6ca6ec57ff85ca194ee4aeb6af3be7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -37,6 +37,7 @@ #include #include #include +#include #define __KVM_HAVE_ARCH_VCPU_DEBUGFS @@ -1535,6 +1536,8 @@ struct kvm_arch { */ #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) struct kvm_mmu_memory_cache split_desc_cache; + + struct asi *asi; }; struct kvm_vm_stat { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9df3e1e5ae81a1346409632edd693cb7e0740f72..f2c3154292b4f6c960b490b0773f53bea43897bb 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4186,6 +4186,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in guest_state_enter_irqoff(); amd_clear_divider(); + asi_enter(vcpu->kvm->arch.asi); if (sev_es_guest(vcpu->kvm)) __svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted, @@ -4193,6 +4194,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in else __svm_vcpu_run(svm, spec_ctrl_intercepted); + asi_relax(); guest_state_exit_irqoff(); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index d28618e9277ede83ad2edc1b1778ea44123aa797..181d230b1c057fed33f7b29b7b0e378dbdfeb174 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -7282,14 +7283,34 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, unsigned int flags) { struct vcpu_vmx *vmx = to_vmx(vcpu); + unsigned long cr3; guest_state_enter_irqoff(); + asi_enter(vcpu->kvm->arch.asi); + + /* +* Refresh vmcs.HOST_CR3 if necessary. This must be done immediately +* prior to VM-Enter, as the kernel may load a new ASID (PCID) any time +* it switches back to the current->mm, which can occur in KVM context +* when switching to a temporary mm to patch kernel code, e.g. if KVM +* toggles a static key while handling a VM-Exit. +* Also, this must be done after asi_enter(), as it changes CR3 +* when switching address spaces. +*/ + cr3 = __get_current_cr3_fast(); + if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) { + vmcs_writel(HOST_CR3, cr3); + vmx->loaded_vmcs->host_state.cr3 = cr3; + } /* * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW * mitigation for MDS is done late in VMentry and is still * executed in spite of L1D Flush. This is because an extra VERW * should not matter much after the big hammer L1D Flush. +* +* This is only after asi_enter() for performance reasons. +* RFC: This also needs to be integrated with ASI's tainting model. 
*/ if (static_branch_unlikely(&vmx_l1d_should_flush)) vmx_l1d_flush(vcpu); @@ -7310,6 +7331,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vmx->idt_vectoring_info = 0; + asi_relax(); + vmx_enable_fb_clear(vmx); if (unlikely(vmx->fail)) { @@ -7338,7 +7361,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) { struct vcpu_vmx *vmx = to_vmx(vcpu); - unsigned long cr3, cr4; + unsigned long cr4; /* Record the guest's net vcpu time for enforced NMI injections. */ if (unlikely(!enable_vnmi && @@ -7381,19 +7404,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]); vcpu->arch.regs_dirty = 0; - /* -* Refresh vmcs.HOST_CR3 if necessary. This must be done immediately -* prior to VM-Enter, as the kernel may load a new ASID (PCID) any
[PATCH RFC v2 24/29] mm: asi: Add infrastructure for mapping userspace addresses
In preparation for sandboxing bare-metal processes, teach ASI to map userspace addresses into the restricted address space. Add a new policy helper to determine based on the class whether to do this. If the helper returns true, mirror userspace mappings into the ASI pagetables. Later, it will be possible for users who do not have a significant security boundary between KVM guests and their VMM process to take advantage of this to reduce mitigation costs when switching between those two domains - to illustrate this idea, it's now reflected in the KVM taint policy, although the KVM class is still hard-coded not to map userspace addresses. Co-developed-by: Junaid Shahid Signed-off-by: Junaid Shahid Co-developed-by: Reiji Watanabe Signed-off-by: Reiji Watanabe Signed-off-by: Brendan Jackman --- arch/x86/include/asm/asi.h| 11 + arch/x86/include/asm/pgalloc.h| 6 +++ arch/x86/include/asm/pgtable_64.h | 4 ++ arch/x86/kvm/x86.c| 12 +++-- arch/x86/mm/asi.c | 92 +++ include/asm-generic/asi.h | 4 ++ 6 files changed, 125 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index 555edb5f292e4d6baba782f51d014aa48dc850b6..e925d7d2cfc85bca8480c837548654e7a5a7009e 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -133,6 +133,7 @@ struct asi { struct mm_struct *mm; int64_t ref_count; enum asi_class_id class_id; + spinlock_t pgd_lock; }; DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi); @@ -147,6 +148,7 @@ const char *asi_class_name(enum asi_class_id class_id); int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_asi); void asi_destroy(struct asi *asi); +void asi_clone_user_pgtbl(struct mm_struct *mm, pgd_t *pgdp); /* Enter an ASI domain (restricted address space) and begin the critical section. */ void asi_enter(struct asi *asi); @@ -286,6 +288,15 @@ static __always_inline bool asi_in_critical_section(void) void asi_handle_switch_mm(void); +/* + * This function returns true when we would like to map userspace addresses + * in the restricted address space.
+ */ +static inline bool asi_maps_user_addr(enum asi_class_id class_id) +{ + return false; +} + #endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */ #endif diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index dcd836b59bebd329c3d265b98e48ef6eb4c9e6fc..edf9fe76c53369eefcd5bf14a09cbf802cf1ea21 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -114,12 +114,16 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud) { paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT); set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud))); + if (!pgtable_l5_enabled()) + asi_clone_user_pgtbl(mm, (pgd_t *)p4d); } static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pud) { paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT); set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud))); + if (!pgtable_l5_enabled()) + asi_clone_user_pgtbl(mm, (pgd_t *)p4d); } extern void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud); @@ -137,6 +141,7 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d) return; paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT); set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d))); + asi_clone_user_pgtbl(mm, pgd); } static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d) @@ -145,6 +150,7 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4 return; paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT); set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d))); + asi_clone_user_pgtbl(mm, pgd); } static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index d1426b64c1b9715cd9e4d1d7451ae4feadd8b2f5..fe6d83ec632a6894527784f2ebdbd013161c6f09 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -157,6 +157,8 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d) static inline void native_p4d_clear(p4d_t *p4d) { native_set_p4d(p4d, native_make_p4d(0)); + if (!pgtable_l5_enabled()) + asi_clone_user_pgtbl(NULL, (pgd_t *)p4d); } static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) @@ -167,6 +169,8 @@ static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) static inline void native_pgd_clear(pgd_t *pgd) { native_set_pgd(pgd, native_make_pgd(0)); + if (pgtable_l5_enabled()) + asi_clone_user_pgtbl(NULL, pgd); } /* diff --git a/arch/x86/kvm/x86
[PATCH RFC v2 25/29] mm: asi: Restricted execution for bare-metal processes
Now userspace gets a restricted address space too. The critical section begins on exit to userspace and ends when it makes a system call. Other entries from userspace just interrupt the critical section via asi_intr_enter(). The reason why system calls have to actually asi_relax() (i.e. fully terminate the critical section instead of just interrupting it) is that system calls are the type of kernel entry that can lead to transition into a _different_ ASI domain, namely the KVM one: it is not supported to transition into a different domain while a critical section exists (i.e. while asi_state.target is not NULL), even if it has been paused by asi_intr_enter() (i.e. even if asi_state.intr_nest_depth is nonzero) - there must be an asi_relax() between any two asi_enter()s. The restricted address space for bare-metal tasks naturally contains the entire userspace address region, although the task's own memory is still missing from the direct map. This implementation creates new userspace-specific APIs for asi_init(), asi_destroy() and asi_enter(), which seems a little ugly; maybe this suggests a general rework of these APIs given that the "generic" version only has one caller. For RFC code this seems good enough though. Signed-off-by: Brendan Jackman --- arch/x86/include/asm/asi.h | 8 ++-- arch/x86/mm/asi.c| 49 include/asm-generic/asi.h| 9 +++- include/linux/entry-common.h | 11 ++ init/main.c | 2 ++ kernel/entry/common.c| 1 + kernel/fork.c| 4 +++-1 7 files changed, 76 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h index e925d7d2cfc85bca8480c837548654e7a5a7009e..c3c1a57f0147ae9bd11d89c8bf7c8a4477728f51 100644 --- a/arch/x86/include/asm/asi.h +++ b/arch/x86/include/asm/asi.h @@ -140,19 +140,23 @@ DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi); void asi_check_boottime_disable(void); -void asi_init_mm_state(struct mm_struct *mm); +int asi_init_mm_state(struct mm_struct *mm); int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_policy); +void asi_init_userspace_class(void); void asi_uninit_class(enum asi_class_id class_id); const char *asi_class_name(enum asi_class_id class_id); int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_asi); void asi_destroy(struct asi *asi); +void asi_destroy_userspace(struct mm_struct *mm); void asi_clone_user_pgtbl(struct mm_struct *mm, pgd_t *pgdp); /* Enter an ASI domain (restricted address space) and begin the critical section. */ void asi_enter(struct asi *asi); +void asi_enter_userspace(void); + /* * Leave the "tense" state if we are in it, i.e. end the critical section. We * will stay relaxed until the next asi_enter.
@@ -294,7 +298,7 @@ void asi_handle_switch_mm(void); */ static inline bool asi_maps_user_addr(enum asi_class_id class_id) { - return false; + return class_id == ASI_CLASS_USERSPACE; } #endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */ diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 093103c1bc2677c81d68008aca064fab53b73a62..1e9dc568e79e8686a4dbf47f765f2c2535d025ec 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -25,6 +25,7 @@ const char *asi_class_names[] = { #if IS_ENABLED(CONFIG_KVM) [ASI_CLASS_KVM] = "KVM", #endif + [ASI_CLASS_USERSPACE] = "userspace", }; DEFINE_PER_CPU_ALIGNED(struct asi *, curr_asi); @@ -67,6 +68,32 @@ int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_po } EXPORT_SYMBOL_GPL(asi_init_class); +void __init asi_init_userspace_class(void) +{ + static struct asi_taint_policy policy = { + /* +* Prevent going to userspace with sensitive data potentially +* left in sidechannels by code running in the unrestricted +* address space, or another MM. Note we don't check for guest +* data here. This reflects the assumption that the guest trusts +* its VMM (absent fancy HW features, which are orthogonal). +*/ + .protect_data = ASI_TAINT_KERNEL_DATA | ASI_TAINT_OTHER_MM_DATA, + /* +* Don't go into userspace with control flow state controlled by +* other processes, or any KVM guest the process is running. +* Note this bit is about protecting userspace from other parts +* of the system, while data_taints is about protecting other +* parts of the system from the guest. +*/ + .prevent_control = ASI_TAINT_GUEST_CONTROL | ASI_TAINT_OTHER_MM_CONTROL, + .set = ASI_TAINT_USER_CONTROL | ASI_TAINT_USER_DATA, + }; + int err = asi_init_class(ASI_CLASS_USERSPACE, &policy
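The entry-code diff is truncated above, so here is a sketch of the hook placement described in the commit message. The function names are from the series, but the exact call sites are illustrative:

	/* On the exit-to-userspace path: the critical section begins. */
	static __always_inline void sketch_exit_to_user_mode(void)
	{
		asi_enter_userspace();
	}

	/*
	 * On syscall entry: fully end the critical section, because a syscall
	 * may lead into a different ASI domain (e.g. KVM's) and there must be
	 * an asi_relax() between any two asi_enter()s. All other entries
	 * (IRQ, #PF, NMI) merely pause it via asi_intr_enter()/asi_intr_exit().
	 */
	static __always_inline void sketch_syscall_entry(void)
	{
		asi_relax();
	}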
[PATCH RFC v2 26/29] x86: Create library for flushing L1D for L1TF
ASI will need to use this L1D flushing logic so put it in a library where it can be used independently of KVM. Since we're creating this library, it starts to look messy if we don't also use it in the double-opt-in (both kernel cmdline and prctl) mm-switching flush logic which is there for mitigating Snoop-Assisted L1 Data Sampling ("SAL1DS"). However, that logic doesn't use any software-based fallback for flushing on CPUs without the L1D_FLUSH command. In that case the prctl opt-in will fail. One option would be to just start using the software fallback sequence currently done by VMX code, but Linus didn't seem happy with a similar sequence being used here [1]. CPUs affected by SAL1DS are a subset of those affected by L1TF, so it wouldn't be completely insane to assume that the same sequence works for both cases, but I'll err on the side of caution and avoid risk of giving users a false impression that the kernel has really flushed L1D for them. [1] https://lore.kernel.org/linux-kernel/CAHk-=whc4puhercodhcbtodmppy-pj8j9ytsdcyz9torob4...@mail.gmail.com/ Instead, create this awkward library that is scoped specifically to L1TF, which will be used only by VMX and ASI, and has an annoying "only sometimes works" doc-comment. Users of the library can then infer from that comment whether they have flushed L1D. No functional change intended. Checkpatch-args: --ignore=COMMIT_LOG_LONG_LINE Signed-off-by: Brendan Jackman --- arch/x86/Kconfig| 4 ++ arch/x86/include/asm/l1tf.h | 11 ++ arch/x86/kvm/Kconfig| 1 + arch/x86/kvm/vmx/vmx.c | 66 +++ arch/x86/lib/Makefile | 1 + arch/x86/lib/l1tf.c | 94 + 6 files changed, 117 insertions(+), 60 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ae31f36ce23d7c29d1e90b726c5a2e6ea5a63c8d..ca984dc7ee2f2b68c3ce1bcb5055047ca4f2a65d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2523,6 +2523,7 @@ config MITIGATION_ADDRESS_SPACE_ISOLATION bool "Allow code to run with a reduced kernel address space" default n depends on X86_64 && !PARAVIRT && !UML + select X86_L1TF_FLUSH_LIB help This feature provides the ability to run some kernel code with a reduced kernel address space. This can be used to @@ -3201,6 +3202,9 @@ config HAVE_ATOMIC_IOMAP def_bool y depends on X86_32 +config X86_L1TF_FLUSH_LIB + def_bool n + source "arch/x86/kvm/Kconfig" source "arch/x86/Kconfig.assembler" diff --git a/arch/x86/include/asm/l1tf.h b/arch/x86/include/asm/l1tf.h new file mode 100644 index ..e0be19c588bb5ec5c76a1861492e48b88615b4b8 --- /dev/null +++ b/arch/x86/include/asm/l1tf.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_L1TF_FLUSH_H +#define _ASM_L1TF_FLUSH_H + +#ifdef CONFIG_X86_L1TF_FLUSH_LIB +int l1tf_flush_setup(void); +void l1tf_flush(void); +#endif /* CONFIG_X86_L1TF_FLUSH_LIB */ + +#endif + diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index f09f13c01c6bbd28fa37fdf50547abf4403658c9..81c71510e33e52447882ab7b22682199c57b492e 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -92,6 +92,7 @@ config KVM_SW_PROTECTED_VM config KVM_INTEL tristate "KVM for Intel (and compatible) processors support" depends on KVM && IA32_FEAT_CTL + select X86_L1TF_FLUSH_LIB help Provides support for KVM on processors equipped with Intel's VT extensions, a.k.a. Virtual Machine Extensions (VMX). 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0e90463f1f2183b8d716f85d5c8a8af8958fef0b..b1a02f27b3abce0ef6ac448b66bef2c653a52eef 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include #include @@ -250,9 +251,6 @@ static void *vmx_l1d_flush_pages; static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) { - struct page *page; - unsigned int i; - if (!boot_cpu_has_bug(X86_BUG_L1TF)) { l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED; return 0; @@ -288,26 +286,11 @@ static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) l1tf = VMENTER_L1D_FLUSH_ALWAYS; } - if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages && - !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) { - /* -* This allocation for vmx_l1d_flush_pages is not tied to a VM -* lifetime and so should not be charged to a memcg. -*/ - page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER); - if (!page) - return -ENOMEM; -
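Usage of the library then looks something like the sketch below (a hypothetical caller; per the doc-comment, l1tf_flush() may only be called once l1tf_flush_setup() has succeeded):

	#include <asm/l1tf.h>

	static bool can_flush_l1d __ro_after_init;

	static int __init sketch_init(void)
	{
		/* Returns 0 on success, e.g. fails if fallback pages can't be allocated. */
		can_flush_l1d = !l1tf_flush_setup();
		return 0;
	}

	static void sketch_before_untrusted_transition(void)
	{
		if (can_flush_l1d)
			l1tf_flush();
	}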
[PATCH RFC v2 28/29] x86/pti: Disable PTI when ASI is on
Now that ASI has support for sandboxing userspace, we can disable PTI when ASI is on: although userspace now has much more mapped than it would under KPTI, in theory none of that data is important to protect. Note that one particular impact of this is it makes locally defeating KASLR easier. I don't think this is a great loss given [1] etc. Why do we pass in an argument instead of just having pti_check_boottime_disable() check boot_cpu_has(X86_FEATURE_ASI)? Just for clarity: I wanted it to be at least _sort of_ visible that it would break if you reordered asi_check_boottime_disable() afterwards. [1]: https://gruss.cc/files/prefetch.pdf and https://dl.acm.org/doi/pdf/10.1145/3623652.3623669 Signed-off-by: Brendan Jackman --- arch/x86/include/asm/pti.h | 6 -- arch/x86/mm/init.c | 2 +- arch/x86/mm/pti.c | 14 +- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index ab167c96b9ab474b33d778453db0bb550f42b0ac..79b9ba927db9b76ac3cc72cdda6f8b5fc413d352 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -3,12 +3,14 @@ #define _ASM_X86_PTI_H #ifndef __ASSEMBLY__ +#include + #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION extern void pti_init(void); -extern void pti_check_boottime_disable(void); +extern void pti_check_boottime_disable(bool asi_enabled); extern void pti_finalize(void); #else -static inline void pti_check_boottime_disable(void) { } +static inline void pti_check_boottime_disable(bool asi_enabled) { } #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index ded3a47f2a9c1f554824d4ad19f3b48bce271274..4ccf6d60705652805342abefc5e71cd00c563207 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -754,8 +754,8 @@ void __init init_mem_mapping(void) { unsigned long end; - pti_check_boottime_disable(); asi_check_boottime_disable(); + pti_check_boottime_disable(boot_cpu_has(X86_FEATURE_ASI)); probe_page_size_mask(); setup_pcid(); diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 851ec8f1363a8b389ea4579cc68bf3300a4df27c..b7132080d3c9b6962a0252383190335e171bafa6 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -76,7 +76,7 @@ static enum pti_mode { PTI_FORCE_ON } pti_mode; -void __init pti_check_boottime_disable(void) +void __init pti_check_boottime_disable(bool asi_enabled) { if (hypervisor_is_type(X86_HYPER_XEN_PV)) { pti_mode = PTI_FORCE_OFF; @@ -91,6 +91,18 @@ void __init pti_check_boottime_disable(void) return; } + if (asi_enabled) { + /* +* Having both ASI and PTI enabled is not a totally ridiculous +* thing to do; if you want ASI but you are not confident in the +* sensitivity annotations then it provides useful +* defence-in-depth. But, the implementation doesn't support it. +*/ + if (pti_mode != PTI_FORCE_OFF) + pti_print_if_insecure("disabled by ASI"); + return; + } + if (pti_mode == PTI_FORCE_ON) pti_print_if_secure("force enabled on command line."); -- 2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 19/29] mm: asi: Stabilize CR3 in switch_mm_irqs_off()
An ASI-restricted CR3 is unstable as interrupts can cause ASI-exits. Although we already unconditionally ASI-exit during context-switch, and before returning from the VM-run path, it's still possible to reach switch_mm_irqs_off() in a restricted context, because KVM code updates static keys, which requires using a temporary mm. Signed-off-by: Brendan Jackman --- arch/x86/mm/tlb.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index c55733e144c7538ce7f97b74ea2b1b9c22497c32..ce5598f96ea7a84dc0e8623022ab5bfbba401b48 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -546,6 +546,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, bool need_flush; u16 new_asid; + /* Stabilize CR3 before reading or writing it */ + asi_exit(); + /* We don't want flush_tlb_func() to run concurrently with us. */ if (IS_ENABLED(CONFIG_PROVE_LOCKING)) WARN_ON_ONCE(!irqs_disabled()); -- 2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 29/29] mm: asi: Stop ignoring asi=on cmdline flag
At this point the minimum requirements are in place for the kernel to operate correctly with ASI enabled. Signed-off-by: Brendan Jackman --- arch/x86/mm/asi.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index f10f6614b26148e5ba423d8a44f640674573ee40..3e3956326936ea8550308ad004dbbb3738546f9f 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -207,14 +207,14 @@ void __init asi_check_boottime_disable(void) pr_info("ASI disabled through kernel command line.\n"); } else if (ret == 2 && !strncmp(arg, "on", 2)) { enabled = true; - pr_info("Ignoring asi=on param while ASI implementation is incomplete.\n"); + pr_info("ASI enabled through kernel command line.\n"); } else { pr_info("ASI %s by default.\n", enabled ? "enabled" : "disabled"); } if (enabled) - pr_info("ASI enablement ignored due to incomplete implementation.\n"); + setup_force_cpu_cap(X86_FEATURE_ASI); } /* -- 2.47.1.613.gc27f4b7a9f-goog
[PATCH RFC v2 27/29] mm: asi: Add some mitigations on address space transitions
Here ASI actually starts becoming a real exploit mitigation. On CPUs with L1TF, flush L1D when the ASI data taints say so. On all CPUs, do some general branch predictor clearing whenever the control taints say so. This policy is very much just a starting point for discussion. Primarily it's a vague gesture at the fact that there is leeway in how ASI is used: it can be used to target CPU-specific issues (as is the case for L1TF here), or it can be used as a fairly broad mitigation (asi_maybe_flush_control() mitigates several known Spectre-style attacks and very likely also some unknown ones). Signed-off-by: Brendan Jackman --- arch/x86/include/asm/nospec-branch.h | 2 ++ arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/lib/l1tf.c | 2 ++ arch/x86/lib/retpoline.S | 10 ++ arch/x86/mm/asi.c| 29 + 5 files changed, 36 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 96b410b1d4e841eb02f53a4691ee794ceee4ad2c..4582fb1fb42f6fd226534012d969ed13085e943a 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -614,6 +614,8 @@ static __always_inline void mds_idle_clear_cpu_buffers(void) mds_clear_cpu_buffers(); } +extern void fill_return_buffer(void); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_NOSPEC_BRANCH_H_ */ diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index b1a02f27b3abce0ef6ac448b66bef2c653a52eef..a532783caaea97291cd92a2e2cac617f74f76c7e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6635,6 +6635,7 @@ int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) return ret; } +/* Must be reentrant, for use by vmx_post_asi_enter. */ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu) { /* diff --git a/arch/x86/lib/l1tf.c b/arch/x86/lib/l1tf.c index c474f18ae331c8dfa7a029c457dd3cf75bebf808..ffe1c3d0ef43ff8f1781f2e446aed041f4ce3179 100644 --- a/arch/x86/lib/l1tf.c +++ b/arch/x86/lib/l1tf.c @@ -46,6 +46,8 @@ EXPORT_SYMBOL(l1tf_flush_setup); * - may or may not work on other CPUs. * * Don't call unless l1tf_flush_setup() has returned successfully. + * + * Must be reentrant, for use by ASI.
*/ noinstr void l1tf_flush(void) { diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S index 391059b2c6fbc4a571f0582c7c4654147a930cef..6d126fff6bf839889086fe21464d8af07316d7e5 100644 --- a/arch/x86/lib/retpoline.S +++ b/arch/x86/lib/retpoline.S @@ -396,3 +396,13 @@ SYM_CODE_END(__x86_return_thunk) EXPORT_SYMBOL(__x86_return_thunk) #endif /* CONFIG_MITIGATION_RETHUNK */ + +.pushsection .noinstr.text, "ax" +SYM_CODE_START(fill_return_buffer) + UNWIND_HINT_FUNC + ENDBR + __FILL_RETURN_BUFFER(%_ASM_AX,RSB_CLEAR_LOOPS) + RET +SYM_CODE_END(fill_return_buffer) +__EXPORT_THUNK(fill_return_buffer) +.popsection diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c index 1e9dc568e79e8686a4dbf47f765f2c2535d025ec..f10f6614b26148e5ba423d8a44f640674573ee40 100644 --- a/arch/x86/mm/asi.c +++ b/arch/x86/mm/asi.c @@ -10,6 +10,7 @@ #include #include +#include #include #include #include @@ -38,6 +39,8 @@ struct asi __asi_global_nonsensitive = { .mm = &init_mm, }; +static bool do_l1tf_flush __ro_after_init; + static inline bool asi_class_id_valid(enum asi_class_id class_id) { return class_id >= 0 && class_id < ASI_MAX_NUM_CLASSES; @@ -361,6 +364,15 @@ static int __init asi_global_init(void) asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd, VMEMMAP_START + (1UL << PGDIR_SHIFT)); + if (boot_cpu_has_bug(X86_BUG_L1TF)) { + int err = l1tf_flush_setup(); + + if (err) + pr_warn("Failed to setup L1TF flushing for ASI (%pe)", ERR_PTR(err)); + else + do_l1tf_flush = true; + } + #ifdef CONFIG_PM_SLEEP register_syscore_ops(&asi_syscore_ops); #endif @@ -512,10 +524,12 @@ static __always_inline void maybe_flush_control(struct asi *next_asi) if (!taints) return; - /* -* This is where we'll do the actual dirty work of clearing uarch state. -* For now we just pretend, clear the taints. -*/ + /* Clear normal indirect branch predictions, if we haven't */ + if (cpu_feature_enabled(X86_FEATURE_IBPB)) + __wrmsr(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, 0); + + fill_return_buffer(); + this_cpu_and(asi_taints, ~ASI_TAINTS_CONTROL_MASK); } @@ -536,10 +550,9 @@ static __always_inline void maybe_flush_data(struct asi *next_asi) if (!taints) return; - /* -* This is where we'll do the actual dirty work of clearing uarch state. -
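The diff cuts off inside maybe_flush_data(); by analogy with maybe_flush_control() above, and given the do_l1tf_flush flag set up in asi_global_init(), the data side plausibly becomes something like this sketch (ASI_TAINTS_DATA_MASK is assumed by symmetry with the control mask; it is not shown in the excerpt):

	/* Sketch: flush L1D if tainted data could be left in the cache. */
	if (do_l1tf_flush)
		l1tf_flush();

	this_cpu_and(asi_taints, ~ASI_TAINTS_DATA_MASK);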
Re: [PATCH RFC v2 01/29] mm: asi: Make some utility functions noinstr compatible
On Thu, 16 Jan 2025 at 01:21, Borislav Petkov wrote: > > Unfortunately Thomas pointed out this will prevent the function from > > being inlined at call sites in .text. > > > > So far I haven't been able[1] to find a formulation that lets us : > > 1. avoid calls from .noinstr.text -> .text, > > 2. while also letting the compiler freely decide what to inline. > > > > 1 is a functional requirement so here I'm just giving up on 2. Existing > > callsites of this code are just forced inline. For the incoming code > > that needs to call it from noinstr, they will be out-of-line calls. > > I'm not sure some of that belongs in the commit message - if you want to have > it in the submission, you should put it under the --- line below, right above > the diffstat. Sure. I'm actually not sure that for a [PATCH]-quality thing this cross-cutting commit even makes sense - once we've decided on the general way to solve this problem, perhaps the changes should just be part of the commit that needs them? It feels messy to have a patch that "does multiple things", but on the other hand it might be annoying to review a patch that says "make a load of random changes across the kernel, which are needed at various points in various upcoming patches, trust me". Do you have any opinion on that? (BTW, since a comment you made on another series (can't find it on Lore...), I've changed my writing style to avoid stuff like this in comments & commit messages in general, but this text all predates that. I'll do my best to sort all that stuff out before I send anything as a [PATCH].) On Thu, 16 Jan 2025 at 11:29, Borislav Petkov wrote: > > On Thu, Jan 16, 2025 at 01:18:58AM +0100, Borislav Petkov wrote: > > Long story short, lemme try to poke around tomorrow to try to figure out > > what > > actually happens. It could be caused by the part of Rik's patches and this > > one > > inlining things. We'll see... > > Looks transient... The very similar guest boots fine on another machine. Let's > watch this... Oh, I didn't notice your update until now. But yeah I also couldn't reproduce it on a Sapphire Rapids machine and on QEMU with this patch applied on top of tip/master (37bc915c6ad0f).
Re: [PATCH RFC v2 16/29] mm: asi: Map kernel text and static data as nonsensitive
On Fri, 10 Jan 2025 at 19:41, Brendan Jackman wrote:
> +	asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd, VMEMMAP_START);
> +	asi_clone_pgd(asi_global_nonsensitive_pgd, init_mm.pgd,
> +		      VMEMMAP_START + (1UL << PGDIR_SHIFT));

There's a bug here that Yosry has fixed in our internal version; I neglected
to incorporate that here. Under KASLR, vmemmap is not necessarily exactly 2
PGDs like this is assuming. In fact it can share a PGD entry with the vmalloc
area. So to be correct, this cloning logic needs to actually look at the
alignment and then navigate the page table hierarchy appropriately (a sketch
of the shape of that is below). To be fixed for the next version.

As Yosry noted internally, we also need to think about vmemmap getting
updated under memory hotplug.
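A hedged sketch of that shape, assuming asi_clone_pgd() can be called once
per covered PGD entry and that a real end bound for the region is available
(neither the helper below nor its "end" parameter exists in the posted code):

	/*
	 * Hypothetical sketch only: clone one PGD entry per entry actually
	 * covered by [start, end), instead of assuming the region spans
	 * exactly two PGDIR-aligned entries.
	 *
	 * Note this still clones whole PGD entries; where vmemmap shares an
	 * entry with vmalloc, the real fix would need to descend to lower
	 * page table levels rather than cloning the shared entry wholesale.
	 */
	static void __init asi_clone_pgd_range(pgd_t *dst_pgd, pgd_t *src_pgd,
					       unsigned long start,
					       unsigned long end)
	{
		unsigned long addr = start;

		do {
			/* One clone per entry, however KASLR aligned it. */
			asi_clone_pgd(dst_pgd, src_pgd, addr);
			addr = pgd_addr_end(addr, end);
		} while (addr < end);
	}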
Re: [PATCH RFC v2 04/29] mm: asi: Add infrastructure for boot-time enablement
On Wed Mar 19, 2025 at 6:47 PM UTC, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 06:29:35PM +0100, Borislav Petkov wrote:
> > On Fri, Jan 10, 2025 at 06:40:30PM +0000, Brendan Jackman wrote:
> > > Add a boot time parameter to control the newly added X86_FEATURE_ASI.
> > > "asi=on" or "asi=off" can be used in the kernel command line to enable
> > > or disable ASI at boot time. If not specified, ASI enablement depends
> > > on CONFIG_ADDRESS_SPACE_ISOLATION_DEFAULT_ON, which is off by default.
> >
> > I don't know yet why we need this default-on thing...
>
> It's a convenience to avoid needing to set asi=on if you want ASI to be
> on by default. It's similar to HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> or ZSWAP_DEFAULT_ON.
>
> [..]
> > > @@ -175,7 +184,11 @@ static __always_inline bool asi_is_restricted(void)
> > >  	return (bool)asi_get_current();
> > >  }
> > >
> > > -/* If we exit/have exited, can we stay that way until the next asi_enter? */
> > > +/*
> > > + * If we exit/have exited, can we stay that way until the next asi_enter?
> >
> > What is that supposed to mean here?
>
> asi_is_relaxed() checks if the thread is outside an ASI critical
> section.
>
> I say "the thread" because it will also return true if we are executing
> an interrupt that arrived during the critical section, even though the
> interrupt handler is not technically part of the critical section.
>
> Now the reason it says "if we exit we stay that way" is probably
> referring to the fact that an asi_exit() when interrupting a critical
> section will be undone in the interrupt epilogue by re-entering ASI.
>
> I agree the wording here is confusing. We should probably describe this
> more explicitly and probably rename the function after the API
> discussions you had in the previous patch.

Yeah, this is confusing. It's trying to very concisely define the concept of
"relaxed", but now I see it through Boris' eyes I realise it's really
unhelpful to try and do that. And yeah, we should probably just rework the
terminology/API.

To re-iterate what Yosry said: aside from my too-clever comment style, the
more fundamental source of confusion is that, in the terminology currently
used by the code, there are two concepts at play:

- The critical section: the path from asi_enter() to asi_relax(). The
  critical section can be interrupted, and code running in those interrupts
  is not said to be "in the critical section".

- Being "tense" vs "relaxed": being "tense" means the _task_ is in a
  critical section, but the code currently running might not be.

This distinction is theoretically relevant because e.g. it's a bug to access
sensitive data in a critical section, but it's OK to access it while in the
tense state (we will switch to the restricted address space, but this is OK
because we will have a chance to asi_enter() again before we get back to the
untrusted code).

BTW, just to be clear:

1. Both of these are only relevant to code that's pretty deeply aware of ASI
   (TLB flushing code, entry code, stuff like that).

2. To be honest, whenever you write:

	if (asi_in_critical_section())

   you probably mean:

	if (WARN_ON(asi_in_critical_section()))

   For example, if we try to flush the TLB in the critical section, there's
   a thing we can do to handle it. But that really shouldn't be necessary:
   we want the critical section code to be very small and straight-line.
   And indeed in the present code we don't use asi_in_critical_section()
   for anything but WARNing.

> asi_is_relaxed() checks if the thread is outside an ASI critical
> section.
Now I see it written this way, this is probably the best way to conceptualise
it. Instead of having two concepts, "tense/relaxed" vs "ASI critical
section", we could just say "the task is in a critical section" vs "the CPU
is in a critical section". So we could have something like (a hedged sketch
follows below):

	bool asi_task_critical(void);
	bool asi_cpu_critical(void);

(They could also accept an argument for the task/CPU, but I can't see any
reason why you'd peek at another context like that.)

--

For everything else, Ack to Boris or +1 to Yosry respectively.
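A minimal sketch of that proposed pair, purely to pin down the semantics
discussed above. The per-task flag and the use of in_interrupt() here are
assumptions for illustration; no such field exists in the posted series:

	/*
	 * Hypothetical sketch only. Assumes asi_enter()/asi_relax()
	 * maintain a per-task "asi_critical" flag (invented here), and
	 * uses in_interrupt() to detect an interrupt that arrived on top
	 * of the critical section.
	 */
	static inline bool asi_task_critical(void)
	{
		/* The task called asi_enter() and hasn't asi_relax()ed. */
		return current->thread.asi_critical;
	}

	static inline bool asi_cpu_critical(void)
	{
		/*
		 * The CPU is actually executing critical-section code:
		 * the task is critical and we are not in an interrupt
		 * handler that interrupted the section.
		 */
		return asi_task_critical() && !in_interrupt();
	}

Under this framing, "tense" is exactly asi_task_critical(), and the current
"in critical section" check is asi_cpu_critical().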
Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API
On Wed, 19 Feb 2025 at 11:57, Borislav Petkov wrote:
> > + * Runtime usage:
> > + *
> > + * 1. Call asi_enter() to switch to the restricted address space. This can't be
> > + *    from an interrupt or exception handler and preemption must be disabled.
> > + *
> > + * 2. Execute untrusted code.
> > + *
> > + * 3. Call asi_relax() to inform the ASI subsystem that untrusted code execution
> > + *    is finished. This doesn't cause any address space change. This can't be
> > + *    from an interrupt or exception handler and preemption must be disabled.
> > + *
> > + * 4. Either:
> > + *
> > + *    a. Go back to 1.
> > + *
> > + *    b. Call asi_exit() before returning to userspace. This immediately
> > + *       switches to the unrestricted address space.
>
> So only from reading this, it does sound weird. Maybe the code does it
> differently - I'll see soon - but this basically says:
>
> I asi_enter(), do something, asi_relax() and then I decide to do something
> more and to asi_enter() again!? And then I can end it all with a *single*
> asi_exit() call?
>
> Hm, definitely weird API. Why?

OK, sounds like I need to rewrite this explanation! It's only been read
before by people who already knew how this thing worked, so this might take
a few attempts to make it clear.

Maybe the best way to make it clear is to explain this with reference to
KVM. At a super high level, that looks like:

	ioctl(KVM_RUN) {
		enter_from_user_mode()
		while (!need_userspace_handling()) {
			asi_enter();	// part 1
			vmenter();	// part 2
			asi_relax();	// part 3
		}
		asi_exit();		// part 4b
		exit_to_user_mode()
	}

So part 4a is just referring to continuation of the loop. This explanation
was written when that was the only user of this API, so it was probably
clearer; now we have userspace it seems a bit odd. With my pseudocode above,
does it make more sense? If so, I'll try to think of a better way to explain
it.

> /*
>  * Leave the "tense" state if we are in it, i.e. end the critical section. We
>  * will stay relaxed until the next asi_enter.
>  */
> void asi_relax(void);
>
> Yeah, so there's no API functions balance between enter() and relax()...

asi_enter() is actually balanced with asi_relax(). The comment says "if we
are in it" because technically, if you call asi_relax() outside of the
critical section, it's a nop. But there's no reason to do that, so we could
definitely change the comment and WARN if that happens.

> > +#define ASI_TAINT_OTHER_MM_CONTROL	((asi_taints_t)BIT(6))
> > +#define ASI_NUM_TAINTS			6
> > +static_assert(BITS_PER_BYTE * sizeof(asi_taints_t) >= ASI_NUM_TAINTS);
>
> Why is this a typedef at all to make the code more unreadable than it needs to
> be? Why not a simple unsigned int or char or whatever you need?

My thinking was just that it's nicer to see asi_taints_t and know that it
means "it holds taint flags and it's big enough", instead of having to
remember the space needed for these flags. But yeah, I'm fine with making it
a raw integer type.
> > +#define ASI_TAINTS_CONTROL_MASK \
> > +	(ASI_TAINT_USER_CONTROL | ASI_TAINT_GUEST_CONTROL | ASI_TAINT_OTHER_MM_CONTROL)
> > +
> > +#define ASI_TAINTS_DATA_MASK \
> > +	(ASI_TAINT_KERNEL_DATA | ASI_TAINT_USER_DATA | ASI_TAINT_OTHER_MM_DATA)
> > +
> > +#define ASI_TAINTS_GUEST_MASK (ASI_TAINT_GUEST_DATA | ASI_TAINT_GUEST_CONTROL)
> > +#define ASI_TAINTS_USER_MASK (ASI_TAINT_USER_DATA | ASI_TAINT_USER_CONTROL)
> > +
> > +/* The taint policy tells ASI how a class interacts with the CPU taints */
> > +struct asi_taint_policy {
> > +	/*
> > +	 * What taints would necessitate a flush when entering the domain, to
> > +	 * protect it from attack by prior domains?
> > +	 */
> > +	asi_taints_t prevent_control;
>
> So if those necessitate a flush, why isn't this var called "taints_to_flush"
> or whatever which exactly explains what it is?

Well, it needs to be disambiguated from the field below (currently
protect_data), but it could be control_to_flush (and data_to_flush). The
downside of that is that having one say "prevent" and one say "protect" is
quite meaningful: prevent_control is describing things we need to do to
protect the system from this domain, protect_data is about protecting the
domain from the system. However, while that difference is meaningful, it
might not actually be helpful for the reader of the code, so I'm not wed
to it.

Also worth noting that we could just combine these fields. At present they
should have disjoint bits set. But they're used in separate contexts and
have separate (although conceptually very similar) meanings, so I think that
would reduce clarity.

> Spellchecker please. Go over your whole set.

Ack, I've set up a local thingy to spellcheck all my commits, so hopefully
you should encounter less of that noise in future. For the pronouns stuff I
will do my best, but you might still spot violations in older text, sorry.
Re: [PATCH RFC v2 02/29] x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
On Sat, Mar 01, 2025 at 09:23:51AM +0200, Mike Rapoport wrote:
> Hi Brendan,
>
> On Fri, Jan 10, 2025 at 06:40:28PM +0000, Brendan Jackman wrote:
> > Currently a nop config. Keeping as a separate commit for easy review of
> > the boring bits. Later commits will use and enable this new config.
> >
> > This config is only added for non-UML x86_64 as other architectures do
> > not yet have pending implementations. It also has somewhat artificial
> > dependencies on !PARAVIRT and !KASAN which are explained in the Kconfig
> > file.
> >
> > Co-developed-by: Junaid Shahid
> > Signed-off-by: Junaid Shahid
> > Signed-off-by: Brendan Jackman
> > ---
> >  arch/alpha/include/asm/Kbuild      | 1 +
> >  arch/arc/include/asm/Kbuild        | 1 +
> >  arch/arm/include/asm/Kbuild        | 1 +
> >  arch/arm64/include/asm/Kbuild      | 1 +
> >  arch/csky/include/asm/Kbuild       | 1 +
> >  arch/hexagon/include/asm/Kbuild    | 1 +
> >  arch/loongarch/include/asm/Kbuild  | 3 +++
> >  arch/m68k/include/asm/Kbuild       | 1 +
> >  arch/microblaze/include/asm/Kbuild | 1 +
> >  arch/mips/include/asm/Kbuild       | 1 +
> >  arch/nios2/include/asm/Kbuild      | 1 +
> >  arch/openrisc/include/asm/Kbuild   | 1 +
> >  arch/parisc/include/asm/Kbuild     | 1 +
> >  arch/powerpc/include/asm/Kbuild    | 1 +
> >  arch/riscv/include/asm/Kbuild      | 1 +
> >  arch/s390/include/asm/Kbuild       | 1 +
> >  arch/sh/include/asm/Kbuild         | 1 +
> >  arch/sparc/include/asm/Kbuild      | 1 +
> >  arch/um/include/asm/Kbuild         | 2 +-
> >  arch/x86/Kconfig                   | 14 ++
> >  arch/xtensa/include/asm/Kbuild     | 1 +
> >  include/asm-generic/asi.h          | 5 +
> >  22 files changed, 41 insertions(+), 1 deletion(-)
>
> I don't think this all is needed. You can put asi.h with stubs used outside
> of arch/x86 in include/linux and save you the hassle of updating every
> architecture.

...

> If you expect other architectures might implement ASI, the config would
> better fit into init/Kconfig or mm/Kconfig, and arch/x86/Kconfig will define
> ARCH_HAS_MITIGATION_ADDRESS_SPACE_ISOLATION.

...

> > +++ b/include/asm-generic/asi.h
> > @@ -0,0 +1,5 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef __ASM_GENERIC_ASI_H
> > +#define __ASM_GENERIC_ASI_H
> > +
> > +#endif
>
> IMHO it should be include/linux/asi.h, with something like
>
> #ifndef __LINUX_ASI_H
> #define __LINUX_ASI_H
>
> #ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
>
> #include <asm/asi.h>
>
> #else /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
>
> /* stubs for functions used outside arch/ */
>
> #endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
>
> #endif /* __LINUX_ASI_H */

Thanks Mike! That does indeed look way tidier. I'll try to adopt it.
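For concreteness, a sketch of what Mike's suggested header might look like
with a couple of stubs filled in. The stub names and signatures below are
illustrative guesses at the series' generic-code API, not something either
mail specifies; the real list would mirror whatever ASI functions generic
code actually calls.

	/* SPDX-License-Identifier: GPL-2.0 */
	#ifndef __LINUX_ASI_H
	#define __LINUX_ASI_H

	#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION

	#include <asm/asi.h>

	#else /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */

	struct asi;
	struct mm_struct;

	/* Illustrative stubs only; names/signatures are assumptions. */
	static inline void asi_init_mm_state(struct mm_struct *mm) { }
	static inline void asi_relax(void) { }
	static inline void asi_exit(void) { }

	#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */

	#endif /* __LINUX_ASI_H */

With this shape, only arch/x86 ever provides asm/asi.h, and no per-arch
Kbuild churn is needed for the other architectures.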