date:20231205

[PATCH v13 00/35] x86: enable FRED for x86-64

2023-12-05 Thread Xin Li

This patch set enables the Intel flexible return and event delivery
(FRED) architecture for x86-64.

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
   delivery through the interrupt descriptor table (IDT event
   delivery) and event return by the IRET instruction with lower
   latency transitions.

2) Improve software robustness by ensuring that event delivery
   establishes the full supervisor context and that event return
   establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

Search for the latest FRED spec in most search engines with this search pattern:

  site:intel.com FRED (flexible return and event delivery) specification

As of now there is no publicly available CPU supporting FRED, so the Intel
Simics® Simulator is used as the software development and testing vehicle.
It can be downloaded from:
  
https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html

To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with
CPU model configured as:
$cpu_comp_class = "x86-experimental-fred"


Changes since v12:
* Merge the 3 WRMSRNS patches into one (Borislav Petkov).
* s/cpu/CPU/g (Borislav Petkov).
* Shorten the WRMSRNS description (Borislav Petkov).
* Put comments on top, not on the side (Borislav Petkov).
* Use the ASCII char ' (char number 0x27), instead of its Unicode
  counterpart (Borislav Petkov).
* No "we" in a commit message, use passive voice (Borislav Petkov).
* Fix confusing Signed-off-by chains (Borislav Petkov).

Changes since v11:
* Add a new structure fred_cs to denote the FRED flags above the CS
  selector, as is done for SS (H. Peter Anvin).

Changes since v10:
* No need to invalidate SYSCALL and SYSENTER MSRs (Thomas Gleixner).
* Better explain why there is no need to check the current stack level
  (Paolo Bonzini).
* Replace "IS_ENABLED(CONFIG_IA32_EMULATION)" with the new ia32_enabled()
  API (Nikolay Borisov).
* FRED feature is defined in cpuid word 12, not 13 (Nikolay Borisov).
* Reword a sentence in the new FRED documentation to improve readability
  (Nikolay Borisov).
* A few comment fixes and improvements to event type definitions
  (Andrew Cooper).

Changes since v9:
* Set unused sysvec table entries to fred_handle_spurious_interrupt()
  in fred_complete_exception_setup() (Thomas Gleixner).
* Shove the whole thing into arch/x86/entry/entry_64_fred.S for invoking
  external_interrupt() and fred_exc_nmi() (Sean Christopherson).
* Correct and improve a few comments (Sean Christopherson).
* Merge the two IRQ/NMI asm entries into one as it's fine to invoke
  noinstr code from regular code (Thomas Gleixner).
* Setup the long mode and NMI flags in the augmented SS field of FRED
  stack frame in C instead of asm (Thomas Gleixner).
* Don't use jump tables, indirect jumps are expensive (Thomas Gleixner).
* Except #NMI/#DB/#MCE, FRED really can share the exception handlers
  with IDT (Thomas Gleixner).
* Avoid the sysvec_* idt_entry muck, do it at a central place, reuse code
  instead of blindly copying it, which breaks the performance optimized
  sysvec entries like reschedule_ipi (Thomas Gleixner).
* Add asm_ prefix to FRED asm entry points (Thomas Gleixner).
* Disable #DB to avoid endless recursion and stack overflow when a
  watchpoint/breakpoint is set in the code path which is executed by
  #DB handler (Thomas Gleixner).
* Introduce a new structure fred_ss to denote the FRED flags above SS
  selector, which avoids FRED_SSX_ macros and makes the code simpler
  and easier to read (Thomas Gleixner).
* Use type u64 to define FRED bit fields instead of type unsigned int
  (Thomas Gleixner).
* Avoid a type cast by defining X86_CR4_FRED as 0 on 32-bit (Thomas
  Gleixner).
* Add the WRMSRNS instruction support (Thomas Gleixner).

Changes since v8:
* Move the FRED initialization patch after all required changes are in
  place (Thomas Gleixner).
* Don't do syscall early out in fred_entry_from_user() before there are
  proper performance numbers and justifications (Thomas Gleixner).
* Add the control exception handler to the FRED exception handler table
  (Thomas Gleixner).
* Introduce a macro sysvec_install() to derive the asm handler name from
  a C handler, which simplifies the code and avoids an ugly typecast
  (Thomas Gleixner).
* Remove junk code that assumes no local APIC on x86_64 (Thomas Gleixner).
* Put I

[PATCH v13 01/35] x86/cpufeatures,opcode,msr: Add the WRMSRNS instruction support

2023-12-05 Thread Xin Li

WRMSRNS is an instruction that behaves exactly like WRMSR, with
the only difference being that it is not a serializing instruction
by default. Under certain conditions, WRMSRNS may replace WRMSR to
improve performance.

Add its CPU feature bit, opcode to the x86 opcode map, and an
always inline API __wrmsrns() to embed WRMSRNS into the code.
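
As an illustration of the calling convention only: WRMSR-style
instructions take the MSR index in ECX and the 64-bit value split across
EDX:EAX, which is why wrmsrns(msr, val) passes val and val >> 32 to
__wrmsrns(). The user-space sketch below shows just that split. The names
struct msr_halves, split_msr_value() and join_msr_value() are hypothetical
and not part of this patch, and no MSR is touched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two register halves (not a kernel type). */
struct msr_halves {
	uint32_t low;	/* would be passed in EAX */
	uint32_t high;	/* would be passed in EDX */
};

/* Split a 64-bit value the way wrmsrns() feeds __wrmsrns(). */
static struct msr_halves split_msr_value(uint64_t val)
{
	struct msr_halves h = {
		.low  = (uint32_t)val,		/* truncation keeps bits 31:0 */
		.high = (uint32_t)(val >> 32),	/* bits 63:32 */
	};

	return h;
}

/* Reassemble the halves, to check that the split loses nothing. */
static uint64_t join_msr_value(struct msr_halves h)
{
	return ((uint64_t)h.high << 32) | h.low;
}
```

In the patch itself the truncation to the low half happens implicitly,
because the second parameter of __wrmsrns() has type u32.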

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v12:
* Merge the 3 WRMSRNS patches into one (Borislav Petkov).
* s/cpu/CPU/g (Borislav Petkov).
* Shorten the WRMSRNS description (Borislav Petkov).
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/msr.h   | 18 ++
 arch/x86/lib/x86-opcode-map.txt  |  2 +-
 tools/arch/x86/include/asm/cpufeatures.h |  1 +
 tools/arch/x86/lib/x86-opcode-map.txt|  2 +-
 5 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 149cc5d5c2ae..a903fc130e49 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -325,6 +325,7 @@
 #define X86_FEATURE_FSRS   (12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC       (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
 #define X86_FEATURE_LKGS       (12*32+18) /* "" Load "kernel" (userspace) GS */
+#define X86_FEATURE_WRMSRNS    (12*32+19) /* "" Non-serializing WRMSR */
 #define X86_FEATURE_AMX_FP16   (12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA   (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
 #define X86_FEATURE_LAM        (12*32+26) /* Linear Address Masking */
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..c284ff9ebe67 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
 : : "c" (msr), "a"(low), "d" (high) : "memory");
 }
 
+/*
+ * WRMSRNS behaves exactly like WRMSR with the only difference being
+ * that it is not a serializing instruction by default.
+ */
+static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
+{
+	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
+	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+		     : : "c" (msr), "a"(low), "d" (high));
+}
+
 #define native_rdmsr(msr, val1, val2)  \
 do {   \
u64 __val = __rdmsr((msr)); \
@@ -297,6 +310,11 @@ do {	\
 
 #endif /* !CONFIG_PARAVIRT_XXL */
 
+static __always_inline void wrmsrns(u32 msr, u64 val)
+{
+   __wrmsrns(msr, val, val >> 32);
+}
+
 /*
  * 64-bit version of wrmsr_safe():
  */
diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..1efe1d9bf5ce 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1051,7 +1051,7 @@ GrpTable: Grp6
 EndTable
 
 GrpTable: Grp7
-0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
+0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
 1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 4af140cf5719..26a73ae18a86 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -322,6 +322,7 @@
 #define X86_FEATURE_FSRS   (12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC       (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
 #define X86_FEATURE_LKGS       (12*32+18) /* "" Load "kernel" (userspace) GS */
+#define X86_FEATURE_WRMSRNS    (12*32+19) /* "" Non-serializing WRMSR */
 #define X86_FEATURE_AMX_FP16   (12*32+21) /* "" AMX fp16 Support */
 #define X86_FEATURE_AVX_IFMA   (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
 #define X86_FEATURE_LAM        (12*32+26) /* Linear Address Masking */
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..1efe1d9bf5ce 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1051,7 +1051,7 @@ GrpTable: Grp6
 EndTable
 
 GrpTable: Grp7
-0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B)

[PATCH v13 02/35] x86/entry: Remove idtentry_sysvec from entry_{32,64}.S

2023-12-05 Thread Xin Li

idtentry_sysvec is really just DECLARE_IDTENTRY defined in
<asm/idtentry.h>, so there is no need to define it separately.

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/entry/entry_32.S   | 4 
 arch/x86/entry/entry_64.S   | 8 
 arch/x86/include/asm/idtentry.h | 2 +-
 3 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4e295798638b..1b0fe4b49ea0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -649,10 +649,6 @@ SYM_CODE_START_LOCAL(asm_\cfunc)
 SYM_CODE_END(asm_\cfunc)
 .endm
 
-.macro idtentry_sysvec vector cfunc
-   idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /*
  * Include the defines which emit the idt entries which are shared
  * shared between 32 and 64 bit and emit the __irqentry_text_* markers
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 567d973eed03..5a1660701623 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -370,14 +370,6 @@ SYM_CODE_END(\asmsym)
idtentry \vector asm_\cfunc \cfunc has_error_code=1
 .endm
 
-/*
- * System vectors which invoke their handlers directly and are not
- * going through the regular common device interrupt handling code.
- */
-.macro idtentry_sysvec vector cfunc
-   idtentry \vector asm_\cfunc \cfunc has_error_code=0
-.endm
-
 /**
  * idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
  * @vector:Vector number
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 05fd175cec7d..cfca68f6cb84 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -447,7 +447,7 @@ __visible noinstr void func(struct pt_regs *regs,	\
 
 /* System vector entries */
 #define DECLARE_IDTENTRY_SYSVEC(vector, func)  \
-   idtentry_sysvec vector func
+   DECLARE_IDTENTRY(vector, func)
 
 #ifdef CONFIG_X86_64
 # define DECLARE_IDTENTRY_MCE(vector, func)\
-- 
2.43.0

[PATCH v13 03/35] x86/trapnr: Add event type macros to <asm/trapnr.h>

2023-12-05 Thread Xin Li

Intel VT-x classifies events into eight different types, which is
inherited by FRED for event identification. As such, event type
becomes a common x86 concept, and should be defined in a common x86
header.

Add event type macros to <asm/trapnr.h>, and use them in <asm/vmx.h>.

Suggested-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v10:
* A few comment fixes and improvements (Andrew Cooper).
---
 arch/x86/include/asm/trapnr.h | 12 
 arch/x86/include/asm/vmx.h| 17 +
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/trapnr.h b/arch/x86/include/asm/trapnr.h
index f5d2325aa0b7..8d1154cdf787 100644
--- a/arch/x86/include/asm/trapnr.h
+++ b/arch/x86/include/asm/trapnr.h
@@ -2,6 +2,18 @@
 #ifndef _ASM_X86_TRAPNR_H
 #define _ASM_X86_TRAPNR_H
 
+/*
+ * Event type codes used by FRED, Intel VT-x and AMD SVM
+ */
+#define EVENT_TYPE_EXTINT      0   // External interrupt
+#define EVENT_TYPE_RESERVED    1
+#define EVENT_TYPE_NMI         2   // NMI
+#define EVENT_TYPE_HWEXC       3   // Hardware originated traps, exceptions
+#define EVENT_TYPE_SWINT       4   // INT n
+#define EVENT_TYPE_PRIV_SWEXC  5   // INT1
+#define EVENT_TYPE_SWEXC       6   // INTO, INT3
+#define EVENT_TYPE_OTHER       7   // FRED SYSCALL/SYSENTER, VT-x MTF
+
 /* Interrupts/Exceptions */
 
 #define X86_TRAP_DE 0  /* Divide-by-zero */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..4dba17363008 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -17,6 +17,7 @@
 #include 
 
 #include 
+#include <asm/trapnr.h>
 #include 
 
 #define VMCS_CONTROL_BIT(x)BIT(VMX_FEATURE_##x & 0x1f)
@@ -374,14 +375,14 @@ enum vmcs_field {
 #define VECTORING_INFO_DELIVER_CODE_MASK   INTR_INFO_DELIVER_CODE_MASK
 #define VECTORING_INFO_VALID_MASK  INTR_INFO_VALID_MASK
 
-#define INTR_TYPE_EXT_INTR           (0 << 8) /* external interrupt */
-#define INTR_TYPE_RESERVED           (1 << 8) /* reserved */
-#define INTR_TYPE_NMI_INTR           (2 << 8) /* NMI */
-#define INTR_TYPE_HARD_EXCEPTION     (3 << 8) /* processor exception */
-#define INTR_TYPE_SOFT_INTR          (4 << 8) /* software interrupt */
-#define INTR_TYPE_PRIV_SW_EXCEPTION  (5 << 8) /* ICE breakpoint - undocumented */
-#define INTR_TYPE_SOFT_EXCEPTION     (6 << 8) /* software exception */
-#define INTR_TYPE_OTHER_EVENT        (7 << 8) /* other event */
+#define INTR_TYPE_EXT_INTR           (EVENT_TYPE_EXTINT << 8)      /* external interrupt */
+#define INTR_TYPE_RESERVED           (EVENT_TYPE_RESERVED << 8)    /* reserved */
+#define INTR_TYPE_NMI_INTR           (EVENT_TYPE_NMI << 8)         /* NMI */
+#define INTR_TYPE_HARD_EXCEPTION     (EVENT_TYPE_HWEXC << 8)       /* processor exception */
+#define INTR_TYPE_SOFT_INTR          (EVENT_TYPE_SWINT << 8)       /* software interrupt */
+#define INTR_TYPE_PRIV_SW_EXCEPTION  (EVENT_TYPE_PRIV_SWEXC << 8)  /* ICE breakpoint */
+#define INTR_TYPE_SOFT_EXCEPTION     (EVENT_TYPE_SWEXC << 8)       /* software exception */
+#define INTR_TYPE_OTHER_EVENT        (EVENT_TYPE_OTHER << 8)       /* other event */
 
 /* GUEST_INTERRUPTIBILITY_INFO flags. */
 #define GUEST_INTR_STATE_STI   0x0001
-- 
2.43.0
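
The relationship this patch sets up between the two headers can be
sketched in user space: each VT-x INTR_TYPE_* value is the matching
EVENT_TYPE_* code shifted left by 8. The helper names and the
TOY_INTR_TYPE_* macros below are hypothetical. The 3-bit width of the
type field (bits 10:8) is an assumption drawn from the eight codes listed
above, not something this patch states.

```c
#include <assert.h>
#include <stdint.h>

/* Event type codes used in this sketch, copied from the trapnr.h hunk. */
#define EVENT_TYPE_EXTINT	0
#define EVENT_TYPE_NMI		2
#define EVENT_TYPE_HWEXC	3
#define EVENT_TYPE_OTHER	7

/* Hypothetical names: position and assumed width of the type field. */
#define TOY_INTR_TYPE_SHIFT	8
#define TOY_INTR_TYPE_MASK	(7u << TOY_INTR_TYPE_SHIFT)

/* Build the INTR_TYPE_* encoding from a common event type code. */
static uint32_t event_type_to_intr_type(unsigned int type)
{
	assert(type <= EVENT_TYPE_OTHER);
	return (uint32_t)type << TOY_INTR_TYPE_SHIFT;
}

/* Recover the common event type code from an interruption-info word. */
static unsigned int intr_info_to_event_type(uint32_t intr_info)
{
	return (intr_info & TOY_INTR_TYPE_MASK) >> TOY_INTR_TYPE_SHIFT;
}
```

With the codes defined once, FRED and VMX code can compare event types
without each side keeping its own copy of the numbers.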

[PATCH v13 04/35] Documentation/x86/64: Add a documentation for FRED

2023-12-05 Thread Xin Li

Briefly introduce FRED, and its advantages compared to IDT.

Reviewed-by: Bagas Sanjaya 
Signed-off-by: Xin Li 
---

Changes since v10:
* Reword a sentence to improve readability (Nikolay Borisov).
---
 Documentation/arch/x86/x86_64/fred.rst  | 96 +
 Documentation/arch/x86/x86_64/index.rst |  1 +
 2 files changed, 97 insertions(+)
 create mode 100644 Documentation/arch/x86/x86_64/fred.rst

diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst
new file mode 100644
index 000000000000..9f57e7b91f7e
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fred.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Flexible Return and Event Delivery (FRED)
+=========================================
+
+Overview
+========
+
+The FRED architecture defines simple new transitions that change
+privilege level (ring transitions). The FRED architecture was
+designed with the following goals:
+
+1) Improve overall performance and response time by replacing event
+   delivery through the interrupt descriptor table (IDT event
+   delivery) and event return by the IRET instruction with lower
+   latency transitions.
+
+2) Improve software robustness by ensuring that event delivery
+   establishes the full supervisor context and that event return
+   establishes the full user context.
+
+The new transitions defined by the FRED architecture are FRED event
+delivery and, for returning from events, two FRED return instructions.
+FRED event delivery can effect a transition from ring 3 to ring 0, but
+it is used also to deliver events incident to ring 0. One FRED
+instruction (ERETU) effects a return from ring 0 to ring 3, while the
+other (ERETS) returns while remaining in ring 0. Collectively, FRED
+event delivery and the FRED return instructions are FRED transitions.
+
+In addition to these transitions, the FRED architecture defines a new
+instruction (LKGS) for managing the state of the GS segment register.
+The LKGS instruction can be used by 64-bit operating systems that do
+not use the new FRED transitions.
+
+Furthermore, the FRED architecture is easy to extend for future CPU
+architectures.
+
+Software based event dispatching
+================================
+
+FRED operates differently from IDT in terms of event handling. Instead
+of directly dispatching an event to its handler based on the event
+vector, FRED requires the software to dispatch an event to its handler
+based on both the event's type and vector. Therefore, an event dispatch
+framework must be implemented to facilitate the event-to-handler
+dispatch process. The FRED event dispatch framework takes control
+once an event is delivered, and employs a two-level dispatch.
+
+The first level dispatching is event type based, and the second level
+dispatching is event vector based.
+
+Full supervisor/user context
+============================
+
+FRED event delivery atomically saves and restores the full supervisor/user
+context upon event delivery and return. This avoids the problem of
+transient states due to %cr2 and/or %dr6, so it is no longer necessary
+to handle all the ugly corner cases caused by half-baked entry states.
+
+FRED allows explicit unblock of NMI with new event return instructions
+ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
+unblocks NMI, e.g., when an exception happens during NMI handling.
+
+FRED always restores the full value of %rsp, thus ESPFIX is no longer
+needed when FRED is enabled.
+
+LKGS
+====
+
+LKGS behaves like the MOV to GS instruction except that it loads the
+base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
+segment’s descriptor cache. With LKGS, there is no need to muck with
+the kernel GS, i.e., an operating system can always operate
+with its own GS base address.
+
+Because FRED event delivery from ring 3 and ERETU both swap the value
+of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
+the introduction of LKGS instruction, the SWAPGS instruction is no
+longer needed when FRED is enabled, thus is disallowed (#UD).
+
+Stack levels
+============
+
+4 stack levels 0~3 are introduced to replace the nonreentrant IST for
+event handling, and each stack level should be configured to use a
+dedicated stack.
+
+The current stack level could be unchanged or go higher upon FRED
+event delivery. If unchanged, the CPU keeps using the current event
+stack. If higher, the CPU switches to a new event stack specified by
+the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].
+
+Only execution of a FRED return instruction, ERET[US], can lower the
+current stack level, causing the CPU to switch back to the stack it was
+on before a previous event delivery that promoted the stack level.
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index a56070fc8e77..ad15e9bd623f 100644
--- a/Documentation/arch/x86/x86_64/index.rst
+++ b/Doc
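
The two-level dispatch described in the new document (first by event
type, then by vector) might look roughly like the user-space sketch
below. It is not the dispatcher from this series: the handler names and
the strings they return are hypothetical and exist only to make the
control flow visible.

```c
#include <assert.h>
#include <string.h>

/* Event type codes, as defined in patch 03 of this series. */
#define EVENT_TYPE_EXTINT	0
#define EVENT_TYPE_NMI		2
#define EVENT_TYPE_HWEXC	3

/* Hypothetical second-level handlers: select on the vector. */
static const char *toy_handle_extint(unsigned int vector)
{
	(void)vector;	/* a real handler would look up the IRQ here */
	return "external interrupt";
}

static const char *toy_handle_hwexc(unsigned int vector)
{
	switch (vector) {
	case 0:
		return "#DE";
	case 14:
		return "#PF";
	default:
		return "other exception";
	}
}

/* First level: select on the event type, then hand over the vector. */
static const char *toy_dispatch_event(unsigned int type, unsigned int vector)
{
	switch (type) {
	case EVENT_TYPE_EXTINT:
		return toy_handle_extint(vector);
	case EVENT_TYPE_NMI:
		return "NMI";
	case EVENT_TYPE_HWEXC:
		return toy_handle_hwexc(vector);
	default:
		return "bad event";
	}
}
```

The sketch uses switch statements rather than a table of function
pointers, in the spirit of the v9 review note that indirect jumps are
expensive.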

[PATCH v13 05/35] x86/fred: Add Kconfig option for FRED (CONFIG_X86_FRED)

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add the configuration option CONFIG_X86_FRED to enable FRED.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c456c9b1fc7c..ec923d4055c5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -492,6 +492,15 @@ config X86_CPU_RESCTRL
 
  Say N if unsure.
 
+config X86_FRED
+	bool "Flexible Return and Event Delivery"
+	depends on X86_64
+	help
+	  When enabled, try to use Flexible Return and Event Delivery
+	  instead of the legacy SYSCALL/SYSENTER/IDT architecture for
+	  ring transitions and exception/interrupt handling if the
+	  system supports it.
+
 if X86_32
 config X86_BIGSMP
bool "Support for big SMP systems with more than 8 CPUs"
-- 
2.43.0

[PATCH v13 07/35] x86/fred: Disable FRED support if CONFIG_X86_FRED is disabled

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
cpu_feature_enabled() work correctly with FRED.

Originally-by: Megha Dey 
Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v10:
* FRED feature is defined in cpuid word 12, not 13 (Nikolay Borisov).
---
 arch/x86/include/asm/disabled-features.h   | 8 +++-
 tools/arch/x86/include/asm/disabled-features.h | 8 +++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 702d93fdd10e..f40b29d3abad 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -117,6 +117,12 @@
 #define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED  0
+#else
+# define DISABLE_FRED  (1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -133,7 +139,7 @@
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
-#define DISABLED_MASK12	(DISABLE_LAM)
+#define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index 702d93fdd10e..f40b29d3abad 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -117,6 +117,12 @@
 #define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
 #endif
 
+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED  0
+#else
+# define DISABLE_FRED  (1 << (X86_FEATURE_FRED & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -133,7 +139,7 @@
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
-#define DISABLED_MASK12	(DISABLE_LAM)
+#define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-- 
2.43.0
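
A small worked example of the arithmetic behind DISABLE_FRED and
DISABLED_MASK12: a feature number encodes a 32-bit capability word and a
bit within that word. The helper names below are hypothetical. The value
of X86_FEATURE_FRED is the one defined by patch 06 of this series.

```c
#include <assert.h>
#include <stdint.h>

#define X86_FEATURE_FRED	(12*32+17)	/* value from patch 06 */

/* Which 32-bit capability word the feature lives in. */
static unsigned int toy_feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit position inside that word. */
static unsigned int toy_feature_bit(unsigned int feature)
{
	return feature & 31;
}

/* Single-bit mask, computed the same way as DISABLE_FRED above. */
static uint32_t toy_feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}
```

Because FRED sits in word 12, its bit has to be OR-ed into
DISABLED_MASK12. That is the v10 fix noted in the changelog, where the
feature had been attributed to word 13.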

[PATCH v13 06/35] x86/cpufeatures: Add the CPU feature bit for FRED

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Any FRED CPU will always have the following features as its baseline:
  1) LKGS, which loads the attributes of the GS segment like MOV to GS,
     but puts the base address into the IA32_KERNEL_GS_BASE MSR instead
     of the GS segment’s descriptor cache.
  2) WRMSRNS, a non-serializing WRMSR for faster MSR writes.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v12:
* s/cpu/CPU/g (Borislav Petkov).
---
 arch/x86/include/asm/cpufeatures.h   | 1 +
 arch/x86/kernel/cpu/cpuid-deps.c | 2 ++
 tools/arch/x86/include/asm/cpufeatures.h | 1 +
 3 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index a903fc130e49..fef95d190054 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -324,6 +324,7 @@
 #define X86_FEATURE_FZRM       (12*32+10) /* "" Fast zero-length REP MOVSB */
 #define X86_FEATURE_FSRS       (12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC       (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED       (12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS       (12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_WRMSRNS    (12*32+19) /* "" Non-serializing WRMSR */
 #define X86_FEATURE_AMX_FP16   (12*32+21) /* "" AMX fp16 Support */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index e462c1d3800a..b7174209d855 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -82,6 +82,8 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD,  X86_FEATURE_XGETBV1   },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD   },
{ X86_FEATURE_SHSTK,X86_FEATURE_XSAVES},
+   { X86_FEATURE_FRED, X86_FEATURE_LKGS  },
+   { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS   },
{}
 };
 
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 26a73ae18a86..f433e9f61354 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -321,6 +321,7 @@
 #define X86_FEATURE_FZRM       (12*32+10) /* "" Fast zero-length REP MOVSB */
 #define X86_FEATURE_FSRS       (12*32+11) /* "" Fast short REP STOSB */
 #define X86_FEATURE_FSRC       (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED       (12*32+17) /* Flexible Return and Event Delivery */
 #define X86_FEATURE_LKGS       (12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_WRMSRNS    (12*32+19) /* "" Non-serializing WRMSR */
 #define X86_FEATURE_AMX_FP16   (12*32+21) /* "" AMX fp16 Support */
-- 
2.43.0
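
The effect of the two cpuid_deps[] entries can be pictured with a toy
model: when LKGS or WRMSRNS is cleared, FRED must be cleared too. The
sketch below is a simplified user-space stand-in, not the kernel's
cpuid-deps.c logic. The enum, the table and both functions are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature feature set; TOY_LAM is unrelated to FRED. */
enum toy_feature { TOY_LKGS, TOY_WRMSRNS, TOY_FRED, TOY_LAM, TOY_NFEATURES };

struct toy_dep {
	enum toy_feature feature;	/* this feature ... */
	enum toy_feature depends;	/* ... needs this one */
};

/* Mirrors the two entries this patch adds to cpuid_deps[]. */
static const struct toy_dep toy_deps[] = {
	{ TOY_FRED, TOY_LKGS },
	{ TOY_FRED, TOY_WRMSRNS },
};

/* Clear a feature, then keep clearing dependents until nothing changes. */
static void toy_clear_feature(bool caps[TOY_NFEATURES], enum toy_feature f)
{
	bool changed;

	assert(f < TOY_NFEATURES);
	caps[f] = false;
	do {
		changed = false;
		for (size_t i = 0; i < sizeof(toy_deps) / sizeof(toy_deps[0]); i++) {
			if (!caps[toy_deps[i].depends] && caps[toy_deps[i].feature]) {
				caps[toy_deps[i].feature] = false;
				changed = true;
			}
		}
	} while (changed);
}

/* Start with every feature set, clear one, and report whether FRED remains. */
static bool toy_fred_survives_without(enum toy_feature missing)
{
	bool caps[TOY_NFEATURES] = { true, true, true, true };

	toy_clear_feature(caps, missing);
	return caps[TOY_FRED];
}
```

Clearing an unrelated feature such as the toy LAM entry leaves FRED set,
while clearing either baseline feature takes FRED down with it.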

[PATCH v13 08/35] x86/fred: Disable FRED by default in its early stage

2023-12-05 Thread Xin Li

To enable FRED, a new kernel command line option "fred" needs to be added.

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 Documentation/admin-guide/kernel-parameters.txt | 3 +++
 arch/x86/kernel/cpu/common.c| 3 +++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..6992b392e8d3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1526,6 +1526,9 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
 
+	fred		[X86-64]
+			Enable flexible return and event delivery
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4d4b87c6885d..68102acd63b0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1491,6 +1491,9 @@ static void __init cpu_parse_early_param(void)
char *argptr = arg, *opt;
int arglen, taint = 0;
 
+	if (!cmdline_find_option_bool(boot_command_line, "fred"))
+		setup_clear_cpu_cap(X86_FEATURE_FRED);
+
 #ifdef CONFIG_X86_32
if (cmdline_find_option_bool(boot_command_line, "no387"))
 #ifdef CONFIG_MATH_EMULATION
-- 
2.43.0
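
The opt-in behaviour of this patch (FRED is cleared unless "fred"
appears as a boolean option on the command line) can be approximated in
user space as follows. toy_find_option_bool() is a hypothetical,
simplified stand-in for cmdline_find_option_bool(): it matches
whitespace-separated whole words only and returns a plain bool, which is
an assumption about the intended matching rather than a copy of the
kernel helper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if 'option' appears as a whole word in 'cmdline'. */
static bool toy_find_option_bool(const char *cmdline, const char *option)
{
	size_t len = strlen(option);
	const char *p = cmdline;

	assert(len > 0);
	while (*p) {
		/* Skip the whitespace in front of the next word. */
		while (*p && isspace((unsigned char)*p))
			p++;

		/* Find the end of the word. */
		const char *start = p;
		while (*p && !isspace((unsigned char)*p))
			p++;

		if ((size_t)(p - start) == len && memcmp(start, option, len) == 0)
			return true;
	}
	return false;
}

/* FRED stays enabled only when explicitly requested. */
static bool toy_fred_requested(const char *cmdline)
{
	return toy_find_option_bool(cmdline, "fred");
}
```

Under this model, a command line such as "ro quiet fred" keeps the
feature, while "ro quiet" or "ro nofred" leads to
setup_clear_cpu_cap(X86_FEATURE_FRED) being called.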

[PATCH v13 11/35] x86/cpu: Add X86_CR4_FRED macro

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
changed after initialization, so add it to the pinned CR4 bits.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v9:
* Avoid a type cast by defining X86_CR4_FRED as 0 on 32-bit (Thomas
  Gleixner).
---
 arch/x86/include/uapi/asm/processor-flags.h | 7 +++
 arch/x86/kernel/cpu/common.c| 5 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index d898432947ff..f1a4adc78272 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -139,6 +139,13 @@
 #define X86_CR4_LAM_SUP_BIT	28 /* LAM for supervisor pointers */
 #define X86_CR4_LAM_SUP		_BITUL(X86_CR4_LAM_SUP_BIT)
 
+#ifdef __x86_64__
+#define X86_CR4_FRED_BIT	32 /* enable FRED kernel entry */
+#define X86_CR4_FRED		_BITUL(X86_CR4_FRED_BIT)
+#else
+#define X86_CR4_FRED		(0)
+#endif
+
 /*
  * x86-64 Task Priority Register, CR8
  */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 68102acd63b0..132f41f7c27f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -389,9 +389,8 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 }
 
 /* These bits should not change their value after CPU init is finished. */
-static const unsigned long cr4_pinned_mask =
-	X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
-	X86_CR4_FSGSBASE | X86_CR4_CET;
+static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
+					     X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED;
 static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
 static unsigned long cr4_pinned_bits __ro_after_init;
 
-- 
2.43.0
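
What "pinned" means for the new bit can be sketched in user space: once
the pinned value is latched, any later CR4 write that disagrees with it
in a pinned position is corrected. This is a loose model of the idea, not
the kernel's CR4 write path. All TOY_* names and toy_apply_cr4_pinning()
are hypothetical, and only two of the pinned bits are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* CR4.SMEP is bit 20; CR4.FRED is bit 32 per the hunk above. */
#define TOY_CR4_SMEP	(UINT64_C(1) << 20)
#define TOY_CR4_FRED	(UINT64_C(1) << 32)

/* Bits whose value must not change after CPU init (subset for the sketch). */
static const uint64_t toy_pinned_mask = TOY_CR4_SMEP | TOY_CR4_FRED;

/*
 * Return the value that would actually be loaded: pinned positions are
 * forced back to the latched 'pinned_bits', all other bits pass through.
 */
static uint64_t toy_apply_cr4_pinning(uint64_t requested, uint64_t pinned_bits)
{
	assert((pinned_bits & ~toy_pinned_mask) == 0);

	if ((requested & toy_pinned_mask) != pinned_bits)
		requested = (requested & ~toy_pinned_mask) | pinned_bits;

	return requested;
}
```

Bit 32 does not fit in a 32-bit unsigned long, which is why the patch
defines X86_CR4_FRED as 0 on 32-bit builds instead of casting.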

[PATCH v13 10/35] x86/objtool: Teach objtool about ERET[US]

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Update the objtool decoder to know about the ERET[US] instructions
(type INSN_CONTEXT_SWITCH).

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 tools/objtool/arch/x86/decode.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index e327cd827135..3a1d80a7878d 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -509,11 +509,20 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec
 
 		if (op2 == 0x01) {
 
-			if (modrm == 0xca)
-				insn->type = INSN_CLAC;
-			else if (modrm == 0xcb)
-				insn->type = INSN_STAC;
-
+			switch (insn_last_prefix_id(&ins)) {
+			case INAT_PFX_REPE:
+			case INAT_PFX_REPNE:
+				if (modrm == 0xca)
+					/* eretu/erets */
+					insn->type = INSN_CONTEXT_SWITCH;
+				break;
+			default:
+				if (modrm == 0xca)
+					insn->type = INSN_CLAC;
+				else if (modrm == 0xcb)
+					insn->type = INSN_STAC;
+				break;
+			}
 		} else if (op2 >= 0x80 && op2 <= 0x8f) {
 
 			insn->type = INSN_JUMP_CONDITIONAL;
-- 
2.43.0
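
The decoding rule in this hunk is that the byte sequence 0f 01 ca means
CLAC when unprefixed, but ERETU or ERETS when preceded by an F3 or F2
prefix. A stand-alone sketch of that rule follows. It is a toy classifier
working on raw bytes, not objtool's decoder: the enum and
toy_classify() are hypothetical, and only F2/F3 prefixes are recognised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_insn_type {
	TOY_INSN_OTHER,
	TOY_INSN_CLAC,
	TOY_INSN_STAC,
	TOY_INSN_CONTEXT_SWITCH,
};

/* Classify [F2|F3]* 0f 01 <modrm>; anything else is TOY_INSN_OTHER. */
static enum toy_insn_type toy_classify(const uint8_t *bytes, size_t len)
{
	uint8_t last_prefix = 0;
	size_t i = 0;

	/* Consume REPNE (F2) / REPE (F3) prefixes, remembering the last one. */
	while (i < len && (bytes[i] == 0xf2 || bytes[i] == 0xf3))
		last_prefix = bytes[i++];

	/* Need the two-byte opcode 0f 01 followed by a ModRM byte. */
	if (len - i < 3 || bytes[i] != 0x0f || bytes[i + 1] != 0x01)
		return TOY_INSN_OTHER;

	uint8_t modrm = bytes[i + 2];

	if (last_prefix)	/* F3 0f 01 ca = ERETU, F2 0f 01 ca = ERETS */
		return modrm == 0xca ? TOY_INSN_CONTEXT_SWITCH : TOY_INSN_OTHER;

	if (modrm == 0xca)
		return TOY_INSN_CLAC;
	if (modrm == 0xcb)
		return TOY_INSN_STAC;
	return TOY_INSN_OTHER;
}
```

Checking the prefix before the ModRM byte is what keeps ERET[US] from
being misread as CLAC.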

[PATCH v13 09/35] x86/opcode: Add ERET[US] instructions to the x86 opcode map

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

ERETU returns from an event handler while making a transition to ring 3,
and ERETS returns from an event handler while staying in ring 0.

Add instruction opcodes used by ERET[US] to the x86 opcode map; opcode
numbers are per FRED spec v5.0.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Xin Li 
---
 arch/x86/lib/x86-opcode-map.txt   | 2 +-
 tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 1efe1d9bf5ce..12af572201a2 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 1efe1d9bf5ce..12af572201a2 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable
 
 GrpTable: Grp7
 0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B) | WRMSRNS (110),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
 2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
 3: LIDT Ms
 4: SMSW Mw/Rv
-- 
2.43.0

[PATCH v13 12/35] x86/cpu: Add MSR numbers for FRED configuration

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add MSR numbers for the FRED configuration registers per FRED spec 5.0.

Originally-by: Megha Dey 
Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/include/asm/msr-index.h   | 13 -
 tools/arch/x86/include/asm/msr-index.h | 13 -
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 737a52b89e64..d1d6b3c3e6bd 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
 #define EFER_FFXSR (1<<_EFER_FFXSR)
 #define EFER_AUTOIBRS  (1<<_EFER_AUTOIBRS)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0     0x1cc             /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1     0x1cd             /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2     0x1ce             /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3     0x1cf             /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS  0x1d0             /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0     MSR_IA32_PL0_SSP  /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1     0x1d1             /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2     0x1d2             /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3     0x1d3             /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG   0x1d4             /* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL  0x0033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT  29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT      BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 1d51e1850ed0..74f2c63ce717 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
 #define EFER_FFXSR (1<<_EFER_FFXSR)
 #define EFER_AUTOIBRS  (1<<_EFER_AUTOIBRS)
 
-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0     0x1cc             /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1     0x1cd             /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2     0x1ce             /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3     0x1cf             /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS  0x1d0             /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0     MSR_IA32_PL0_SSP  /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1     0x1d1             /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2     0x1d2             /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3     0x1d3             /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG   0x1d4             /* Entrypoint and interrupt stack level */
 
+/* Intel MSRs. Some also available on other CPUs */
 #define MSR_TEST_CTRL  0x0033
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT  29
 #define MSR_TEST_CTRL_SPLIT_LOCK_DETECT      BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
-- 
2.43.0

[PATCH v13 13/35] x86/ptrace: Cleanup the definition of the pt_regs structure

2023-12-05 Thread Xin Li

struct pt_regs is hard to read because the member- and section-related
comments are not aligned with the members.

The 'cs' and 'ss' members of pt_regs are of type 'unsigned long', while
in reality they are only 16 bits wide. This has worked so far because the
remaining space is unused, but FRED will use the remaining bits for
other purposes.

To prepare for FRED:

  - Clean up the formatting
  - Convert 'cs' and 'ss' to u16 and embed each of them in a union
    with a u64
  - Fix up the related printk() format strings
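
The effect of the union can be sketched in user space as follows. The
struct and function names are hypothetical stand-ins, not kernel code, and
the sketch assumes a little-endian target (as x86-64 is), so that the u16
overlays the low 16 bits of the 64-bit slot.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the CS slot of the return frame; names are illustrative. */
struct cs_slot_sketch {
	union {
		uint64_t csx;	/* the full 64-bit data slot */
		uint16_t cs;	/* the selector, low 16 bits on little-endian */
	};
};

/* The slot keeps its original size, so the frame layout is unchanged. */
static_assert(sizeof(struct cs_slot_sketch) == 8, "slot stays 64 bits wide");

/*
 * Write a selector through the full slot, put some extra information
 * in the bits above it, and read the selector back through the u16 view.
 */
static uint16_t selector_after_extension(uint16_t sel, uint64_t high_bits)
{
	struct cs_slot_sketch slot;

	slot.csx = sel;			/* upper 48 bits cleared */
	slot.csx |= high_bits << 16;	/* bits FRED may use later */

	return slot.cs;			/* unaffected by the upper bits */
}
```

Code that reads the u16 member never sees the upper bits, which is also
why the printk() format changes from %lx to %x for the selector.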

Originally-by: H. Peter Anvin (Intel) 
Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v12:
* Put comments on top, not on the side (Borislav Petkov).
---
 arch/x86/entry/vsyscall/vsyscall_64.c |  2 +-
 arch/x86/include/asm/ptrace.h | 48 +++
 arch/x86/kernel/process_64.c  |  2 +-
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e0ca8120aea8..a3c0df11d0e6 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -76,7 +76,7 @@ static void warn_bad_vsyscall(const char *level, struct 
pt_regs *regs,
if (!show_unhandled_signals)
return;
 
-   printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+   printk_ratelimited("%s%s[%d] %s ip:%lx cs:%x sp:%lx ax:%lx si:%lx di:%lx\n",
   level, current->comm, task_pid_nr(current),
   message, regs->ip, regs->cs,
   regs->sp, regs->ax, regs->si, regs->di);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f4db78b09c8f..b268cd2a2d01 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -57,17 +57,19 @@ struct pt_regs {
 #else /* __i386__ */
 
 struct pt_regs {
-/*
- * C ABI says these regs are callee-preserved. They aren't saved on kernel 
entry
- * unless syscall needs a complete, fully filled "struct pt_regs".
- */
+   /*
+* C ABI says these regs are callee-preserved. They aren't saved on
+* kernel entry unless syscall needs a complete, fully filled
+* "struct pt_regs".
+*/
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long bp;
unsigned long bx;
-/* These regs are callee-clobbered. Always saved on kernel entry. */
+
+   /* These regs are callee-clobbered. Always saved on kernel entry. */
unsigned long r11;
unsigned long r10;
unsigned long r9;
@@ -77,18 +79,38 @@ struct pt_regs {
unsigned long dx;
unsigned long si;
unsigned long di;
-/*
- * On syscall entry, this is syscall#. On CPU exception, this is error code.
- * On hw interrupt, it's IRQ number:
- */
+
+   /*
+* orig_ax is used on entry for:
+* - the syscall number (syscall, sysenter, int80)
+* - error_code stored by the CPU on traps and exceptions
+* - the interrupt number for device interrupts
+*/
unsigned long orig_ax;
-/* Return frame for iretq */
+
+   /* The IRETQ return frame starts here */
unsigned long ip;
-   unsigned long cs;
+
+   union {
+   /* The full 64-bit data slot containing CS */
+   u64 csx;
+   /* CS selector */
+   u16 cs;
+   };
+
unsigned long flags;
unsigned long sp;
-   unsigned long ss;
-/* top of stack page */
+
+   union {
+   /* The full 64-bit data slot containing SS */
+   u64 ssx;
+   /* SS selector */
+   u16 ss;
+   };
+
+   /*
+* Top of stack on IDT systems.
+*/
 };
 
 #endif /* !__i386__ */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1553e19904e0..b924477c5ba8 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -117,7 +117,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode 
mode,
 
printk("%sFS:  %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
   log_lvl, fs, fsindex, gs, gsindex, shadowgs);
-   printk("%sCS:  %04lx DS: %04x ES: %04x CR0: %016lx\n",
+   printk("%sCS:  %04x DS: %04x ES: %04x CR0: %016lx\n",
log_lvl, regs->cs, ds, es, cr0);
printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n",
log_lvl, cr2, cr3, cr4);
-- 
2.43.0

[PATCH v13 14/35] x86/ptrace: Add FRED additional information to the pt_regs structure

2023-12-05 Thread Xin Li

FRED defines additional information in the upper 48 bits of the cs/ss
fields. Therefore add the information definitions to the pt_regs
structure.

Specifically, introduce a new structure fred_ss to denote the FRED flags
above the SS selector, which avoids FRED_SSX_ macros and makes the code
simpler and easier to read.
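
For readers who prefer explicit bit positions, the sketch below decodes the
same 64-bit SS slot with shifts and masks. The offsets are derived from the
field order and widths of struct fred_ss in this patch, under the assumption
that the compiler allocates bit-fields starting from the least significant
bit (as GCC does on x86-64). The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The field widths of struct fred_ss, reserved gaps included, fill the slot. */
static_assert(16 + 1 + 1 + 1 + 13 + 8 + 8 + 4 + 4 + 1 + 1 + 1 + 1 + 4 == 64,
	      "fred_ss fields cover exactly 64 bits");

/* Decoded view used only by this sketch. */
struct ssx_fields {
	unsigned int ss, sti, swevent, nmi, vector, type;
	unsigned int enclave, lm, nested, insnlen;
};

/* Extract 'width' bits of 'v' starting at bit 'shift'. */
static uint64_t field(uint64_t v, unsigned int shift, unsigned int width)
{
	return (v >> shift) & ((1ULL << width) - 1);
}

static struct ssx_fields decode_ssx(uint64_t ssx)
{
	struct ssx_fields f = {
		.ss      = field(ssx,  0, 16),
		.sti     = field(ssx, 16,  1),
		.swevent = field(ssx, 17,  1),
		.nmi     = field(ssx, 18,  1),
		/* bits 19..31 reserved */
		.vector  = field(ssx, 32,  8),
		/* bits 40..47 reserved */
		.type    = field(ssx, 48,  4),
		/* bits 52..55 reserved */
		.enclave = field(ssx, 56,  1),
		.lm      = field(ssx, 57,  1),
		.nested  = field(ssx, 58,  1),
		/* bit 59 reserved */
		.insnlen = field(ssx, 60,  4),
	};

	return f;
}
```

The bit-field structure expresses the same layout without a FRED_SSX_*
macro per field, which is the readability argument made above.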

Originally-by: H. Peter Anvin (Intel) 
Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v11:
* Add a new structure fred_cs to denote the FRED flags above the CS
  selector, as is done for SS (H. Peter Anvin).

Changes since v9:
* Introduce a new structure fred_ss to denote the FRED flags above SS
  selector, which avoids FRED_SSX_ macros and makes the code simpler
  and easier to read (Thomas Gleixner).
* Use type u64 to define FRED bit fields instead of type unsigned int
  (Thomas Gleixner).

Changes since v8:
* Reflect stack frame definition changes from FRED spec 3.0 to 5.0.
* Use __packed instead of __attribute__((__packed__)) (Borislav Petkov).
* Put all comments above the members, like the rest of the file does
  (Borislav Petkov).

Changes since v3:
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
  (Andrew Cooper).
---
 arch/x86/include/asm/ptrace.h | 66 ---
 1 file changed, 61 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index b268cd2a2d01..5a83fbd9bc0b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -56,6 +56,50 @@ struct pt_regs {
 
 #else /* __i386__ */
 
+struct fred_cs {
+   /* CS selector */
+   u64 cs  : 16,
+   /* Stack level at event time */
+   sl  :  2,
+   /* IBT in WAIT_FOR_ENDBRANCH state */
+   wfe :  1,
+   : 45;
+};
+
+struct fred_ss {
+   /* SS selector */
+   u64 ss  : 16,
+   /* STI state */
+   sti :  1,
+   /* Set if syscall, sysenter or INT n */
+   swevent :  1,
+   /* Event is NMI type */
+   nmi :  1,
+   : 13,
+   /* Event vector */
+   vector  :  8,
+   :  8,
+   /* Event type */
+   type:  4,
+   :  4,
+   /* Event was incident to enclave execution */
+   enclave :  1,
+   /* CPU was in long mode */
+   lm  :  1,
+   /*
+* Nested exception during FRED delivery, not set
+* for #DF.
+*/
+   nested  :  1,
+   :  1,
+   /*
+* The length of the instruction causing the event.
+* Only set for INTO, INT1, INT3, INT n, SYSCALL
+* and SYSENTER.  0 otherwise.
+*/
+   insnlen :  4;
+};
+
 struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on
@@ -85,6 +129,12 @@ struct pt_regs {
 * - the syscall number (syscall, sysenter, int80)
 * - error_code stored by the CPU on traps and exceptions
 * - the interrupt number for device interrupts
+*
+* A FRED stack frame starts here:
+*   1) It _always_ includes an error code;
+*
+*   2) The return frame for ERET[US] starts here, but
+*  the content of orig_ax is ignored.
 */
unsigned long orig_ax;
 
@@ -92,24 +142,30 @@ struct pt_regs {
unsigned long ip;
 
union {
-   /* The full 64-bit data slot containing CS */
-   u64 csx;
/* CS selector */
u16 cs;
+   /* The extended 64-bit data slot containing CS */
+   u64 csx;
+   /* The FRED CS extension */
+   struct fred_cs  fred_cs;
};
 
unsigned long flags;
unsigned long sp;
 
union {
-   /* The full 64-bit data slot containing SS */
-   u64 ssx;
/* SS selector */
u16 ss;
+   /* The extended 64-bit data slot containing SS */
+   u64 ssx;
+   /* The FRED SS extension */
+   struct fred_ss  fred_ss;
};
 
/*
-* Top of stack on IDT systems.
+* Top of stack on IDT systems, while FRED systems have extra fields
+* defined above for storing exception related information, e.g. CR2 or
+* DR6.
 */
 };
 
-- 
2.43.0

[PATCH v13 15/35] x86/fred: Add a new header file for FRED definitions

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add a header file for FRED prototypes and definitions.
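
As a rough illustration of how the FRED_CONFIG_* and RSP mask macros in
this header combine, here is a user-space sketch. The function names and
the SKETCH_ constant are hypothetical; the shifts (6 and 9) and the 0x3f
mask are taken from the macros in the diff. The sketch assumes that the
entry point has its low bits clear so that it can be ORed with the other
fields.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors FRED_CONFIG_REDZONE_AMOUNT: measured in 64-byte cache lines. */
#define SKETCH_REDZONE_AMOUNT	1ULL

/* The field starts at bit 6, so one unit reads as 64 bytes. */
static_assert((SKETCH_REDZONE_AMOUNT << 6) == 64, "one unit is 64 bytes");

/*
 * Compose a configuration value from an entry point address and an
 * interrupt stack level, following the macro layout:
 * redzone at bit 6, interrupt stack level at bit 9.
 */
static uint64_t fred_config_value(uint64_t entrypoint, unsigned int int_stklvl)
{
	return entrypoint |
	       (SKETCH_REDZONE_AMOUNT << 6) |
	       ((uint64_t)int_stklvl << 9);
}

/* Align a stack pointer down to a 64-byte boundary (mask ~0x3f). */
static uint64_t fred_align_rsp(uint64_t rsp)
{
	return rsp & ~0x3fULL;
}
```

How the kernel actually programs the configuration MSR is the subject of
other patches in the series; this only shows the arithmetic.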

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v6:
* Replace pt_regs csx flags prefix FRED_CSL_ with FRED_CSX_.
---
 arch/x86/include/asm/fred.h | 68 +
 1 file changed, 68 insertions(+)
 create mode 100644 arch/x86/include/asm/fred.h

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
new file mode 100644
index ..f514fdb5a39f
--- /dev/null
+++ b/arch/x86/include/asm/fred.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Macros for Flexible Return and Event Delivery (FRED)
+ */
+
+#ifndef ASM_X86_FRED_H
+#define ASM_X86_FRED_H
+
+#include 
+
+#include 
+
+/*
+ * FRED event return instruction opcodes for ERET{S,U}; supported in
+ * binutils >= 2.41.
+ */
+#define ERETS  _ASM_BYTES(0xf2,0x0f,0x01,0xca)
+#define ERETU  _ASM_BYTES(0xf3,0x0f,0x01,0xca)
+
+/*
+ * RSP is aligned to a 64-byte boundary before being used to push a new stack frame
+ */
+#define FRED_STACK_FRAME_RSP_MASK  _AT(unsigned long, (~0x3f))
+
+/*
+ * Used for the return address for call emulation during code patching,
+ * and measured in 64-byte cache lines.
+ */
+#define FRED_CONFIG_REDZONE_AMOUNT 1
+#define FRED_CONFIG_REDZONE        (_AT(unsigned long, FRED_CONFIG_REDZONE_AMOUNT) << 6)
+#define FRED_CONFIG_INT_STKLVL(l)  (_AT(unsigned long, l) << 9)
+#define FRED_CONFIG_ENTRYPOINT(p)  _AT(unsigned long, (p))
+
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_X86_FRED
+#include 
+
+#include 
+
+struct fred_info {
+   /* Event data: CR2, DR6, ... */
+   unsigned long edata;
+   unsigned long resv;
+};
+
+/* Full format of the FRED stack frame */
+struct fred_frame {
+   struct pt_regs   regs;
+   struct fred_info info;
+};
+
+static __always_inline struct fred_info *fred_info(struct pt_regs *regs)
+{
+   return &container_of(regs, struct fred_frame, regs)->info;
+}
+
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
+{
+   return fred_info(regs)->edata;
+}
+
+#else /* CONFIG_X86_FRED */
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
+#endif /* CONFIG_X86_FRED */
+#endif /* !__ASSEMBLY__ */
+
+#endif /* ASM_X86_FRED_H */
-- 
2.43.0

[PATCH v13 17/35] x86/fred: Update MSR_IA32_FRED_RSP0 during task switch

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
be updated to point to the top of the next task's stack during a task
switch.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/include/asm/switch_to.h | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index f42dbf17f52b..c3bd0c0758c9 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -70,9 +70,13 @@ static inline void update_task_stack(struct task_struct 
*task)
 #ifdef CONFIG_X86_32
this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
 #else
-   /* Xen PV enters the kernel on the thread stack. */
-   if (cpu_feature_enabled(X86_FEATURE_XENPV))
+   if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+   /* WRMSRNS is a baseline feature for FRED. */
+   wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
+   } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+   /* Xen PV enters the kernel on the thread stack. */
load_sp0(task_top_of_stack(task));
+   }
 #endif
 }
 
-- 
2.43.0

[PATCH v13 16/35] x86/fred: Reserve space for the FRED stack frame

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

When using FRED, reserve space at the top of the stack frame, just
like i386 does.
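
The two reserved words correspond to struct fred_info, which sits directly
above pt_regs in a FRED stack frame. A user-space sketch of the arithmetic
follows; all *_sketch names are hypothetical stand-ins, pt_regs is modeled
as 21 machine words (r15 through ss), and an LP64 target is assumed so
that unsigned long is 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the x86-64 pt_regs: 21 machine words, r15 through ss. */
struct pt_regs_sketch { unsigned long slots[21]; };

/* Mirrors struct fred_info: event data plus one reserved word. */
struct fred_info_sketch { unsigned long edata; unsigned long resv; };

/* Mirrors struct fred_frame: the info words follow pt_regs. */
struct fred_frame_sketch {
	struct pt_regs_sketch	regs;
	struct fred_info_sketch	info;
};

/* Same value as TOP_OF_KERNEL_STACK_PADDING with FRED configured. */
#define SKETCH_TOP_OF_STACK_PADDING	(2 * 8)

static_assert(sizeof(struct fred_info_sketch) == SKETCH_TOP_OF_STACK_PADDING,
	      "padding is exactly the two fred_info words (LP64 assumed)");

/* Locate pt_regs when the padding is reserved at the top of the stack. */
static struct pt_regs_sketch *regs_below_top(char *stack_top)
{
	return (struct pt_regs_sketch *)(stack_top - SKETCH_TOP_OF_STACK_PADDING) - 1;
}

/* container_of()-style step from pt_regs to the event data above it. */
static unsigned long event_data(struct pt_regs_sketch *regs)
{
	struct fred_frame_sketch *frame = (struct fred_frame_sketch *)
		((char *)regs - offsetof(struct fred_frame_sketch, regs));

	return frame->info.edata;
}
```

With the padding in place, the usual "regs end at the top of the stack"
computation lands on pt_regs and leaves the two info words above it.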

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/include/asm/thread_info.h | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..12da7dfd5ef1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -31,7 +31,9 @@
  * In vm86 mode, the hardware frame is much longer still, so add 16
  * bytes to make room for the real-mode segments.
  *
- * x86_64 has a fixed-length stack frame.
+ * x86-64 has a fixed-length stack frame, but it depends on whether
+ * or not FRED is enabled. Future versions of FRED might make this
+ * dynamic, but for now it is always 2 words longer.
  */
 #ifdef CONFIG_X86_32
 # ifdef CONFIG_VM86
@@ -39,8 +41,12 @@
 # else
 #  define TOP_OF_KERNEL_STACK_PADDING 8
 # endif
-#else
-# define TOP_OF_KERNEL_STACK_PADDING 0
+#else /* x86-64 */
+# ifdef CONFIG_X86_FRED
+#  define TOP_OF_KERNEL_STACK_PADDING (2 * 8)
+# else
+#  define TOP_OF_KERNEL_STACK_PADDING 0
+# endif
 #endif
 
 /*
-- 
2.43.0

[PATCH v13 19/35] x86/fred: No ESPFIX needed when FRED is enabled

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/kernel/espfix_64.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 16f9814c9be0..6726e0473d0b 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -106,6 +106,10 @@ void __init init_espfix_bsp(void)
pgd_t *pgd;
p4d_t *p4d;
 
+   /* FRED systems always restore the full value of %rsp */
+   if (cpu_feature_enabled(X86_FEATURE_FRED))
+   return;
+
/* Install the espfix pud into the kernel page directory */
pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
@@ -129,6 +133,10 @@ void init_espfix_ap(int cpu)
void *stack_page;
pteval_t ptemask;
 
+   /* FRED systems always restore the full value of %rsp */
+   if (cpu_feature_enabled(X86_FEATURE_FRED))
+   return;
+
/* We only have to do this once... */
if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
-- 
2.43.0

[PATCH v13 21/35] x86/fred: Make exc_page_fault() work for FRED

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

On a FRED system, the faulting address (CR2) is passed on the stack,
to avoid the problem of transient state.  Thus the page fault address
is read from the FRED stack frame instead of CR2 when FRED is enabled.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v12:
* No "we" in a commit message, use passive voice (Borislav Petkov).
---
 arch/x86/mm/fault.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ab778eac1952..7675bc067153 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -34,6 +34,7 @@
 #include   /* kvm_handle_async_pf  */
 #include   /* fixup_vdso_exception()   */
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -1516,8 +1517,10 @@ handle_page_fault(struct pt_regs *regs, unsigned long 
error_code,
 
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
-   unsigned long address = read_cr2();
irqentry_state_t state;
+   unsigned long address;
+
+   address = cpu_feature_enabled(X86_FEATURE_FRED) ? fred_event_data(regs) : read_cr2();
 
prefetchw(&current->mm->mmap_lock);
 
-- 
2.43.0

[PATCH v13 18/35] x86/fred: Disallow the swapgs instruction when FRED is enabled

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

SWAPGS is no longer needed, and thus NOT allowed, with FRED because FRED
transitions ensure that an operating system can _always_ operate
with its own GS base address:
- For events that occur in ring 3, FRED event delivery swaps the GS
  base address with the IA32_KERNEL_GS_BASE MSR.
- ERETU (the FRED transition that returns to ring 3) also swaps the
  GS base address with the IA32_KERNEL_GS_BASE MSR.

And the operating system can still set up the GS segment for a user
thread, without loading a user thread GS, by:
- Using LKGS, available with FRED, to modify other attributes of the
  GS segment without compromising its ability to always operate with
  its own GS base address.
- Accessing the GS segment base address for a user thread as before
  using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.

Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR
instead of the GS segment's descriptor cache. As such, the operating
system never changes its runtime GS base address.
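
The invariant described above can be illustrated with a toy model of the
two base registers. This is only a sketch of the behavior stated in this
commit message; the struct and function names are hypothetical and nothing
here touches real MSRs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy CPU state: current ring, active GS base, IA32_KERNEL_GS_BASE. */
struct gs_model {
	int ring;
	uint64_t gs_base;
	uint64_t kernel_gs_base;
};

static void swap_gs_bases(struct gs_model *c)
{
	uint64_t tmp = c->gs_base;

	c->gs_base = c->kernel_gs_base;
	c->kernel_gs_base = tmp;
}

/* FRED event delivery for an event that occurs in ring 3. */
static void fred_deliver_from_user(struct gs_model *c)
{
	assert(c->ring == 3);
	swap_gs_bases(c);
	c->ring = 0;
}

/* ERETU: return to ring 3, swapping the bases back. */
static void eretu_model(struct gs_model *c)
{
	assert(c->ring == 0);
	swap_gs_bases(c);
	c->ring = 3;
}

/* LKGS: the new user base goes to the MSR, not to the active GS base. */
static void lkgs_model(struct gs_model *c, uint64_t user_base)
{
	assert(c->ring == 0);
	c->kernel_gs_base = user_base;
}
```

In this model the kernel's active GS base is the same before and after
lkgs_model(), and the new user base only becomes active at eretu_model(),
so no explicit SWAPGS step is ever required.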

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v12:
* Use the ASCII char ' (char number 0x27), instead of its unicode char
  (Borislav Petkov).

Change since v8:
* Explain why writing directly to the IA32_KERNEL_GS_BASE MSR is
  doing the right thing (Thomas Gleixner).
---
 arch/x86/kernel/process_64.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b924477c5ba8..7f66c0b14de6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -166,7 +166,29 @@ static noinstr unsigned long __rdgsbase_inactive(void)
 
lockdep_assert_irqs_disabled();
 
-   if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+   /*
+* SWAPGS is no longer needed thus NOT allowed with FRED because
+* FRED transitions ensure that an operating system can _always_
+* operate with its own GS base address:
+* - For events that occur in ring 3, FRED event delivery swaps
+*   the GS base address with the IA32_KERNEL_GS_BASE MSR.
+* - ERETU (the FRED transition that returns to ring 3) also swaps
+*   the GS base address with the IA32_KERNEL_GS_BASE MSR.
+*
+* And the operating system can still setup the GS segment for a
+* user thread without the need of loading a user thread GS with:
+* - Using LKGS, available with FRED, to modify other attributes
+*   of the GS segment without compromising its ability always to
+*   operate with its own GS base address.
+* - Accessing the GS segment base address for a user thread as
+*   before using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.
+*
+* Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE
+* MSR instead of the GS segment's descriptor cache. As such, the
+* operating system never changes its runtime GS base address.
+*/
+   if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+   !cpu_feature_enabled(X86_FEATURE_XENPV)) {
native_swapgs();
gsbase = rdgsbase();
native_swapgs();
@@ -191,7 +213,8 @@ static noinstr void __wrgsbase_inactive(unsigned long 
gsbase)
 {
lockdep_assert_irqs_disabled();
 
-   if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+   if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+   !cpu_feature_enabled(X86_FEATURE_XENPV)) {
native_swapgs();
wrgsbase(gsbase);
native_swapgs();
-- 
2.43.0

[PATCH v13 20/35] x86/fred: Allow single-step trap and NMI when starting a new task

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Entering a new task is logically speaking a return from a system call
(exec, fork, clone, etc.). As such, if ptrace enables single stepping,
a single-step exception should be allowed to trigger immediately upon
entering user space. This is not optional.

NMI should *never* be disabled in user space. As such, this is an
optional, opportunistic way to catch errors.

Allow single-step trap and NMI when starting a new task, thus once
the new task enters user space, single-step trap and NMI are both
enabled immediately.
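
The resulting initial frame values can be sketched as plain bit
arithmetic. The SKETCH_ names and the struct are hypothetical; the bit
positions for swevent (17) and nmi (18) are derived from the field order
of struct fred_ss earlier in this series, assuming bit-fields are allocated
from the least significant bit, and the EFLAGS bits are the architectural
IF (bit 9) and always-set bit 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EFLAGS_FIXED	(1ULL << 1)	/* always-one EFLAGS bit */
#define SKETCH_EFLAGS_IF	(1ULL << 9)	/* interrupts enabled */
#define SKETCH_SSX_SWEVENT	(1ULL << 17)	/* fred_ss.swevent */
#define SKETCH_SSX_NMI		(1ULL << 18)	/* fred_ss.nmi */

/* Both flags live above the 16-bit selector. */
static_assert(((SKETCH_SSX_SWEVENT | SKETCH_SSX_NMI) & 0xffff) == 0,
	      "FRED flags do not overlap the SS selector");

/* The three slots of the return frame that this patch initializes. */
struct new_task_frame {
	uint64_t csx;
	uint64_t flags;
	uint64_t ssx;
};

static struct new_task_frame init_frame(uint16_t cs, uint16_t ss, bool fred)
{
	struct new_task_frame f = {
		.csx   = cs,	/* assigning the full slot clears the upper bits */
		.ssx   = ss,
		.flags = SKETCH_EFLAGS_IF | SKETCH_EFLAGS_FIXED,
	};

	/* The upper SS bits are only used when FRED is enabled. */
	if (fred)
		f.ssx |= SKETCH_SSX_SWEVENT | SKETCH_SSX_NMI;

	return f;
}
```

Writing the selectors through csx/ssx rather than cs/ss is what guarantees
that no stale upper bits survive into the new task's frame.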

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v8:
* Use high-order 48 bits above the lowest 16 bit SS only when FRED
  is enabled (Thomas Gleixner).
---
 arch/x86/kernel/process_64.c | 38 ++--
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 7f66c0b14de6..7062b84dd467 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_IA32_EMULATION
 /* Not included via unistd.h */
 #include 
@@ -528,7 +529,7 @@ void x86_gsbase_write_task(struct task_struct *task, 
unsigned long gsbase)
 static void
 start_thread_common(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp,
-   unsigned int _cs, unsigned int _ss, unsigned int _ds)
+   u16 _cs, u16 _ss, u16 _ds)
 {
WARN_ON_ONCE(regs != current_pt_regs());
 
@@ -545,11 +546,36 @@ start_thread_common(struct pt_regs *regs, unsigned long 
new_ip,
loadsegment(ds, _ds);
load_gs_index(0);
 
-   regs->ip= new_ip;
-   regs->sp= new_sp;
-   regs->cs= _cs;
-   regs->ss= _ss;
-   regs->flags = X86_EFLAGS_IF;
+   regs->ip= new_ip;
+   regs->sp= new_sp;
+   regs->csx   = _cs;
+   regs->ssx   = _ss;
+   /*
+* Allow single-step trap and NMI when starting a new task, thus
+* once the new task enters user space, single-step trap and NMI
+* are both enabled immediately.
+*
+* Entering a new task is logically speaking a return from a
+* system call (exec, fork, clone, etc.). As such, if ptrace
+* enables single stepping a single step exception should be
+* allowed to trigger immediately upon entering user space.
+* This is not optional.
+*
+* NMI should *never* be disabled in user space. As such, this
+* is an optional, opportunistic way to catch errors.
+*
+* Paranoia: High-order 48 bits above the lowest 16 bit SS are
+* discarded by the legacy IRET instruction on all Intel, AMD,
+* and Cyrix/Centaur/VIA CPUs, thus can be set unconditionally,
+* even when FRED is not enabled. But we choose the safer side
+* to use these bits only when FRED is enabled.
+*/
+   if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+   regs->fred_ss.swevent   = true;
+   regs->fred_ss.nmi   = true;
+   }
+
+   regs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
 }
 
 void
-- 
2.43.0

[PATCH v13 24/35] x86/fred: Add a NMI entry stub for FRED

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

On a FRED system, NMIs nest both with themselves and with faults,
transient information is saved in the stack frame, and NMI unblocking
happens only when the stack frame indicates that it should.

Thus, the NMI entry stub for FRED is really quite small...

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/kernel/nmi.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 17e955ab69fe..56350d839e44 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -651,6 +652,33 @@ void nmi_backtrace_stall_check(const struct cpumask *btp)
 
 #endif
 
+#ifdef CONFIG_X86_FRED
+/*
+ * With FRED, CR2/DR6 is pushed to #PF/#DB stack frame during FRED
+ * event delivery, i.e., there is no problem of transient states.
+ * And NMI unblocking only happens when the stack frame indicates
+ * that so should happen.
+ *
+ * Thus, the NMI entry stub for FRED is really straightforward and
+ * as simple as most exception handlers. As such, #DB is allowed
+ * during NMI handling.
+ */
+DEFINE_FREDENTRY_NMI(exc_nmi)
+{
+   irqentry_state_t irq_state;
+
+   if (IS_ENABLED(CONFIG_SMP) && arch_cpu_is_offline(smp_processor_id()))
+   return;
+
+   irq_state = irqentry_nmi_enter(regs);
+
+   inc_irq_stat(__nmi_count);
+   default_do_nmi(regs);
+
+   irqentry_nmi_exit(regs, irq_state);
+}
+#endif
+
 void stop_nmi(void)
 {
ignore_nmis++;
-- 
2.43.0

[PATCH v13 23/35] x86/fred: Add a debug fault entry stub for FRED

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

When it occurs at different ring levels, i.e., from user or kernel context,
#DB needs to be handled on different stacks: user #DB on the current task
stack, and kernel #DB on a dedicated stack. This is exactly how FRED
event delivery invokes an exception handler: a ring 3 event on the level 0
stack, i.e., the current task stack; a ring 0 event on the #DB dedicated
stack specified in the IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug
exception entry stub doesn't do a stack switch.

On a FRED system, the debug trap status information (DR6) is passed on
the stack, to avoid the problem of transient state. Furthermore, FRED
transitions avoid a lot of ugly corner cases the handling of which can,
and should be, skipped.

The FRED debug trap status information saved on the stack differs from
DR6 in both stickiness and polarity; it is exactly in the format which
debug_read_clear_dr6() returns for the IDT entry points.
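
To make the polarity point concrete, the sketch below shows the kind of
conversion the IDT path has to perform on a raw DR6 value, which the FRED
event data does not need. The constants are an assumption based on my
reading of the kernel's debugreg definitions (DR6_RESERVED as 0xFFFF0FF0
and DR_STEP as 0x4000); they are not stated in this patch, and the
function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, see the note above; not taken from this patch. */
#define SKETCH_DR6_RESERVED	0xFFFF0FF0UL	/* bits that read as 1 when idle */
#define SKETCH_DR_STEP		0x4000UL	/* single-step (BS) bit */

/* The single-step bit is a normal-polarity bit outside the reserved mask. */
static_assert((SKETCH_DR_STEP & SKETCH_DR6_RESERVED) == 0,
	      "DR_STEP is not among the inverted bits");

/*
 * Flip a raw DR6 value to positive polarity, so that a set bit always
 * means "this condition was reported".
 */
static unsigned long dr6_to_positive(unsigned long raw_dr6)
{
	return raw_dr6 ^ SKETCH_DR6_RESERVED;
}
```

Under these assumptions an idle raw DR6 converts to 0 and a single-step
trap converts to DR_STEP alone, which is the form the FRED entry stub can
pass straight to the common handlers.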

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v9:
* Disable #DB to avoid endless recursion and stack overflow when a
  watchpoint/breakpoint is set in the code path which is executed by
  #DB handler (Thomas Gleixner).

Changes since v1:
* call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
  handler (Peter Zijlstra).
---
 arch/x86/kernel/traps.c | 43 -
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c876f1d36a81..848c85208a57 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -934,8 +935,7 @@ static bool notify_debug(struct pt_regs *regs, unsigned 
long *dr6)
return false;
 }
 
-static __always_inline void exc_debug_kernel(struct pt_regs *regs,
-unsigned long dr6)
+static noinstr void exc_debug_kernel(struct pt_regs *regs, unsigned long dr6)
 {
/*
 * Disable breakpoints during exception handling; recursive exceptions
@@ -947,6 +947,11 @@ static __always_inline void exc_debug_kernel(struct 
pt_regs *regs,
 *
 * Entry text is excluded for HW_BP_X and cpu_entry_area, which
 * includes the entry stack is excluded for everything.
+*
+* For FRED, nested #DB should just work fine. But when a watchpoint or
+* breakpoint is set in the code path which is executed by #DB handler,
+* it results in an endless recursion and stack overflow. Thus we stay
+* with the IDT approach, i.e., save DR7 and disable #DB.
 */
unsigned long dr7 = local_db_save();
irqentry_state_t irq_state = irqentry_nmi_enter(regs);
@@ -976,7 +981,8 @@ static __always_inline void exc_debug_kernel(struct pt_regs 
*regs,
 * Catch SYSENTER with TF set and clear DR_STEP. If this hit a
 * watchpoint at the same time then that will still be handled.
 */
-   if ((dr6 & DR_STEP) && is_sysenter_singlestep(regs))
+   if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+   (dr6 & DR_STEP) && is_sysenter_singlestep(regs))
dr6 &= ~DR_STEP;
 
/*
@@ -1008,8 +1014,7 @@ static __always_inline void exc_debug_kernel(struct 
pt_regs *regs,
local_db_restore(dr7);
 }
 
-static __always_inline void exc_debug_user(struct pt_regs *regs,
-  unsigned long dr6)
+static noinstr void exc_debug_user(struct pt_regs *regs, unsigned long dr6)
 {
bool icebp;
 
@@ -1093,6 +1098,34 @@ DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
 {
exc_debug_user(regs, debug_read_clear_dr6());
 }
+
+#ifdef CONFIG_X86_FRED
+/*
+ * When occurred on different ring level, i.e., from user or kernel
+ * context, #DB needs to be handled on different stack: User #DB on
+ * current task stack, while kernel #DB on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception
+ * handler: ring 3 event on level 0 stack, i.e., current task stack;
+ * ring 0 event on the #DB dedicated stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug exception
+ * entry stub doesn't do stack switch.
+ */
+DEFINE_FREDENTRY_DEBUG(exc_debug)
+{
+   /*
+* FRED #DB stores DR6 on the stack in the format which
+* debug_read_clear_dr6() returns for the IDT entry points.
+*/
+   unsigned long dr6 = fred_event_data(regs);
+
+   if (user_mode(regs))
+   exc_debug_user(regs, dr6);
+   else
+   exc_debug_kernel(regs, dr6);
+}
+#endif /* CONFIG_X86_FRED */
+
 #else
 /* 32 bit does not have separate entry points. */
 DEFINE_IDTENTRY_RAW(exc_debug)
-- 
2.43.0

[PATCH v13 25/35] x86/fred: Add a machine check entry stub for FRED

2023-12-05 Thread Xin Li

Like #DB, when it occurs at different ring levels, i.e., from user or kernel
context, #MCE needs to be handled on different stacks: user #MCE on the
current task stack, and kernel #MCE on a dedicated stack.

This is exactly how FRED event delivery invokes an exception handler: a ring
3 event on the level 0 stack, i.e., the current task stack; a ring 0 event on
the #MCE dedicated stack specified in the IA32_FRED_STKLVLS MSR. So unlike
IDT, the FRED machine check entry stub doesn't do a stack switch.

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v5:
* Disallow #DB inside #MCE for robustness sake (Peter Zijlstra).
---
 arch/x86/kernel/cpu/mce/core.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1642018dd6c9..d524eb87f76c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -2150,6 +2151,31 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
exc_machine_check_user(regs);
local_db_restore(dr7);
 }
+
+#ifdef CONFIG_X86_FRED
+/*
+ * When it occurs at different ring levels, i.e., from user or kernel
+ * context, #MCE needs to be handled on different stacks: user #MCE
+ * on the current task stack, and kernel #MCE on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception
+ * handler: a ring 3 event on the level 0 stack, i.e., the current task
+ * stack; a ring 0 event on the dedicated #MCE stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED machine check entry
+ * stub doesn't do a stack switch.
+ */
+DEFINE_FREDENTRY_MCE(exc_machine_check)
+{
+   unsigned long dr7;
+
+   dr7 = local_db_save();
+	if (user_mode(regs))
+		exc_machine_check_user(regs);
+	else
+		exc_machine_check_kernel(regs);
+	local_db_restore(dr7);
+}
+#endif
 #else
 /* 32bit unified entry point */
 DEFINE_IDTENTRY_RAW(exc_machine_check)
-- 
2.43.0

[PATCH v13 22/35] x86/idtentry: Incorporate definitions/declarations of the FRED entries

2023-12-05 Thread Xin Li

FRED and IDT can share most of the definitions and declarations so
that in the majority of cases the actual handler implementation is the
same.

The differences are the exceptions where FRED stores exception related
information on the stack and the sysvec implementations as FRED can
handle irqentry/exit() in the dispatcher instead of having it in each
handler.

Also add stub defines for vectors which are not used due to Kconfig
decisions to spare the ifdeffery in the actual FRED dispatch code.
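
The sysvec sharing described above can be pictured with a small user-space
model. This is only a sketch under stated assumptions: DEFINE_SYSVEC_MODEL,
model_irqentry_enter()/model_irqentry_exit() and the counters are hypothetical
stand-ins, not the kernel macros or APIs. It shows one handler body emitted
once and reached through two wrappers, where only the IDT-style wrapper does
the enter/exit bookkeeping itself.

```c
/* Hypothetical counters standing in for irqentry_enter()/irqentry_exit(). */
static int enter_count, exit_count, body_count;

static void model_irqentry_enter(void) { enter_count++; }
static void model_irqentry_exit(void)  { exit_count++; }

/*
 * One macro emits the functions around a single handler body:
 *  - instr_<func>():  the shared, instrumentable work
 *  - <func>():        IDT-style entry, does enter/exit itself
 *  - fred_<func>():   FRED-style entry, assumes the dispatcher
 *                     already did enter/exit
 *  - body_<func>():   the handler body written by the user
 */
#define DEFINE_SYSVEC_MODEL(func)					\
static void body_##func(int vector);					\
static void instr_##func(int vector) { body_##func(vector); }		\
void func(int vector)							\
{									\
	model_irqentry_enter();						\
	instr_##func(vector);						\
	model_irqentry_exit();						\
}									\
void fred_##func(int vector) { instr_##func(vector); }			\
static void body_##func(int vector)

/* Example handler: the body only counts how often it ran. */
DEFINE_SYSVEC_MODEL(sysvec_demo)
{
	(void)vector;
	body_count++;
}
```

Calling sysvec_demo() bumps all three counters, while fred_sysvec_demo() bumps
only body_count, which mirrors the split where the FRED dispatcher owns the
entry/exit work.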

Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Change since v9:
* Except NMI/#DB/#MCE, FRED really should share the exception handlers
  with IDT (Thomas Gleixner).

Change since v8:
* Put IDTENTRY changes in a separate patch (Thomas Gleixner).
---
 arch/x86/include/asm/idtentry.h | 71 +
 1 file changed, 63 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index cfca68f6cb84..4f26ee9b8b74 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -13,15 +13,18 @@
 
 #include 
 
+typedef void (*idtentry_t)(struct pt_regs *regs);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *   No error code pushed by hardware
  * @vector:Vector number (ignored for C)
  * @func:  Function name of the entry point
  *
- * Declares three functions:
+ * Declares four functions:
  * - The ASM entry point: asm_##func
  * - The XEN PV trap entry point: xen_##func (maybe unused)
+ * - The C handler called from the FRED event dispatcher (maybe unused)
  * - The C handler called from the ASM entry point
  *
  * Note: This is the C variant of DECLARE_IDTENTRY(). As the name says it
@@ -31,6 +34,7 @@
 #define DECLARE_IDTENTRY(vector, func) \
asmlinkage void asm_##func(void);   \
asmlinkage void xen_asm_##func(void);   \
+   void fred_##func(struct pt_regs *regs); \
__visible void func(struct pt_regs *regs)
 
 /**
@@ -137,6 +141,17 @@ static __always_inline void __##func(struct pt_regs *regs, \
 #define DEFINE_IDTENTRY_RAW(func)  \
 __visible noinstr void func(struct pt_regs *regs)
 
+/**
+ * DEFINE_FREDENTRY_RAW - Emit code for raw FRED entry points
+ * @func:  Function name of the entry point
+ *
+ * @func is called from the FRED event dispatcher with interrupts disabled.
+ *
+ * See @DEFINE_IDTENTRY_RAW for further details.
+ */
+#define DEFINE_FREDENTRY_RAW(func) \
+noinstr void fred_##func(struct pt_regs *regs)
+
 /**
  * DECLARE_IDTENTRY_RAW_ERRORCODE - Declare functions for raw IDT entry points
  * Error code pushed by hardware
@@ -233,17 +248,27 @@ static noinline void __##func(struct pt_regs *regs, u32 vector)
 #define DEFINE_IDTENTRY_SYSVEC(func)   \
 static void __##func(struct pt_regs *regs);\
\
+static __always_inline void instr_##func(struct pt_regs *regs) \
+{  \
+   kvm_set_cpu_l1tf_flush_l1d();   \
+   run_sysvec_on_irqstack_cond(__##func, regs);\
+}  \
+   \
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
irqentry_state_t state = irqentry_enter(regs);  \
\
instrumentation_begin();\
-   kvm_set_cpu_l1tf_flush_l1d();   \
-   run_sysvec_on_irqstack_cond(__##func, regs);\
+   instr_##func (regs);\
instrumentation_end();  \
irqentry_exit(regs, state); \
 }  \
\
+void fred_##func(struct pt_regs *regs) \
+{  \
+   instr_##func (regs);\
+}  \
+   \
 static noinline void __##func(struct pt_regs *regs)
 
 /**
@@ -260,19 +285,29 @@ static

[PATCH v13 28/35] x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled,
otherwise the existing IDT code is chosen.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/entry/entry_64.S  | 6 ++
 arch/x86/entry/entry_64_fred.S | 1 +
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5a1660701623..87d817296dcb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -247,7 +247,13 @@ SYM_CODE_START(ret_from_fork_asm)
 * and unwind should work normally.
 */
UNWIND_HINT_REGS
+
+#ifdef CONFIG_X86_FRED
+   ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+   "jmp asm_fred_exit_user", X86_FEATURE_FRED
+#else
jmp swapgs_restore_regs_and_return_to_usermode
+#endif
 SYM_CODE_END(ret_from_fork_asm)
 .popsection
 
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 37a1dd5e8ace..5781c3411b44 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -32,6 +32,7 @@
 SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
 	FRED_ENTER
 	call	fred_entry_from_user
+SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
 	FRED_EXIT
 	ERETU
 SYM_CODE_END(asm_fred_entrypoint_user)
-- 
2.43.0

[PATCH v13 29/35] x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user

2023-12-05 Thread Xin Li

If the stack frame contains an invalid user context (e.g. due to invalid SS,
a non-canonical RIP, etc.) the ERETU instruction will fault (#SS or #GP).

From a Linux point of view, this really should be considered a user space
failure, so use the standard fault fixup mechanism to intercept the fault,
fix up the exception frame, and redirect execution to fred_entrypoint_user.
The end result is that it appears just as if the hardware had taken the
exception immediately after completing the transition to user space.
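
The pointer arithmetic behind the fixup can be sketched in user space. This
is a simplified model, not the kernel code: struct model_pt_regs and
frame_to_regs() are hypothetical names, and the layout only assumes that the
error code slot (orig_ax) is the lowest word of the hardware-pushed return
frame, preceded by 15 software-saved registers. Stepping back by
offsetof(..., orig_ax) from the frame address yields a pt_regs-style pointer
whose tail overlays the ERETU return frame.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct pt_regs (assumed layout, see lead-in). */
struct model_pt_regs {
	uint64_t gprs[15];	/* software-saved general purpose registers */
	uint64_t orig_ax;	/* error code slot: lowest word of the return frame */
	uint64_t ip;
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;
	uint64_t ss;
};

/*
 * Given the address of the error code slot of a return frame, recover a
 * pointer to the register structure whose tail overlays that frame.
 */
static struct model_pt_regs *frame_to_regs(uintptr_t frame_base)
{
	return (struct model_pt_regs *)
		(frame_base - offsetof(struct model_pt_regs, orig_ax));
}
```

With such a pointer, fields like cs and ss of the faulting return frame can
be read by name instead of by raw stack offsets.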

Suggested-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v8:
* Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
  before popping the return context from the stack.

Changes since v6:
* Add a comment to explain why it is safe to write to the previous FRED stack
  frame. (Lai Jiangshan).

Changes since v5:
* Move the NMI bit from an invalid stack frame, which caused ERETU to fault,
  to the fault handler's stack frame, thus to unblock NMI ASAP if NMI is blocked
  (Lai Jiangshan).
---
 arch/x86/entry/entry_64_fred.S |  5 +-
 arch/x86/include/asm/extable_fixup_types.h |  4 +-
 arch/x86/mm/extable.c  | 79 ++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 5781c3411b44..d1c2fc4af8ae 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -3,6 +3,7 @@
  * The actual FRED entry points.
  */
 
+#include 
 #include 
 
 #include "calling.h"
@@ -34,7 +35,9 @@ SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
 	call	fred_entry_from_user
 SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
FRED_EXIT
-   ERETU
+1: ERETU
+
+   _ASM_EXTABLE_TYPE(1b, asm_fred_entrypoint_user, EX_TYPE_ERETU)
 SYM_CODE_END(asm_fred_entrypoint_user)
 
 .fill asm_fred_entrypoint_kernel - ., 1, 0xcc
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 991e31cfde94..1585c798a02f 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -64,6 +64,8 @@
 #define	EX_TYPE_UCOPY_LEN4		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
 #define	EX_TYPE_UCOPY_LEN8		(EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))
 
-#define EX_TYPE_ZEROPAD		20 /* longword load with zeropad on fault */
+#define	EX_TYPE_ZEROPAD			20 /* longword load with zeropad on fault */
+
+#define	EX_TYPE_ERETU			21
 
 #endif
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 271dcb2deabc..fc40a4e12f3a 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -6,6 +6,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -223,6 +224,80 @@ static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
return ex_handler_uaccess(fixup, regs, trapnr, fault_address);
 }
 
+#ifdef CONFIG_X86_FRED
+static bool ex_handler_eretu(const struct exception_table_entry *fixup,
+struct pt_regs *regs, unsigned long error_code)
+{
+   struct pt_regs *uregs = (struct pt_regs *)
+   (regs->sp - offsetof(struct pt_regs, orig_ax));
+   unsigned short ss = uregs->ss;
+   unsigned short cs = uregs->cs;
+
+   /*
+* Move the NMI bit from the invalid stack frame, which caused ERETU
+* to fault, to the fault handler's stack frame, thus to unblock NMI
+* with the fault handler's ERETS instruction ASAP if NMI is blocked.
+*/
+   regs->fred_ss.nmi = uregs->fred_ss.nmi;
+
+   /*
+* Sync event information to uregs, i.e., the ERETU return frame, but
+* is it safe to write to the ERETU return frame which is just above
+* current event stack frame?
+*
+* The RSP used by FRED to push a stack frame is not the value in %rsp,
+* it is calculated from %rsp with the following 2 steps:
+* 1) RSP = %rsp - (IA32_FRED_CONFIG & 0x1c0)   // Reserve N*64 bytes
+* 2) RSP = RSP & ~0x3f // Align to a 64-byte cache line
+* when an event delivery doesn't trigger a stack level change.
+*
+* Here is an example with N*64 (N=1) bytes reserved:
+*
+	 *  64-byte cache line ==>  ______________
+	 *			   |___Reserved___|
+	 *			   |__Event_data__|
+	 *			   |_____SS_______|
+	 *			   |_____RSP______|
+	 *			   |_____FLAGS____|
+	 *			   |_____CS_______|
+	 *			   |_____IP_______|
+	 *  64-byte cache line ==> |__Error_code__| <== ERETU return frame
+	 *			   |______________|
+	 *			   |______________|
+

[PATCH v13 27/35] x86/traps: Add sysvec_install() to install a system interrupt handler

2023-12-05 Thread Xin Li

Add sysvec_install() to install a system interrupt handler into the IDT
or the FRED system interrupt handler table.
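
The table side of this can be modelled in a few lines of user-space C. The
names below (model_install_sysvec(), model_sysvec_table, and the value of
MODEL_FIRST_SYSTEM_VECTOR) are hypothetical and chosen for illustration; the
real first system vector depends on the kernel configuration. The sketch
mirrors the three checks in the patch: reject vectors below the system range,
reject installs after setup has completed, and refuse to overwrite an
existing entry. The upper bound check is an addition of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_VECTORS		256
#define MODEL_FIRST_SYSTEM_VECTOR	0xec	/* hypothetical value */
#define MODEL_NR_SYSTEM_VECTORS \
	(MODEL_NR_VECTORS - MODEL_FIRST_SYSTEM_VECTOR)

typedef void (*model_handler_t)(unsigned int vector);

static model_handler_t model_sysvec_table[MODEL_NR_SYSTEM_VECTORS];
static bool model_setup_done;

/* Trivial handler used only to have something to install. */
static void model_demo_handler(unsigned int vector) { (void)vector; }

/* Returns true if the handler was installed, false if it was rejected. */
static bool model_install_sysvec(unsigned int vector, model_handler_t handler)
{
	/* Only system vectors live in this table (upper bound: sketch only). */
	if (vector < MODEL_FIRST_SYSTEM_VECTOR || vector >= MODEL_NR_VECTORS)
		return false;

	/* No installs once setup has been completed. */
	if (model_setup_done)
		return false;

	/* Never overwrite an already installed handler. */
	if (model_sysvec_table[vector - MODEL_FIRST_SYSTEM_VECTOR])
		return false;

	model_sysvec_table[vector - MODEL_FIRST_SYSTEM_VECTOR] = handler;
	return true;
}

/* Marks the table as final; later installs are rejected. */
static void model_complete_setup(void)
{
	model_setup_done = true;
}
```

Indexing by (vector - first system vector) keeps the table small, since only
the top of the vector space is reserved for system interrupts.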

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v8:
* Introduce a macro sysvec_install() to derive the asm handler name from
  a C handler, which simplifies the code and avoids an ugly typecast
  (Thomas Gleixner).
---
 arch/x86/entry/entry_fred.c  | 14 ++
 arch/x86/include/asm/desc.h  |  2 --
 arch/x86/include/asm/idtentry.h  | 15 +++
 arch/x86/kernel/cpu/acrn.c   |  4 ++--
 arch/x86/kernel/cpu/mshyperv.c   | 15 +++
 arch/x86/kernel/idt.c|  4 ++--
 arch/x86/kernel/kvm.c|  2 +-
 drivers/xen/events/events_base.c |  2 +-
 8 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index 215883e90f94..e80e3efbc057 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -126,6 +126,20 @@ static idtentry_t sysvec_table[NR_SYSTEM_VECTORS] __ro_after_init = {
SYSVEC(POSTED_INTR_NESTED_VECTOR,   kvm_posted_intr_nested_ipi),
 };
 
+static bool fred_setup_done __initdata;
+
+void __init fred_install_sysvec(unsigned int sysvec, idtentry_t handler)
+{
+	if (WARN_ON_ONCE(sysvec < FIRST_SYSTEM_VECTOR))
+		return;
+
+	if (WARN_ON_ONCE(fred_setup_done))
+		return;
+
+	if (!WARN_ON_ONCE(sysvec_table[sysvec - FIRST_SYSTEM_VECTOR]))
+		sysvec_table[sysvec - FIRST_SYSTEM_VECTOR] = handler;
+}
+
 static noinstr void fred_extint(struct pt_regs *regs)
 {
unsigned int vector = regs->fred_ss.vector;
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ab97b22ac04a..ec95fe44fa3a 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -402,8 +402,6 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
desc->limit1 = (limit >> 16) & 0xf;
 }
 
-void alloc_intr_gate(unsigned int n, const void *addr);
-
 static inline void init_idt_data(struct idt_data *data, unsigned int n,
 const void *addr)
 {
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4f26ee9b8b74..650c98160152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -459,6 +459,21 @@ __visible noinstr void func(struct pt_regs *regs, \
 #define DEFINE_FREDENTRY_DEBUG DEFINE_FREDENTRY_RAW
 #endif
 
+void idt_install_sysvec(unsigned int n, const void *function);
+
+#ifdef CONFIG_X86_FRED
+void fred_install_sysvec(unsigned int vector, const idtentry_t function);
+#else
+static inline void fred_install_sysvec(unsigned int vector, const idtentry_t function) { }
+#endif
+
+#define sysvec_install(vector, function) {				\
+	if (cpu_feature_enabled(X86_FEATURE_FRED))			\
+		fred_install_sysvec(vector, function);			\
+	else								\
+		idt_install_sysvec(vector, asm_##function);		\
+}
+
 #else /* !__ASSEMBLY__ */
 
 /*
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index bfeb18fad63f..2c5b51aad91a 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -26,8 +26,8 @@ static u32 __init acrn_detect(void)
 
 static void __init acrn_init_platform(void)
 {
-	/* Setup the IDT for ACRN hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_acrn_hv_callback);
+	/* Install system interrupt handler for ACRN hypervisor callback */
+	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
 
x86_platform.calibrate_tsc = acrn_get_tsc_khz;
x86_platform.calibrate_cpu = acrn_get_tsc_khz;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 01fa06dd06b6..45e0e70e238c 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -539,19 +539,18 @@ static void __init ms_hyperv_init_platform(void)
 */
x86_platform.apic_post_init = hyperv_init;
hyperv_setup_mmu_ops();
-	/* Setup the IDT for hypervisor callback */
-	alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_hyperv_callback);
 
-	/* Setup the IDT for reenlightenment notifications */
+	/* Install system interrupt handler for hypervisor callback */
+	sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_hyperv_callback);
+
+	/* Install system interrupt handler for reenlightenment notifications */
 	if (ms_hyperv.features & HV_ACCESS_REENLIGHTENMENT) {
-		alloc_intr_gate(HYPERV_REENLIGHTENMENT_VECTOR,
-				asm_sysvec_hyperv_reenlightenment);
+		sysvec_install(HYPERV_REENLIGHTENMENT_VECTOR, sysvec_hyperv_reenlightenment);
 	}
 
-   /*

[PATCH v13 26/35] x86/fred: FRED entry/exit and dispatch code

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add the code that actually handles kernel and event entry/exit using
FRED. It is split into two files:

- entry_64_fred.S contains the actual entrypoints and exit code, and
  saves and restores registers.
- entry_fred.c contains the two-level event dispatch code for FRED.
  The first-level dispatch is on the event type, and the second-level
  is on the event vector.
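
The two-level dispatch can be sketched as nested switch statements, which
fits the v9 note about avoiding jump tables. Everything here is a user-space
model: the enum names and their encodings are hypothetical and are not the
hardware event type values, and the returned handler IDs merely stand in for
calls into the real handlers. The vector numbers used (#DB = 1, NMI = 2,
#PF = 14, #MC = 18) are the architectural ones, and vector 1 of the "other"
class follows the FRED_SYSCALL define in this patch.

```c
/* Hypothetical event classes; encodings are illustrative only. */
enum model_event_type {
	MODEL_EVENT_EXTINT,
	MODEL_EVENT_NMI,
	MODEL_EVENT_HWEXC,
	MODEL_EVENT_OTHER,
};

/* Stand-ins for "which handler would run". */
enum model_handler {
	MODEL_H_BAD_EVENT,
	MODEL_H_EXTINT,
	MODEL_H_NMI,
	MODEL_H_DEBUG,
	MODEL_H_PAGE_FAULT,
	MODEL_H_MACHINE_CHECK,
	MODEL_H_SYSCALL,
};

/* Second level: hardware exceptions, keyed by vector. */
static enum model_handler model_dispatch_hwexc(unsigned int vector)
{
	switch (vector) {
	case 1:  return MODEL_H_DEBUG;
	case 14: return MODEL_H_PAGE_FAULT;
	case 18: return MODEL_H_MACHINE_CHECK;
	default: return MODEL_H_BAD_EVENT;
	}
}

/* Second level: "other" events such as SYSCALL, keyed by vector. */
static enum model_handler model_dispatch_other(unsigned int vector)
{
	return vector == 1 ? MODEL_H_SYSCALL : MODEL_H_BAD_EVENT;
}

/* First level: switch on the event type, then hand off to the second level. */
static enum model_handler model_dispatch(enum model_event_type type,
					 unsigned int vector)
{
	switch (type) {
	case MODEL_EVENT_EXTINT: return MODEL_H_EXTINT;
	case MODEL_EVENT_NMI:
		return vector == 2 ? MODEL_H_NMI : MODEL_H_BAD_EVENT;
	case MODEL_EVENT_HWEXC:  return model_dispatch_hwexc(vector);
	case MODEL_EVENT_OTHER:  return model_dispatch_other(vector);
	}
	return MODEL_H_BAD_EVENT;
}
```

Unknown type/vector combinations fall through to a single "bad event" result,
in the spirit of the v1 note about fred_bad_event().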

Originally-by: Megha Dey 
Signed-off-by: H. Peter Anvin (Intel) 
Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Co-developed-by: Xin Li 
Signed-off-by: Xin Li 
---

Changes since v10:
* Replace "IS_ENABLED(CONFIG_IA32_EMULATION)" with the new ia32_enabled()
  API (Nikolay Borisov).

Changes since v9:
* Don't use jump tables, indirect jumps are expensive (Thomas Gleixner).
* Except NMI/#DB/#MCE, FRED really can share the exception handlers
  with IDT (Thomas Gleixner).
* Avoid the sysvec_* idt_entry muck, do it at a central place, reuse code
  instead of blindly copying it, which breaks the performance optimized
  sysvec entries like reschedule_ipi (Thomas Gleixner).
* Add asm_ prefix to FRED asm entry points (Thomas Gleixner).

Changes since v8:
* Don't do syscall early out in fred_entry_from_user() before there are
  proper performance numbers and justifications (Thomas Gleixner).
* Add the control exception handler to the FRED exception handler table
  (Thomas Gleixner).
* Add ENDBR to the FRED_ENTER asm macro.
* Reflect the FRED spec 5.0 change that ERETS and ERETU add 8 to %rsp
  before popping the return context from the stack.

Changes since v1:
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
  if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
  down into individual FRED exception handlers, instead of in the dispatch
  framework (Peter Zijlstra).
---
 arch/x86/entry/Makefile   |   5 +-
 arch/x86/entry/entry_64_fred.S|  52 ++
 arch/x86/entry/entry_fred.c   | 230 ++
 arch/x86/include/asm/asm-prototypes.h |   1 +
 arch/x86/include/asm/fred.h   |   6 +
 5 files changed, 293 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/entry/entry_64_fred.S
 create mode 100644 arch/x86/entry/entry_fred.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..c93e7f5c2a06 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -18,6 +18,9 @@ obj-y += vdso/
 obj-y  += vsyscall/
 
 obj-$(CONFIG_PREEMPTION)   += thunk_$(BITS).o
+CFLAGS_entry_fred.o+= -fno-stack-protector
+CFLAGS_REMOVE_entry_fred.o += -pg $(CC_FLAGS_FTRACE)
+obj-$(CONFIG_X86_FRED) += entry_64_fred.o entry_fred.o
+
 obj-$(CONFIG_IA32_EMULATION)   += entry_64_compat.o syscall_32.o
 obj-$(CONFIG_X86_X32_ABI)  += syscall_x32.o
-
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
new file mode 100644
index ..37a1dd5e8ace
--- /dev/null
+++ b/arch/x86/entry/entry_64_fred.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The actual FRED entry points.
+ */
+
+#include 
+
+#include "calling.h"
+
+   .code64
+   .section .noinstr.text, "ax"
+
+.macro FRED_ENTER
+	UNWIND_HINT_END_OF_STACK
+	ENDBR
+	PUSH_AND_CLEAR_REGS
+	movq	%rsp, %rdi	/* %rdi -> pt_regs */
+.endm
+
+.macro FRED_EXIT
+	UNWIND_HINT_REGS
+	POP_REGS
+.endm
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
+ * Thus the FRED ring 3 entry point must be 4K page aligned.
+ */
+   .align 4096
+
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
+	FRED_ENTER
+	call	fred_entry_from_user
+	FRED_EXIT
+	ERETU
+SYM_CODE_END(asm_fred_entrypoint_user)
+
+.fill asm_fred_entrypoint_kernel - ., 1, 0xcc
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
+ * ring 0, i.e., asm_fred_entrypoint_user + 256.
+ */
+   .org asm_fred_entrypoint_user + 256
+SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
+	FRED_ENTER
+	call	fred_entry_from_kernel
+	FRED_EXIT
+	ERETS
+SYM_CODE_END(asm_fred_entrypoint_kernel)
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
new file mode 100644
index ..215883e90f94
--- /dev/null
+++ b/arch/x86/entry/entry_fred.c
@@ -0,0 +1,230 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The FRED specific kernel/user entry functions which are invoked from
+ * assembly code and dispatch to the associated handlers.
+ */
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* FRED EVENT_TYPE_OTHER vector numbers */
+#define FRED_SYSCALL   1
+#define FRED_SYSENTER

[PATCH v13 30/35] x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code

2023-12-05 Thread Xin Li

From: "Peter Zijlstra (Intel)" 

PUSH_AND_CLEAR_REGS could be used besides actual entry code; in that case
%rbp shouldn't be cleared (otherwise the frame pointer is destroyed) and
UNWIND_HINT shouldn't be added.

Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/entry/calling.h | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index e59d3073e7cf..a023d9a97cd2 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -65,7 +65,7 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
+.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
 	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
@@ -87,14 +87,17 @@ For 32-bit we have the following conventions - kernel is built with
 	pushq	%r13		/* pt_regs->r13 */
 	pushq	%r14		/* pt_regs->r14 */
 	pushq	%r15		/* pt_regs->r15 */
+
+	.if \unwind_hint
 	UNWIND_HINT_REGS
+	.endif
 
.if \save_ret
pushq   %rsi/* return address on top of stack */
.endif
 .endm
 
-.macro CLEAR_REGS
+.macro CLEAR_REGS clear_bp=1
/*
 * Sanitize registers of values that a speculation attack might
 * otherwise want to exploit. The lower registers are likely clobbered
@@ -109,7 +112,9 @@ For 32-bit we have the following conventions - kernel is built with
 	xorl	%r10d, %r10d	/* nospec r10 */
 	xorl	%r11d, %r11d	/* nospec r11 */
 	xorl	%ebx,  %ebx	/* nospec rbx */
+	.if \clear_bp
 	xorl	%ebp,  %ebp	/* nospec rbp */
+	.endif
 	xorl	%r12d, %r12d	/* nospec r12 */
 	xorl	%r13d, %r13d	/* nospec r13 */
 	xorl	%r14d, %r14d	/* nospec r14 */
@@ -117,9 +122,9 @@ For 32-bit we have the following conventions - kernel is built with
 
 .endm
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
-	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret
-	CLEAR_REGS
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
+	PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
+	CLEAR_REGS clear_bp=\clear_bp
 .endm
 
 .macro POP_REGS pop_rdi=1
-- 
2.43.0

[PATCH v13 32/35] KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling

2023-12-05 Thread Xin Li

When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in
IRQ/NMI induced VM exits.

Tested-by: Shan Kang 
Signed-off-by: Xin Li 
Acked-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx/vmx.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..ba5cd26137e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -6962,14 +6963,16 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 {
u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-   gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;
 
kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	vmx_do_interrupt_irqoff(gate_offset(desc));
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
+	else
+		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
kvm_after_interrupt(vcpu);
 
vcpu->arch.at_instruction_boundary = true;
@@ -7262,7 +7265,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
is_nmi(vmx_get_intr_info(vcpu))) {
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-		vmx_do_nmi_irqoff();
+		if (cpu_feature_enabled(X86_FEATURE_FRED))
+			fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
+		else
+			vmx_do_nmi_irqoff();
kvm_after_interrupt(vcpu);
}
 
-- 
2.43.0

[PATCH v13 33/35] x86/syscall: Split IDT syscall setup code into idt_syscall_init()

2023-12-05 Thread Xin Li

Because FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER, and
ERETU is the only legitimate instruction to return to ring 3, there is no need
to set up the SYSCALL and SYSENTER MSRs for FRED, except the IA32_STAR MSR.

Split IDT syscall setup code into idt_syscall_init() to make it easy to
skip syscall setup code when FRED is enabled.
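
For reference, the IA32_STAR value that stays in the common path can be
worked out in user space. This is only a sketch: model_star_value() is a
hypothetical helper, and the selector values 0x10 (kernel CS) and 0x23
(32-bit user CS) are assumptions about the usual x86-64 GDT layout rather
than something stated in this patch. It models wrmsr(msr, low, high) by
placing the selector pair in the upper 32 bits.

```c
#include <stdint.h>

/* Assumed selector values for the usual x86-64 GDT layout. */
#define MODEL_KERNEL_CS	0x10u
#define MODEL_USER32_CS	0x23u

/*
 * Compose the 64-bit STAR value the way wrmsr(MSR_STAR, low, high) would:
 * low 32 bits are zero, high 32 bits carry the two selector bases.
 */
static uint64_t model_star_value(void)
{
	uint32_t low = 0;
	uint32_t high = (MODEL_USER32_CS << 16) | MODEL_KERNEL_CS;

	return ((uint64_t)high << 32) | low;
}
```

Under these assumptions the resulting value is 0x0023001000000000.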

Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---
 arch/x86/kernel/cpu/common.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 132f41f7c27f..9a075792e275 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2076,10 +2076,8 @@ static void wrmsrl_cstar(unsigned long val)
wrmsrl(MSR_CSTAR, val);
 }
 
-/* May not be marked __init: used by software suspend */
-void syscall_init(void)
+static inline void idt_syscall_init(void)
 {
-   wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
if (ia32_enabled()) {
@@ -2113,6 +2111,15 @@ void syscall_init(void)
   X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
 
+/* May not be marked __init: used by software suspend */
+void syscall_init(void)
+{
+   /* The default user and kernel segments */
+   wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
+
+   idt_syscall_init();
+}
+
 #else  /* CONFIG_X86_64 */
 
 #ifdef CONFIG_STACKPROTECTOR
-- 
2.43.0

[PATCH v13 31/35] x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI

2023-12-05 Thread Xin Li

In IRQ/NMI induced VM exits, KVM VMX needs to execute the respective
handlers, which requires software to create a FRED stack frame
and use it to invoke the handlers. Add fred_entry_from_kvm() for
this job.

Export fred_entry_from_kvm() because VMX can be compiled as a module.
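
The frame that software has to build can be modelled as a plain C struct,
together with the redzone and alignment step applied to the stack pointer
before pushing. This is a sketch under assumptions: struct model_fred_frame
and model_frame_top() are hypothetical names, the field order follows the
64-byte layout table in the asm comment of this patch (lowest address first),
and the redzone is expressed in 64-byte units as in
"FRED_CONFIG_REDZONE_AMOUNT << 6".

```c
#include <stddef.h>
#include <stdint.h>

/* 64-byte FRED stack frame, lowest address first (see lead-in). */
struct model_fred_frame {
	uint64_t error_code;	/* bytes  7:0  */
	uint64_t ip;		/* bytes 15:8  */
	uint64_t cs_aux;	/* bytes 23:16, CS + aux info */
	uint64_t flags;		/* bytes 31:24 */
	uint64_t sp;		/* bytes 39:32 */
	uint64_t ss_info;	/* bytes 47:40, SS + event info */
	uint64_t event_data;	/* bytes 55:48 */
	uint64_t reserved;	/* bytes 63:56, must be 0 */
};

_Static_assert(sizeof(struct model_fred_frame) == 64,
	       "a FRED stack frame is always 64 bytes");
_Static_assert(offsetof(struct model_fred_frame, event_data) == 48,
	       "event data lives at bytes 55:48");

/*
 * Compute where the frame starts being pushed: skip the redzone
 * (redzone_units * 64 bytes), then align down to a 64-byte boundary.
 */
static uint64_t model_frame_top(uint64_t rsp, unsigned int redzone_units)
{
	rsp -= (uint64_t)redzone_units << 6;
	rsp &= ~UINT64_C(0x3f);
	return rsp;
}
```

Because the frame is exactly one cache line and the start is 64-byte aligned,
the whole frame never straddles two cache lines.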

Suggested-by: Sean Christopherson 
Suggested-by: Thomas Gleixner 
Tested-by: Shan Kang 
Signed-off-by: Xin Li 
---

Changes since v10:
* Better explain the reason why no need to check current stack level
  (Paolo Bonzini).

Changes since v9:
* Shove the whole thing into arch/x86/entry/entry_64_fred.S for invoking
  external_interrupt() and fred_exc_nmi() (Sean Christopherson).
* Correct and improve a few comments (Sean Christopherson).
* Merge the two IRQ/NMI asm entries into one as it's fine to invoke
  noinstr code from regular code (Thomas Gleixner).
* Setup the long mode and NMI flags in the augmented SS field of FRED
  stack frame in C instead of asm (Thomas Gleixner).
* Add UNWIND_HINT_{SAVE,RESTORE} to get rid of the warning: "objtool:
  asm_fred_entry_from_kvm+0x0: unreachable instruction" (Peter Zijlstra).

Changes since v8:
* Add a new macro VMX_DO_FRED_EVENT_IRQOFF for FRED instead of
  refactoring VMX_DO_EVENT_IRQOFF (Sean Christopherson).
* Do NOT use a trampoline, just LEA+PUSH the return RIP, PUSH the error
  code, and jump to the FRED kernel entry point for NMI or call
  external_interrupt() for IRQs (Sean Christopherson).
* Call external_interrupt() only when FRED is enabled, and convert the
  non-FRED handling to external_interrupt() after FRED lands (Sean
  Christopherson).
---
 arch/x86/entry/entry_64_fred.S | 77 ++
 arch/x86/entry/entry_fred.c| 14 +++
 arch/x86/include/asm/fred.h| 18 
 3 files changed, 109 insertions(+)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index d1c2fc4af8ae..eedf98de7538 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -3,8 +3,11 @@
  * The actual FRED entry points.
  */
 
+#include 
+
 #include 
 #include 
+#include 
 
 #include "calling.h"
 
@@ -54,3 +57,77 @@ SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
FRED_EXIT
ERETS
 SYM_CODE_END(asm_fred_entrypoint_kernel)
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+SYM_FUNC_START(asm_fred_entry_from_kvm)
+   push %rbp
+   mov %rsp, %rbp
+
+   UNWIND_HINT_SAVE
+
+   /*
+* Both IRQ and NMI from VMX can be handled on current task stack
+* because there is no need to protect from reentrancy and the call
+* stack leading to this helper is effectively constant and shallow
+* (relatively speaking). Do the same when FRED is active, i.e., no
+* need to check current stack level for a stack switch.
+*
+* Emulate the FRED-defined redzone and stack alignment.
+*/
+   sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
+   and $FRED_STACK_FRAME_RSP_MASK, %rsp
+
+	/*
+	 * Start to push a FRED stack frame, which is always 64 bytes:
+	 *
+	 * +--------+-----------------+
+	 * | Bytes  | Usage           |
+	 * +--------+-----------------+
+	 * | 63:56  | Reserved        |
+	 * | 55:48  | Event Data      |
+	 * | 47:40  | SS + Event Info |
+	 * | 39:32  | RSP             |
+	 * | 31:24  | RFLAGS          |
+	 * | 23:16  | CS + Aux Info   |
+	 * |  15:8  | RIP             |
+	 * |   7:0  | Error Code      |
+	 * +--------+-----------------+
+	 */
+   push $0 /* Reserved, must be 0 */
+   push $0 /* Event data, 0 for IRQ/NMI */
+   push %rdi   /* fred_ss handed in by the caller */
+   push %rbp
+   pushf
+   mov $__KERNEL_CS, %rax
+   push %rax
+
+   /*
+* Unlike the IDT event delivery, FRED _always_ pushes an error code
+* after pushing the return RIP, thus the CALL instruction CANNOT be
+* used here to push the return RIP, otherwise there is no chance to
+* push an error code before invoking the IRQ/NMI handler.
+*
+* Use LEA to get the return RIP and push it, then push an error code.
+*/
+   lea 1f(%rip), %rax
+   push %rax   /* Return RIP */
+   push $0 /* Error code, 0 for IRQ/NMI */
+
+   PUSH_AND_CLEAR_REGS clear_bp=0 unwind_hint=0
+   movq %rsp, %rdi /* %rdi -> pt_regs */
+   call __fred_entry_from_kvm  /* Call the C entry point */
+   POP_REGS
+   ERETS
+1:
+   /*
+* Objtool doesn't understand what ERETS does, this hint tells it that
+* yes, we'll reach here and with what stack state. A save/restore pair
+* isn't strictly needed, but it's the simplest form.
+*/
+   UNWIND_HINT_RESTORE
+   pop %rbp
+   RET
+
+SYM_FUN

[PATCH v13 34/35] x86/fred: Add FRED initialization functions

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Add cpu_init_fred_exceptions() to:
  - Set FRED entrypoints for events happening in ring 0 and 3.
  - Specify the stack level for IRQs that occur in ring 0.
  - Specify dedicated event stacks for #DB/NMI/#MCE/#DF.
  - Enable FRED and invalidate the IDT.
  - Force 32-bit system calls to use "int $0x80" only.

Add fred_complete_exception_setup() to:
  - Initialize system_vectors as done for IDT systems.
  - Set unused sysvec_table entries to fred_handle_spurious_interrupt().
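
How dedicated stack levels for #DB/NMI/#MCE/#DF might be encoded can be
sketched as follows. Several things here are assumptions and not taken from
this patch: the packing of two bits per vector for vectors 0-31, the concrete
level numbers (1 for #DB, 2 for NMI and #MCE, 3 for #DF), and all MODEL_*
and model_* names, which are hypothetical. The lookup follows the rule stated
elsewhere in this series that a ring 3 event always lands on the level 0
stack; nested delivery from a higher current stack level is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: two bits of stack level per vector, vectors 0-31. */
#define MODEL_STKLVL(vector, lvl)	((uint64_t)(lvl) << (2 * (vector)))

/* Architectural exception vectors. */
#define MODEL_TRAP_DB	1
#define MODEL_TRAP_NMI	2
#define MODEL_TRAP_DF	8
#define MODEL_TRAP_MC	18

/* Illustrative level assignment (assumption, see lead-in). */
static uint64_t model_stklvls(void)
{
	return MODEL_STKLVL(MODEL_TRAP_DB,  1) |
	       MODEL_STKLVL(MODEL_TRAP_NMI, 2) |
	       MODEL_STKLVL(MODEL_TRAP_MC,  2) |
	       MODEL_STKLVL(MODEL_TRAP_DF,  3);
}

/* Stack level an event is delivered on, per the simplified rule above. */
static unsigned int model_event_stack_level(uint64_t stklvls,
					    unsigned int vector,
					    bool from_user)
{
	if (from_user)
		return 0;	/* ring 3 events use the current task stack */
	if (vector >= 32)
		return 0;	/* only vectors 0-31 have an entry here */
	return (unsigned int)((stklvls >> (2 * vector)) & 3);
}
```

Vectors without an explicit entry read back as level 0, so ordinary
exceptions such as #PF stay on the current task stack in this model.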

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Co-developed-by: Xin Li 
Signed-off-by: Xin Li 
---

Changes since v9:
* Set unused sysvec table entries to fred_handle_spurious_interrupt()
  in fred_complete_exception_setup() (Thomas Gleixner).

Changes since v5:
* Add a comment for FRED stack level settings (Lai Jiangshan).
* Define NMI/#DB/#MCE/#DF stack levels using macros.
---
 arch/x86/entry/entry_fred.c | 21 +
 arch/x86/include/asm/fred.h |  5 
 arch/x86/kernel/Makefile|  1 +
 arch/x86/kernel/fred.c  | 59 +
 4 files changed, 86 insertions(+)
 create mode 100644 arch/x86/kernel/fred.c

diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index 3e33a4ab4624..abe66d65fa2d 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -140,6 +140,27 @@ void __init fred_install_sysvec(unsigned int sysvec, idtentry_t handler)
 sysvec_table[sysvec - FIRST_SYSTEM_VECTOR] = handler;
 }
 
+static noinstr void fred_handle_spurious_interrupt(struct pt_regs *regs)
+{
+   spurious_interrupt(regs, regs->fred_ss.vector);
+}
+
+void __init fred_complete_exception_setup(void)
+{
+   unsigned int vector;
+
+   for (vector = 0; vector < FIRST_EXTERNAL_VECTOR; vector++)
+   set_bit(vector, system_vectors);
+
+   for (vector = 0; vector < NR_SYSTEM_VECTORS; vector++) {
+   if (sysvec_table[vector])
+   set_bit(vector + FIRST_SYSTEM_VECTOR, system_vectors);
+   else
+   sysvec_table[vector] = fred_handle_spurious_interrupt;
+   }
+   fred_setup_done = true;
+}
+
 static noinstr void fred_extint(struct pt_regs *regs)
 {
unsigned int vector = regs->fred_ss.vector;
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 2fa9f34e5c95..e86c7ba32435 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -83,8 +83,13 @@ static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int
asm_fred_entry_from_kvm(ss);
 }
 
+void cpu_init_fred_exceptions(void);
+void fred_complete_exception_setup(void);
+
 #else /* CONFIG_X86_FRED */
 static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; }
+static inline void cpu_init_fred_exceptions(void) { }
+static inline void fred_complete_exception_setup(void) { }
 static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
 #endif /* CONFIG_X86_FRED */
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 325ab98f..0dcbfc1a4c41 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -48,6 +48,7 @@ obj-y += platform-quirks.o
 obj-y  += process_$(BITS).o signal.o signal_$(BITS).o
 obj-y  += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y  += time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_X86_FRED) += fred.o
 obj-$(CONFIG_MODIFY_LDT_SYSCALL)   += ldt.o
 obj-$(CONFIG_X86_KERNEL_IBT)   += ibt_selftest.o
 obj-y  += setup.o x86_init.o i8259.o irqinit.o
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
new file mode 100644
index ..4bcd8791ad96
--- /dev/null
+++ b/arch/x86/kernel/fred.c
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+/* #DB in the kernel would imply the use of a kernel debugger. */
+#define FRED_DB_STACK_LEVEL    1UL
+#define FRED_NMI_STACK_LEVEL   2UL
+#define FRED_MC_STACK_LEVEL    2UL
+/*
+ * #DF is the highest level because a #DF means "something went wrong
+ * *while delivering an exception*." The number of cases for which that
+ * can happen with FRED is drastically reduced and basically amounts to
+ * "the stack you pointed me to is broken." Thus, always change stacks
+ * on #DF, which means it should be at the highest level.
+ */
+#define FRED_DF_STACK_LEVEL    3UL
+
+#define FRED_STKLVL(vector, lvl)   ((lvl) << (2 * (vector)))
+
+void cpu_init_fred_exceptions(void)
+{
+   /* When FRED is enabled by default, remove this log message */
+   pr_info("Initialize FRED on CPU%d\n", smp_processor_id());
+
+   wrmsrl(MSR_IA32_FRED_CONFIG,
+  /* Reserve for CALL emulation */
+  FRED_CONFIG_REDZONE |
+
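
For illustration only (not part of the patch): FRED_STKLVL() places a 2-bit
stack level at bit position 2 * vector, so the combined value for the four
exceptions above can be reproduced in user space. This is a minimal sketch
under stated assumptions: the vector numbers (#DB=1, NMI=2, #DF=8, #MC=18)
are the architectural ones, while the helper names and the use of uint64_t
in place of unsigned long are choices made for this sketch. How the patch
consumes the value is not visible in the truncated diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 exception vector numbers. */
#define X86_TRAP_DB    1
#define X86_TRAP_NMI   2
#define X86_TRAP_DF    8
#define X86_TRAP_MC   18

/* Stack levels as defined in the patch. */
#define FRED_DB_STACK_LEVEL   1ULL
#define FRED_NMI_STACK_LEVEL  2ULL
#define FRED_MC_STACK_LEVEL   2ULL
#define FRED_DF_STACK_LEVEL   3ULL

/* Same shape as the patch's macro: two bits per vector. */
#define FRED_STKLVL(vector, lvl)  ((uint64_t)(lvl) << (2 * (vector)))

/* Hypothetical helper: OR the per-vector fields into one 64-bit value. */
static uint64_t fred_stklvls_sketch(void)
{
	return FRED_STKLVL(X86_TRAP_DB,  FRED_DB_STACK_LEVEL)  |
	       FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
	       FRED_STKLVL(X86_TRAP_MC,  FRED_MC_STACK_LEVEL)  |
	       FRED_STKLVL(X86_TRAP_DF,  FRED_DF_STACK_LEVEL);
}

/* Hypothetical helper: read back the 2-bit level of one vector. */
static unsigned int fred_stklvl_of(uint64_t stklvls, unsigned int vector)
{
	assert(vector < 32);	/* 32 vectors * 2 bits fill the 64-bit value */
	return (unsigned int)((stklvls >> (2 * vector)) & 0x3);
}
```

Vectors that are not listed keep level 0, which is why only the four
exceptions that need a dedicated stack appear in the expression.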

[PATCH v13 35/35] x86/fred: Invoke FRED initialization code to enable FRED

2023-12-05 Thread Xin Li

From: "H. Peter Anvin (Intel)" 

Let cpu_init_exception_handling() call cpu_init_fred_exceptions() to
initialize FRED. If FRED is unavailable or disabled, it falls back to
setting up the TSS IST and initializing the IDT.

Signed-off-by: H. Peter Anvin (Intel) 
Tested-by: Shan Kang 
Co-developed-by: Xin Li 
Signed-off-by: Xin Li 
---

Changes since v10:
* No need to invalidate SYSCALL and SYSENTER MSRs (Thomas Gleixner).

Changes since v8:
* Move this patch after all required changes are in place (Thomas
  Gleixner).
---
 arch/x86/kernel/cpu/common.c | 22 +-
 arch/x86/kernel/irqinit.c|  7 ++-
 arch/x86/kernel/traps.c  |  5 -
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9a075792e275..91d2f6018c48 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -61,6 +61,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2117,7 +2118,15 @@ void syscall_init(void)
/* The default user and kernel segments */
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 
-   idt_syscall_init();
+   /*
+* Except the IA32_STAR MSR, there is NO need to setup SYSCALL and
+* SYSENTER MSRs for FRED, because FRED uses the ring 3 FRED
+* entrypoint for SYSCALL and SYSENTER, and ERETU is the only legit
+* instruction to return to ring 3 (both sysexit and sysret cause
+* #UD when FRED is enabled).
+*/
+   if (!cpu_feature_enabled(X86_FEATURE_FRED))
+   idt_syscall_init();
 }
 
 #else  /* CONFIG_X86_64 */
@@ -2223,8 +2232,9 @@ void cpu_init_exception_handling(void)
/* paranoid_entry() gets the CPU number from the GDT */
setup_getcpu(cpu);
 
-   /* IST vectors need TSS to be set up. */
-   tss_setup_ist(tss);
+   /* For IDT mode, IST vectors need to be set in TSS. */
+   if (!cpu_feature_enabled(X86_FEATURE_FRED))
+   tss_setup_ist(tss);
tss_setup_io_bitmap(tss);
set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
@@ -2233,8 +2243,10 @@ void cpu_init_exception_handling(void)
/* GHCB needs to be setup to handle #VC. */
setup_ghcb();
 
-   /* Finally load the IDT */
-   load_current_idt();
+   if (cpu_feature_enabled(X86_FEATURE_FRED))
+   cpu_init_fred_exceptions();
+   else
+   load_current_idt();
 }
 
 /*
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index c683666876f1..f79c5edc0b89 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -96,7 +97,11 @@ void __init native_init_IRQ(void)
/* Execute any quirks before the call gates are initialised: */
x86_init.irqs.pre_vector_init();
 
-   idt_setup_apic_and_irq_gates();
+   if (cpu_feature_enabled(X86_FEATURE_FRED))
+   fred_complete_exception_setup();
+   else
+   idt_setup_apic_and_irq_gates();
+
lapic_assign_system_vectors();
 
if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 848c85208a57..0ee78a30e14a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1411,7 +1411,10 @@ void __init trap_init(void)
 
/* Initialize TSS before setting up traps so ISTs work */
cpu_init_exception_handling();
+
/* Setup traps as cpu_init() might #GP */
-   idt_setup_traps();
+   if (!cpu_feature_enabled(X86_FEATURE_FRED))
+   idt_setup_traps();
+
cpu_init();
 }
-- 
2.43.0

Re: [PATCH v9 2/2] arm64: boot: Support Flat Image Tree

2023-12-05 Thread Ahmad Fatoum

Hello Simon,

On 02.12.23 04:54, Simon Glass wrote:
> Add a script which produces a Flat Image Tree (FIT), a single file
> containing the built kernel and associated devicetree files.
> Compression defaults to gzip which gives a good balance of size and
> performance.
> 
> The files compress from about 86MB to 24MB using this approach.
> 
> The FIT can be used by bootloaders which support it, such as U-Boot
> and Linuxboot. It permits automatic selection of the correct
> devicetree, matching the compatible string of the running board with
> the closest compatible string in the FIT. There is no need for
> filenames or other workarounds.
> 
> Add a 'make image.fit' build target for arm64, as well. Use
> FIT_COMPRESSION to select a different algorithm.
> 
> The FIT can be examined using 'dumpimage -l'.
> 
> This feature requires pylibfdt (use 'pip install libfdt'). It also
> requires compression utilities for the algorithm being used. Supported
> compression options are the same as the Image.xxx files. For now there
> is no way to change the compression other than by editing the rule for
> $(obj)/image.fit
> 
> While FIT supports a ramdisk / initrd, no attempt is made to support
> this here, since it must be built separately from the Linux build.
> 
> Signed-off-by: Simon Glass 

kernel_noload support is now in barebox next branch and I tested this
series against it:

Tested-by: Ahmad Fatoum  # barebox

> +"""Build a FIT containing a lot of devicetree files
> +
> +Usage:
> +make_fit.py -A arm64 -n 'Linux-6.6' -O linux
> +-f arch/arm64/boot/image.fit -k /tmp/kern/arch/arm64/boot/image.itk
> +/tmp/kern/arch/arm64/boot/dts/ -E -c gzip
> +
> +Creates a FIT containing the supplied kernel and a directory containing the
> +devicetree files.
> +
> +Use -E to generate an external FIT (where the data is placed after the
> +FIT data structure). This allows parsing of the data without loading
> +the entire FIT.
> +
> +Use -c to compress the data, using bzip2, gzip, lz4, lzma, lzo and
> +zstd algorithms.
> +
> +The resulting FIT can be booted by bootloaders which support FIT, such
> +as U-Boot, Linuxboot, Tianocore, etc.

Feel free to add barebox to the list. Did you check whether Linuxboot and
Tianocore support kernel_noload?

> +fsw.property_u32('load', 0)
> +fsw.property_u32('entry', 0)

I still think load and entry dummy values are confusing and should be dropped.

> +with fsw.add_node(f'fdt-{seq}'):
> +# Get the compatible / model information
> +with open(fname, 'rb') as inf:
> +data = inf.read()
> +fdt = libfdt.FdtRo(data)
> +model = fdt.getprop(0, 'model').as_str()
> +compat = fdt.getprop(0, 'compatible')
> +
> +fsw.property_string('description', model)
> +fsw.property_string('type', 'flat_dt')
> +fsw.property_string('arch', arch)
> +fsw.property_string('compression', compress)
> +fsw.property('compatible', bytes(compat))
> +
> +with open(fname, 'rb') as inf:
> +compressed = compress_data(inf, compress)
> +fsw.property('data', compressed)
> +return model, compat

After Doug's elaboration, extracting multiple compatibles is fine by me.

Cheers,
Ahmad

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |

Re: [PATCH v13 26/35] x86/fred: FRED entry/exit and dispatch code

2023-12-05 Thread Andrew Cooper

On 05/12/2023 10:50 am, Xin Li wrote:
> diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
> new file mode 100644
> index ..215883e90f94
> --- /dev/null
> +++ b/arch/x86/entry/entry_fred.c
> @@ -0,0 +1,230 @@
> ...
> +static noinstr void fred_intx(struct pt_regs *regs)
> +{
> + switch (regs->fred_ss.vector) {
> + /* INT0 */

INTO (for overflow), not INT-zero.  However...

> + case X86_TRAP_OF:
> + exc_overflow(regs);
> + return;
> +
> + /* INT3 */
> + case X86_TRAP_BP:
> + exc_int3(regs);
> + return;

... neither OF nor BP will ever enter fred_intx() because they're type
SWEXC not SWINT.

SWINT is strictly the INT $imm8 instruction.

> ...
> +static noinstr void fred_extint(struct pt_regs *regs)
> +{
> + unsigned int vector = regs->fred_ss.vector;
> +
> + if (WARN_ON_ONCE(vector < FIRST_EXTERNAL_VECTOR))
> + return;
> +
> + if (likely(vector >= FIRST_SYSTEM_VECTOR)) {
> + irqentry_state_t state = irqentry_enter(regs);
> +
> + instrumentation_begin();
> + sysvec_table[vector - FIRST_SYSTEM_VECTOR](regs);

array_index_mask_nospec()

This is easy for an attacker to abuse, to install non-function-pointer
targets into the indirect predictor.
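
To make the idea concrete, here is a minimal user-space sketch of branchless
index masking, the technique behind the kernel's array_index_nospec() and
array_index_mask_nospec() helpers. Everything named *_sketch, handler_t and
lookup_handler() is hypothetical and only for illustration; a real fix would
use the kernel helpers rather than this code. The sketch assumes an
arithmetic right shift for negative signed values (as GCC and Clang provide)
and a table size no larger than LONG_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH  (sizeof(long) * CHAR_BIT)

/*
 * Returns all ones when index < size and zero otherwise, without a
 * conditional branch that the CPU could mispredict.
 */
static unsigned long index_mask_nospec_sketch(unsigned long index,
					      unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_LONG_SKETCH - 1));
}

typedef void (*handler_t)(void);

/* Hypothetical table lookup: bounds check, then clamp before the load. */
static handler_t lookup_handler(handler_t *table, unsigned long nr,
				unsigned long idx)
{
	assert(nr > 0);
	if (idx >= nr)
		return NULL;
	/* Even if the branch above is mispredicted, idx becomes 0 here. */
	idx &= index_mask_nospec_sketch(idx, nr);
	return table[idx];
}
```

The mask forces an out-of-range index to 0 under speculation, so a
speculative load can only ever read a valid table slot.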

> + instrumentation_end();
> + irqentry_exit(regs, state);
> + } else {
> + common_interrupt(regs, vector);
> + }
> +}
> +
> +static noinstr void fred_exception(struct pt_regs *regs, unsigned long error_code)
> +{
> + /* Optimize for #PF. That's the only exception which matters performance wise */
> + if (likely(regs->fred_ss.vector == X86_TRAP_PF)) {
> + exc_page_fault(regs, error_code);
> + return;
> + }
> +
> + switch (regs->fred_ss.vector) {
> + case X86_TRAP_DE: return exc_divide_error(regs);
> + case X86_TRAP_DB: return fred_exc_debug(regs);
> + case X86_TRAP_BP: return exc_int3(regs);
> + case X86_TRAP_OF: return exc_overflow(regs);

Depending on what you want to do with BP/OF vs fred_intx(), this may
need adjusting.

If you are cross-checking type and vector, then these should be rejected
for not being of type HWEXC.

> + case X86_TRAP_BR: return exc_bounds(regs);
> + case X86_TRAP_UD: return exc_invalid_op(regs);
> + case X86_TRAP_NM: return exc_device_not_available(regs);
> + case X86_TRAP_DF: return exc_double_fault(regs, error_code);
> + case X86_TRAP_TS: return exc_invalid_tss(regs, error_code);
> + case X86_TRAP_NP: return exc_segment_not_present(regs, error_code);
> + case X86_TRAP_SS: return exc_stack_segment(regs, error_code);
> + case X86_TRAP_GP: return exc_general_protection(regs, error_code);
> + case X86_TRAP_MF: return exc_coprocessor_error(regs);
> + case X86_TRAP_AC: return exc_alignment_check(regs, error_code);
> + case X86_TRAP_XF: return exc_simd_coprocessor_error(regs);
> +
> +#ifdef CONFIG_X86_MCE
> + case X86_TRAP_MC: return fred_exc_machine_check(regs);
> +#endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> + case X86_TRAP_VE: return exc_virtualization_exception(regs);
> +#endif
> +#ifdef CONFIG_X86_KERNEL_IBT

CONFIG_X86_CET

Userspace can use CET even if the kernel isn't compiled with IBT, so
this exception needs handling.

> + case X86_TRAP_CP: return exc_control_protection(regs, error_code);
> +#endif
> + default: return fred_bad_type(regs, error_code);
> + }
> +}
> +
> +__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
> +{
> + unsigned long error_code = regs->orig_ax;
> +
> + /* Invalidate orig_ax so that syscall_get_nr() works correctly */
> + regs->orig_ax = -1;
> +
> + switch (regs->fred_ss.type) {
> + case EVENT_TYPE_EXTINT:
> + return fred_extint(regs);
> + case EVENT_TYPE_NMI:
> + return fred_exc_nmi(regs);
> + case EVENT_TYPE_SWINT:
> + return fred_intx(regs);
> + case EVENT_TYPE_HWEXC:
> + case EVENT_TYPE_SWEXC:
> + case EVENT_TYPE_PRIV_SWEXC:
> + return fred_exception(regs, error_code);

PRIV_SWEXC should have its own function and not fall into fred_exception().

It is strictly only the ICEBP (INT1) instruction at the moment, so
should fall into bad_type() for any vector other than X86_TRAP_DB.
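
The type/vector cross-checks suggested in this review can be summarized in a
small sketch. It is only an illustration of the rules stated above (SWEXC is
INT3/INTO, PRIV_SWEXC is ICEBP with vector #DB, and #BP/#OF are not HWEXC).
The enum, its numbering, and fred_type_vector_plausible() are hypothetical
names for this sketch and are not taken from the patch; the real event type
encodings come from the FRED spec.

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural x86 exception vector numbers used below. */
#define X86_TRAP_DB  1
#define X86_TRAP_BP  3
#define X86_TRAP_OF  4

/* Hypothetical event types; numbering here is arbitrary. */
enum event_type_sketch {
	SKETCH_TYPE_EXTINT,
	SKETCH_TYPE_NMI,
	SKETCH_TYPE_HWEXC,
	SKETCH_TYPE_SWINT,
	SKETCH_TYPE_PRIV_SWEXC,
	SKETCH_TYPE_SWEXC,
	SKETCH_TYPE_OTHER,
};

/* Returns true when the (type, vector) pair is consistent with the review. */
static bool fred_type_vector_plausible(enum event_type_sketch type,
				       unsigned int vector)
{
	switch (type) {
	case SKETCH_TYPE_SWEXC:
		/* Only INT3 and INTO are delivered as software exceptions. */
		return vector == X86_TRAP_BP || vector == X86_TRAP_OF;
	case SKETCH_TYPE_PRIV_SWEXC:
		/* Currently only ICEBP (INT1), which delivers #DB. */
		return vector == X86_TRAP_DB;
	case SKETCH_TYPE_HWEXC:
		/* #BP and #OF never arrive as hardware exceptions. */
		return vector != X86_TRAP_BP && vector != X86_TRAP_OF;
	default:
		/* Other types are not cross-checked in this sketch. */
		return true;
	}
}
```

A dispatcher built this way would route any implausible pair to the bad-type
path instead of the regular exception handlers.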

> + case EVENT_TYPE_OTHER:
> + return fred_other(regs);
> + default:
> + return fred_bad_type(regs, error_code);
> + }
> +}

~Andrew

Re: [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs

2023-12-05 Thread Maxim Levitsky

On Fri, 2023-12-01 at 16:31 +, Nicolas Saenz Julienne wrote:
> On Tue Nov 28, 2023 at 7:14 AM UTC, Maxim Levitsky wrote:
> > On Wed, 2023-11-08 at 11:17 +, Nicolas Saenz Julienne wrote:
> > > HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a
> > > specific VTL. Honour the requests.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne 
> > > ---
> > >  arch/x86/kvm/hyperv.c | 24 +---
> > >  arch/x86/kvm/trace.h  | 20 
> > >  include/asm-generic/hyperv-tlfs.h |  6 --
> > >  3 files changed, 33 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > index d4b1b53ea63d..2cf430f6ddd8 100644
> > > --- a/arch/x86/kvm/hyperv.c
> > > +++ b/arch/x86/kvm/hyperv.c
> > > @@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> > >  }
> > > 
> > >  static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> > > - u64 *sparse_banks, u64 valid_bank_mask)
> > > + u64 *sparse_banks, u64 valid_bank_mask, int vtl)
> > >  {
> > >   struct kvm_lapic_irq irq = {
> > >   .delivery_mode = APIC_DM_FIXED,
> > > @@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> > >   valid_bank_mask, sparse_banks))
> > >   continue;
> > > 
> > > + if (kvm_hv_get_active_vtl(vcpu) != vtl)
> > > + continue;
> > 
> > Do I understand correctly that this is a temporary limitation?
> > In other words, can a vCPU running in VTL1 send an IPI to a vCPU running
> > in VTL0, forcing the target vCPU to do an async switch to VTL1?
> > I think that this is possible.
> 
> The diff is missing some context. See this simplified implementation
> (when all_cpus is set in the parent function):
> 
> static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector, int vtl)
> {
>   [...]
>   kvm_for_each_vcpu(i, vcpu, kvm) {
>   if (kvm_hv_get_active_vtl(vcpu) != vtl)
>   continue;
> 
>   kvm_apic_set_irq(vcpu, &irq, NULL);
>   }
> }
> 
> With the one vCPU per VTL approach, kvm_for_each_vcpu() will iterate
> over *all* vCPUs regardless of their VTL. The IPI is targeted at a
> specific VTL, hence the need to filter.
> 
> VTL1 -> VTL0 IPIs are supported and happen (although they are extremely
> rare).

Makes sense now, thanks!

Best regards,
Maxim Levitsky

> 
> Nicolas
>

Re: [PATCH] Documentaion:trace Add the git web link to view it on the browser

2023-12-05 Thread Steven Rostedt

On Tue,  5 Dec 2023 09:25:17 +0530
Bhaskar Chowdhury  wrote:

> Thought this might help people to see the entire source tree on browser and
> explore.
> 
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  Documentation/trace/ftrace.rst | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
> index 23572f6697c0..e768a4c91452 100644
> --- a/Documentation/trace/ftrace.rst
> +++ b/Documentation/trace/ftrace.rst
> @@ -3731,3 +3731,5 @@ Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
>  More
>  
>  More details can be found in the source code, in the `kernel/trace/*.c` files.
> +Also you can see the trace source tree on browser `Git Link
> +`__.
> --
> 2.35.8

I'm not a big fan of git links to code in the documentation tree. This is
more for those that want to play with it (compile and install), so I don't
think a web link is useful here. I'd rather have people download the code and
build it.

-- Steve

RE: [PATCH v13 26/35] x86/fred: FRED entry/exit and dispatch code

2023-12-05 Thread Li, Xin3

> > +static noinstr void fred_intx(struct pt_regs *regs) {
> > +   switch (regs->fred_ss.vector) {
> > +   /* INT0 */
> 
> INTO (for overflow), not INT-zero.  However...
> 
> > +   case X86_TRAP_OF:
> > +   exc_overflow(regs);
> > +   return;
> > +
> > +   /* INT3 */
> > +   case X86_TRAP_BP:
> > +   exc_int3(regs);
> > +   return;
> 
> ... neither OF nor BP will ever enter fred_intx() because they're type SWEXC
> not SWINT.
> 
> SWINT is strictly the INT $imm8 instruction.
> 
> > ...
> > +static noinstr void fred_extint(struct pt_regs *regs) {
> > +   unsigned int vector = regs->fred_ss.vector;
> > +
> > +   if (WARN_ON_ONCE(vector < FIRST_EXTERNAL_VECTOR))
> > +   return;
> > +
> > +   if (likely(vector >= FIRST_SYSTEM_VECTOR)) {
> > +   irqentry_state_t state = irqentry_enter(regs);
> > +
> > +   instrumentation_begin();
> > +   sysvec_table[vector - FIRST_SYSTEM_VECTOR](regs);
> 
> array_index_mask_nospec()
> 
> This is easy for an attacker to abuse, to install non-function-pointer
> targets into the indirect predictor.
> 
> > +   instrumentation_end();
> > +   irqentry_exit(regs, state);
> > +   } else {
> > +   common_interrupt(regs, vector);
> > +   }
> > +}
> > +
> > +static noinstr void fred_exception(struct pt_regs *regs, unsigned long error_code)
> > +{
> > +   /* Optimize for #PF. That's the only exception which matters performance wise */
> > +   if (likely(regs->fred_ss.vector == X86_TRAP_PF)) {
> > +   exc_page_fault(regs, error_code);
> > +   return;
> > +   }
> > +
> > +   switch (regs->fred_ss.vector) {
> > +   case X86_TRAP_DE: return exc_divide_error(regs);
> > +   case X86_TRAP_DB: return fred_exc_debug(regs);
> > +   case X86_TRAP_BP: return exc_int3(regs);
> > +   case X86_TRAP_OF: return exc_overflow(regs);
> 
> Depending on what you want to do with BP/OF vs fred_intx(), this may need
> adjusting.
> 
> If you are cross-checking type and vector, then these should be rejected for
> not being of type HWEXC.
> 
> > +   case X86_TRAP_BR: return exc_bounds(regs);
> > +   case X86_TRAP_UD: return exc_invalid_op(regs);
> > +   case X86_TRAP_NM: return exc_device_not_available(regs);
> > +   case X86_TRAP_DF: return exc_double_fault(regs, error_code);
> > +   case X86_TRAP_TS: return exc_invalid_tss(regs, error_code);
> > +   case X86_TRAP_NP: return exc_segment_not_present(regs, error_code);
> > +   case X86_TRAP_SS: return exc_stack_segment(regs, error_code);
> > +   case X86_TRAP_GP: return exc_general_protection(regs, error_code);
> > +   case X86_TRAP_MF: return exc_coprocessor_error(regs);
> > +   case X86_TRAP_AC: return exc_alignment_check(regs, error_code);
> > +   case X86_TRAP_XF: return exc_simd_coprocessor_error(regs);
> > +
> > +#ifdef CONFIG_X86_MCE
> > +   case X86_TRAP_MC: return fred_exc_machine_check(regs);
> > +#endif
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +   case X86_TRAP_VE: return exc_virtualization_exception(regs);
> > +#endif
> > +#ifdef CONFIG_X86_KERNEL_IBT
> 
> CONFIG_X86_CET
> 
> Userspace can use CET even if the kernel isn't compiled with IBT, so this
> exception needs handling.
> 
> > +   case X86_TRAP_CP: return exc_control_protection(regs, error_code);
> > +#endif
> > +   default: return fred_bad_type(regs, error_code);
> > +   }
> > +}
> > +
> > +__visible noinstr void fred_entry_from_user(struct pt_regs *regs) {
> > +   unsigned long error_code = regs->orig_ax;
> > +
> > +   /* Invalidate orig_ax so that syscall_get_nr() works correctly */
> > +   regs->orig_ax = -1;
> > +
> > +   switch (regs->fred_ss.type) {
> > +   case EVENT_TYPE_EXTINT:
> > +   return fred_extint(regs);
> > +   case EVENT_TYPE_NMI:
> > +   return fred_exc_nmi(regs);
> > +   case EVENT_TYPE_SWINT:
> > +   return fred_intx(regs);
> > +   case EVENT_TYPE_HWEXC:
> > +   case EVENT_TYPE_SWEXC:
> > +   case EVENT_TYPE_PRIV_SWEXC:
> > +   return fred_exception(regs, error_code);
> 
> PRIV_SWEXC should have its own function and not fall into fred_exception().
> 
> It is strictly only the ICEBP (INT1) instruction at the moment, so should
> fall into bad_type() for any vector other than X86_TRAP_DB.
> 
> > +   case EVENT_TYPE_OTHER:
> > +   return fred_other(regs);
> > +   default:
> > +   return fred_bad_type(regs, error_code);
> > +   }
> > +}
> 
> ~Andrew


Thanks a lot for your quick review, will address soon.
Xin

Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-05 Thread Sean Christopherson

On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > To support this I think that we can add a userspace msr filter on
> > > > > > the HV_X64_MSR_HYPERCALL, although I am not 100% sure if a
> > > > > > userspace msr filter overrides the in-kernel msr handling.
> > > > >
> > > > > I thought about it at the time. It's not that simple though, we should
> > > > > still let KVM set the hypercall bytecode, and other quirks like the
> > > > > Xen one.
> > > >
> > > > Yeah, that Xen quirk is quite the killer.
> > > >
> > > > Can you provide pseudo-assembly for what the final page is supposed to
> > > > look like? I'm struggling mightily to understand what this is actually
> > > > trying to do.
> > >
> > > I'll make it as simple as possible (disregard 32bit support and that xen
> > > exists):
> > >
> > > vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> > > ret
> > > mov rax,rcx  <-  VTL call hypercall enters here
> >
> > I'm missing who/what defines "here" though.  What generates the CALL that
> > points at this exact offset?  If the exact offset is dictated in the TLFS,
> > then aren't we screwed with the whole Xen quirk, which inserts 5 bytes
> > before that first VMCALL?
> 
> Yes, sorry, I should've included some more context.
> 
> Here's a rundown (from memory) of how the first VTL call happens:
>  - CPU0 starts running at VTL0.
>  - Hyper-V enables VTL1 on the partition.
>  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
>    the initial VTL1 CPU state alongside the enablement hypercall
>    arguments.
>  - Hyper-V sets the Hypercall page overlay address through
>    HV_X64_MSR_HYPERCALL. KVM fills it.
>  - Hyper-V gets the VTL-call and VTL-return offset into the hypercall
>    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
>    handling is in user-space).

Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
a HvSetVpRegisters() hypercall.

I don't see a sane way to handle this in KVM if userspace handles
HvSetVpRegisters().  E.g. if the guest requests offsets that don't leave
enough room for KVM to shove in its data, then presumably userspace needs to
reject HvSetVpRegisters().  But that requires userspace to know exactly how
many bytes KVM is going to write at each offset.

My vote is to have userspace do all the patching.  IIUC, all of this is going to
be mutually exclusive with kvm_xen_hypercall_enabled(), i.e. userspace doesn't
need to worry about setting RAX[31].  At that point, it's just VMCALL versus
VMMCALL, and userspace is more than capable of identifying whether it's running
on Intel or AMD.
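
As a rough sketch of what "just VMCALL versus VMMCALL" could look like in
userspace, the snippet below writes the regular hypercall stub (the
"vmcall; ret" at offset 0 from the pseudo-assembly quoted earlier) into a
buffer. The instruction encodings (VMCALL = 0F 01 C1, VMMCALL = 0F 01 D9,
RET = C3) are architectural; the function name, its parameters, and the way
the vendor is passed in are assumptions made for this illustration. The VTL
call/return thunks are not sketched because their exact contents are not
given in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypercall instruction encodings for Intel (VMCALL) and AMD (VMMCALL). */
static const uint8_t vmcall_insn[]  = { 0x0f, 0x01, 0xc1 };
static const uint8_t vmmcall_insn[] = { 0x0f, 0x01, 0xd9 };

/*
 * Hypothetical helper: write "<hypercall insn>; ret" at the start of the
 * page and return the number of bytes written, i.e. the first free offset.
 */
static size_t write_hypercall_stub(uint8_t *page, size_t len, bool host_is_amd)
{
	const uint8_t *insn = host_is_amd ? vmmcall_insn : vmcall_insn;
	size_t n = sizeof(vmcall_insn);	/* both encodings are 3 bytes */

	assert(len >= n + 1);
	memcpy(page, insn, n);
	page[n] = 0xc3;			/* ret */
	return n + 1;
}
```

Because userspace writes every byte itself in this model, it also knows
exactly where the stub ends and where any further thunks could start.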

>  - Hyper-V performs the first VTL-call, and has all it needs to move
>    between VTL0/1.
> 
> Nicolas

Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-05 Thread Maxim Levitsky

On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > To support this I think that we can add a userspace msr filter on 
> > > > > > > the HV_X64_MSR_HYPERCALL,
> > > > > > > although I am not 100% sure if a userspace msr filter overrides 
> > > > > > > the in-kernel msr handling.
> > > > > > 
> > > > > > I thought about it at the time. It's not that simple though, we 
> > > > > > should
> > > > > > still let KVM set the hypercall bytecode, and other quirks like the 
> > > > > > Xen
> > > > > > one.
> > > > > 
> > > > > Yeah, that Xen quirk is quite the killer.
> > > > > 
> > > > > Can you provide pseudo-assembly for what the final page is supposed 
> > > > > to look like?
> > > > > I'm struggling mightily to understand what this is actually trying to 
> > > > > do.
> > > > 
> > > > I'll make it as simple as possible (disregard 32bit support and that xen
> > > > exists):
> > > > 
> > > > vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > ret
> > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > 
> > > I'm missing who/what defines "here" though.  What generates the CALL that 
> > > points
> > > at this exact offset?  If the exact offset is dictated in the TLFS, then 
> > > aren't
> > > we screwed with the whole Xen quirk, which inserts 5 bytes before that 
> > > first VMCALL?
> > 
> > Yes, sorry, I should've included some more context.
> > 
> > Here's a rundown (from memory) of how the first VTL call happens:
> >  - CPU0 starts running at VTL0.
> >  - Hyper-V enables VTL1 on the partition.
> >  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
> >    the initial VTL1 CPU state alongside the enablement hypercall
> >    arguments.
> >  - Hyper-V sets the Hypercall page overlay address through
> >    HV_X64_MSR_HYPERCALL. KVM fills it.
> >  - Hyper-V gets the VTL-call and VTL-return offset into the hypercall
> >    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> >    handling is in user-space).
> 
> Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets 
> via
> a HvSetVpRegisters() hypercall.

No, you didn't understand this correctly.

The guest writes HV_X64_MSR_HYPERCALL, and in response Hyper-V fills the
hypercall page, including the VSM thunks.

Then the guest can _read_ the offsets that Hyper-V chose there, by issuing
another hypercall.

In the current implementation, the offsets that the kernel chooses are first
exposed to userspace via a new ioctl, and then userspace exposes these
offsets to the guest via that 'another hypercall' (reading a pseudo partition
register 'HvRegisterVsmCodePageOffsets').

I personally don't know for sure anymore whether the userspace or the kernel
based hypercall page is better here, it's ugly regardless :(


Best regards,
Maxim Levitsky

> 
> I don't see a sane way to handle this in KVM if userspace handles
> HvSetVpRegisters().  E.g. if the guest requests offsets that don't leave
> enough room for KVM to shove in its data, then presumably userspace needs to
> reject HvSetVpRegisters().  But that requires userspace to know exactly how
> many bytes KVM is going to write at each offset.
> 
> My vote is to have userspace do all the patching.  IIUC, all of this is going
> to be mutually exclusive with kvm_xen_hypercall_enabled(), i.e. userspace
> doesn't need to worry about setting RAX[31].  At that point, it's just VMCALL
> versus VMMCALL, and userspace is more than capable of identifying whether it's
> running on Intel or AMD.
> 
> >  - Hyper-V performs the first VTL-call, and has all it needs to move
> >between VTL0/1.
> > 
> > Nicolas

Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-05 Thread Sean Christopherson

On Tue, Dec 05, 2023, Maxim Levitsky wrote:
> On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > > To support this I think that we can add a userspace msr filter 
> > > > > > > > on the HV_X64_MSR_HYPERCALL,
> > > > > > > > although I am not 100% sure if a userspace msr filter overrides 
> > > > > > > > the in-kernel msr handling.
> > > > > > > 
> > > > > > > I thought about it at the time. It's not that simple though, we 
> > > > > > > should
> > > > > > > still let KVM set the hypercall bytecode, and other quirks like 
> > > > > > > the Xen
> > > > > > > one.
> > > > > > 
> > > > > > Yeah, that Xen quirk is quite the killer.
> > > > > > 
> > > > > > Can you provide pseudo-assembly for what the final page is supposed 
> > > > > > to look like?
> > > > > > I'm struggling mightily to understand what this is actually trying 
> > > > > > to do.
> > > > > 
> > > > > > I'll make it as simple as possible (disregard 32bit support and that
> > > > > > xen exists):
> > > > > 
> > > > > vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > > ret
> > > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > > 
> > > > I'm missing who/what defines "here" though.  What generates the CALL 
> > > > that points
> > > > at this exact offset?  If the exact offset is dictated in the TLFS, 
> > > > then aren't
> > > > we screwed with the whole Xen quirk, which inserts 5 bytes before that 
> > > > first VMCALL?
> > > 
> > > Yes, sorry, I should've included some more context.
> > > 
> > > Here's a rundown (from memory) of how the first VTL call happens:
> > >  - CPU0 starts running at VTL0.
> > >  - Hyper-V enables VTL1 on the partition.
> > >  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
> > >    the initial VTL1 CPU state alongside the enablement hypercall
> > >    arguments.
> > >  - Hyper-V sets the Hypercall page overlay address through
> > >    HV_X64_MSR_HYPERCALL. KVM fills it.
> > >  - Hyper-V gets the VTL-call and VTL-return offset into the hypercall
> > >    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> > >    handling is in user-space).
> > 
> > Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets
> > via a HvSetVpRegisters() hypercall.
> 
> No, you didn't understand this correctly. 
> 
> The guest writes the HV_X64_MSR_HYPERCALL, and in the response hyperv fills

When people say "Hyper-V", do y'all mean "root partition"?  If so, can we just
say "root partition"?  Part of my confusion is that I don't instinctively know
whether things like "Hyper-V enables VTL1 on the partition" are talking about
the root partition (or I guess parent partition?) or the hypervisor.
Functionally it probably doesn't matter, it's just hard to reconcile things with
the TLFS, which is written largely to describe the hypervisor's behavior.

> the hypercall page, including the VSM thunks.
>
> Then the guest can _read_ the offsets Hyper-V chose there by issuing another
> hypercall.

Hrm, now I'm really confused.  Ah, the TLFS contradicts itself.  The blurb for
AccessVpRegisters says:

  The partition can invoke the hypercalls HvSetVpRegisters and HvGetVpRegisters.

And HvSetVpRegisters confirms that requirement:

  The caller must either be the parent of the partition specified by
  PartitionId, or the partition specified must be “self” and the partition must
  have the AccessVpRegisters privilege

But it's absent from HvGetVpRegisters:

  The caller must be the parent of the partition specified by PartitionId or the
  partition specifying its own partition ID.

> In the current implementation, the offsets that the kernel chooses are first
> exposed to userspace via a new ioctl, and then userspace exposes these
> offsets to the guest via that 'another hypercall' (reading a pseudo partition
> register 'HvRegisterVsmCodePageOffsets').
> 
> I personally don't know for sure anymore if the userspace or kernel based
> hypercall page is better here, it's ugly regardless :(

Hrm.  Requiring userspace to intercept the WRMSR will be a mess because then KVM
will have zero knowledge of the hypercall page, e.g. userspace would be forced
to intercept HV_X64_MSR_GUEST_OS_ID as well.  That's not the end of the world,
but it's not exactly ideal either.

What if we exit to userspace with a new kvm_hyperv_exit reason that requires
completion?  I.e. punt to userspace if VSM is enabled, but still record the data
in KVM?  Ugh, but even that's a mess because kvm_hv_set_msr_pw() is deep in the
WRMSR emulation call stack and can't easily signal that an exit to userspace is
needed.  Blech.
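
A minimal sketch of the offset question raised earlier in this thread (what
happens to the VTL call entry point when the Xen quirk prepends 5 bytes). It is
an illustration only, not KVM's code: build_page() is a hypothetical helper, and
the byte encodings (VMCALL, RET, MOV RAX,RCX, and OR EAX,0x80000000 as the
5-byte quirk prefix) are assumptions made for the example.

```python
# Illustrative model of the simplified 64-bit hypercall page from this thread.
# The byte encodings are assumptions for the sketch, not copied from KVM.
VMCALL = bytes([0x0F, 0x01, 0xC1])
RET = bytes([0xC3])
MOV_RAX_RCX = bytes([0x48, 0x89, 0xC8])
XEN_QUIRK = bytes([0x0D, 0x00, 0x00, 0x00, 0x80])  # or eax, 0x80000000 (5 bytes)


def build_page(xen_quirk):
    """Return (page_bytes, vtl_call_offset) for the simplified page."""
    page = bytearray()
    if xen_quirk:
        page += XEN_QUIRK          # inserted before the first VMCALL
    page += VMCALL + RET           # regular Hyper-V hypercall entry
    vtl_call_offset = len(page)    # VTL call thunk starts right after RET
    page += MOV_RAX_RCX            # first instruction of the VTL call thunk
    return bytes(page), vtl_call_offset


if __name__ == "__main__":
    for quirk in (False, True):
        _, off = build_page(quirk)
        print(f"xen_quirk={quirk}: VTL call offset = {off}")
```

In this model the VTL call offset moves from 4 to 9 once the quirk is enabled,
so whoever answers the HvRegisterVsmCodePageOffsets read has to know how the
page was actually built.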

Re: [PATCH v9 2/2] arm64: boot: Support Flat Image Tree

2023-12-05 Thread Simon Glass

Hi Ahmad,

On Tue, 5 Dec 2023 at 04:48, Ahmad Fatoum  wrote:
>
> Hello Simon,
>
> On 02.12.23 04:54, Simon Glass wrote:
> > Add a script which produces a Flat Image Tree (FIT), a single file
> > containing the built kernel and associated devicetree files.
> > Compression defaults to gzip which gives a good balance of size and
> > performance.
> >
> > The files compress from about 86MB to 24MB using this approach.
> >
> > The FIT can be used by bootloaders which support it, such as U-Boot
> > and Linuxboot. It permits automatic selection of the correct
> > devicetree, matching the compatible string of the running board with
> > the closest compatible string in the FIT. There is no need for
> > filenames or other workarounds.
> >
> > Add a 'make image.fit' build target for arm64, as well. Use
> > FIT_COMPRESSION to select a different algorithm.
> >
> > The FIT can be examined using 'dumpimage -l'.
> >
> > This feature requires pylibfdt (use 'pip install libfdt'). It also
> > requires compression utilities for the algorithm being used. Supported
> > compression options are the same as the Image.xxx files. For now there
> > is no way to change the compression other than by editing the rule for
> > $(obj)/image.fit
> >
> > While FIT supports a ramdisk / initrd, no attempt is made to support
> > this here, since it must be built separately from the Linux build.
> >
> > Signed-off-by: Simon Glass 
>
> kernel_noload support is now in barebox next branch and I tested this
> series against it:
>
> Tested-by: Ahmad Fatoum  # barebox
>

OK great thank you.

> > +"""Build a FIT containing a lot of devicetree files
> > +
> > +Usage:
> > +make_fit.py -A arm64 -n 'Linux-6.6' -O linux
> > +-f arch/arm64/boot/image.fit -k /tmp/kern/arch/arm64/boot/image.itk
> > +/tmp/kern/arch/arm64/boot/dts/ -E -c gzip
> > +
> > +Creates a FIT containing the supplied kernel and a directory containing the
> > +devicetree files.
> > +
> > +Use -E to generate an external FIT (where the data is placed after the
> > +FIT data structure). This allows parsing of the data without loading
> > +the entire FIT.
> > +
> > +Use -c to compress the data, using bzip2, gzip, lz4, lzma, lzo and
> > +zstd algorithms.
> > +
> > +The resulting FIT can be booted by bootloaders which support FIT, such
> > +as U-Boot, Linuxboot, Tianocore, etc.
>
> Feel free to add barebox to the list. Did you check whether Linuxboot and
> Tianocore support kernel_noload?

Only what I was told by people in those projects. They may not even
look at the load address, but I am not an expert on that.

>
> > +fsw.property_u32('load', 0)
> > +fsw.property_u32('entry', 0)
>
> I still think load and entry dummy values are confusing and should be dropped.

This is what the spec requires at present. But I agree we should
change it. I will dig into that at some point to see what is needed.

>
> > +    with fsw.add_node(f'fdt-{seq}'):
> > +        # Get the compatible / model information
> > +        with open(fname, 'rb') as inf:
> > +            data = inf.read()
> > +        fdt = libfdt.FdtRo(data)
> > +        model = fdt.getprop(0, 'model').as_str()
> > +        compat = fdt.getprop(0, 'compatible')
> > +
> > +        fsw.property_string('description', model)
> > +        fsw.property_string('type', 'flat_dt')
> > +        fsw.property_string('arch', arch)
> > +        fsw.property_string('compression', compress)
> > +        fsw.property('compatible', bytes(compat))
> > +
> > +        with open(fname, 'rb') as inf:
> > +            compressed = compress_data(inf, compress)
> > +        fsw.property('data', compressed)
> > +    return model, compat
>
> After Doug's elaboration, extracting multiple compatibles is fine by me.

OK good.
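
For readers following along: a devicetree 'compatible' property is a list of
NUL-terminated strings, which is why the patch copies the raw bytes rather than
a single string. A minimal sketch of what those bytes contain, using a
hypothetical helper split_compatible() and made-up board names (neither is part
of make_fit.py):

```python
def split_compatible(raw):
    """Split a raw 'compatible' property into its individual strings."""
    if not raw:
        return []
    if not raw.endswith(b'\0'):
        raise ValueError('compatible property must be NUL-terminated')
    # Drop the final terminator, then split on the remaining separators
    return [s.decode('ascii') for s in raw[:-1].split(b'\0')]


if __name__ == '__main__':
    # Hypothetical board: most specific string first, SoC fallback last
    raw = b'vendor,board-rev2\0vendor,board\0vendor,soc\0'
    print(split_compatible(raw))
```

A bootloader can then walk this list in order and pick the FIT configuration
whose compatible string matches most specifically.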

Regards,
Simon

RE: [PATCH v13 26/35] x86/fred: FRED entry/exit and dispatch code

2023-12-05 Thread Li, Xin3

> > diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
> > new file mode 100644
> > index ..215883e90f94
> > --- /dev/null
> > +++ b/arch/x86/entry/entry_fred.c
> > @@ -0,0 +1,230 @@
> > ...
> > +static noinstr void fred_intx(struct pt_regs *regs) {
> > +   switch (regs->fred_ss.vector) {
> > +   /* INT0 */
> 
> INTO (for overflow), not INT-zero.  However...

My bad again...

> > +   case X86_TRAP_OF:
> > +   exc_overflow(regs);
> > +   return;
> > +
> > +   /* INT3 */
> > +   case X86_TRAP_BP:
> > +   exc_int3(regs);
> > +   return;
> 
> ... neither OF nor BP will ever enter fred_intx() because they're type SWEXC,
> not SWINT.

Per FRED spec 5.0, section 7.3 Software Interrupts and Related Instructions:
INT n (opcode CD followed by an immediate byte): There are 256 such
software interrupt instructions, one for each value n of the immediate
byte (0–255).

And appendix B Event Stack Levels:
If the event is an execution of INT n (opcode CD n for 8-bit value n),
the event stack level is 0. The event type is 4 (software interrupt)
and the vector is n.

So int $0x4 and int $0x3 (use asm(".byte 0xCD, 0x03")) get here.

But into (0xCE) and int3 (0xCC) do use event type SWEXC. 

BTW, into is NOT allowed in 64-bit mode but "int $0x4" is allowed.
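
The distinction above can be summarized as a small lookup. This is a sketch
only: classify() is a hypothetical helper, event type 4 for software interrupts
comes from the spec text quoted above, and the value 6 for software exceptions
is an assumption made for the example.

```python
# Event type values: 4 is quoted from the spec above, 6 is assumed here.
EVENT_TYPE_SWINT = 4
EVENT_TYPE_SWEXC = 6

X86_TRAP_BP = 3
X86_TRAP_OF = 4


def classify(insn):
    """Return (event_type, vector) for a software-event instruction."""
    if insn[:1] == b'\xCC':                       # int3
        return EVENT_TYPE_SWEXC, X86_TRAP_BP
    if insn[:1] == b'\xCE':                       # into (not valid in 64-bit mode)
        return EVENT_TYPE_SWEXC, X86_TRAP_OF
    if insn[:1] == b'\xCD' and len(insn) >= 2:    # int n, vector is the immediate
        return EVENT_TYPE_SWINT, insn[1]
    raise ValueError('not a software interrupt/exception instruction')


if __name__ == '__main__':
    for name, insn in (('int3', b'\xCC'), ('int $0x3', b'\xCD\x03'),
                       ('int $0x4', b'\xCD\x04')):
        print(name, classify(insn))
```

So "int $0x3" and int3 share vector 3 but arrive with different event types,
which is what decides whether fred_intx() ever sees them.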

> 
> SWINT is strictly the INT $imm8 instruction.
> 
> > ...
> > +static noinstr void fred_extint(struct pt_regs *regs) {
> > +   unsigned int vector = regs->fred_ss.vector;
> > +
> > +   if (WARN_ON_ONCE(vector < FIRST_EXTERNAL_VECTOR))
> > +   return;
> > +
> > +   if (likely(vector >= FIRST_SYSTEM_VECTOR)) {
> > +   irqentry_state_t state = irqentry_enter(regs);
> > +
> > +   instrumentation_begin();
> > +   sysvec_table[vector - FIRST_SYSTEM_VECTOR](regs);
> 
> array_index_mask_nospec()
> 
> This is easy for an attacker to abuse, to install non-function-pointer targets
> into the indirect predictor.

HPA did use array_index_nospec() at the beginning, but I forgot it later.
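
For reference, the index clamping being asked for can be modelled as below.
This only illustrates the branch-free mask arithmetic; Python gives no
speculation guarantees, the function names mirror the kernel helpers but are
re-implemented here, and the FIRST_SYSTEM_VECTOR value is a placeholder since
the real one depends on the kernel configuration.

```python
BITS_PER_LONG = 64
MASK = (1 << BITS_PER_LONG) - 1

# Placeholder values for illustration; the real ones depend on kernel config.
FIRST_SYSTEM_VECTOR = 0xEC
NR_SYSTEM_VECTORS = 256 - FIRST_SYSTEM_VECTOR


def array_index_mask_nospec(index, size):
    """All-ones if 0 <= index < size, else 0, computed without a branch."""
    v = (index | (size - 1 - index)) & MASK
    sign = ((~v) & MASK) >> (BITS_PER_LONG - 1)   # 1 when index is in bounds
    return (-sign) & MASK                         # replicate that bit across the word


def array_index_nospec(index, size):
    """Clamp index to 0 when it is out of bounds."""
    return index & array_index_mask_nospec(index, size)


if __name__ == '__main__':
    for vector in (FIRST_SYSTEM_VECTOR, 0xFF, 0x100):
        idx = array_index_nospec(vector - FIRST_SYSTEM_VECTOR, NR_SYSTEM_VECTORS)
        print(hex(vector), '->', idx)
```

An out-of-range index collapses to slot 0 instead of reaching past the end of
the table, so a mispredicted bounds check cannot steer the indirect call to an
arbitrary memory location.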

> 
> > +   instrumentation_end();
> > +   irqentry_exit(regs, state);
> > +   } else {
> > +   common_interrupt(regs, vector);
> > +   }
> > +}
> > +
> > +static noinstr void fred_exception(struct pt_regs *regs, unsigned long error_code) {
> > +   /* Optimize for #PF. That's the only exception which matters performance wise */
> > +   if (likely(regs->fred_ss.vector == X86_TRAP_PF)) {
> > +   exc_page_fault(regs, error_code);
> > +   return;
> > +   }
> > +
> > +   switch (regs->fred_ss.vector) {
> > +   case X86_TRAP_DE: return exc_divide_error(regs);
> > +   case X86_TRAP_DB: return fred_exc_debug(regs);
> > +   case X86_TRAP_BP: return exc_int3(regs);
> > +   case X86_TRAP_OF: return exc_overflow(regs);
> 
> Depending on what you want to do with BP/OF vs fred_intx(), this may need
> adjusting.
> 
> If you are cross-checking type and vector, then these should be rejected for
> not being of type HWEXC.

You're right, the event type needs to be SWEXC for into and int3.

However, would that be overkill?  Assuming hardware and VMM are sane.
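
If the cross-check were added, it would amount to something like the following
sketch. bp_of_type_ok() is a hypothetical helper, the numeric event type values
are assumptions, and only the BP/OF case discussed here is covered.

```python
EVENT_TYPE_HWEXC = 3   # assumed value
EVENT_TYPE_SWEXC = 6   # assumed value

X86_TRAP_BP = 3
X86_TRAP_OF = 4
X86_TRAP_PF = 14


def bp_of_type_ok(event_type, vector):
    """Accept #BP/#OF only when delivered as software exceptions."""
    if vector in (X86_TRAP_BP, X86_TRAP_OF):
        return event_type == EVENT_TYPE_SWEXC
    return True   # other vectors are outside the scope of this sketch


if __name__ == '__main__':
    print(bp_of_type_ok(EVENT_TYPE_SWEXC, X86_TRAP_BP))   # int3
    print(bp_of_type_ok(EVENT_TYPE_HWEXC, X86_TRAP_BP))   # mismatched pair
```

The cost is one extra comparison on two rarely taken paths, so the question is
mostly whether a mismatched pair should be treated as a bad event or trusted.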

> 
> > +   case X86_TRAP_BR: return exc_bounds(regs);
> > +   case X86_TRAP_UD: return exc_invalid_op(regs);
> > +   case X86_TRAP_NM: return exc_device_not_available(regs);
> > +   case X86_TRAP_DF: return exc_double_fault(regs, error_code);
> > +   case X86_TRAP_TS: return exc_invalid_tss(regs, error_code);
> > +   case X86_TRAP_NP: return exc_segment_not_present(regs, error_code);
> > +   case X86_TRAP_SS: return exc_stack_segment(regs, error_code);
> > +   case X86_TRAP_GP: return exc_general_protection(regs, error_code);
> > +   case X86_TRAP_MF: return exc_coprocessor_error(regs);
> > +   case X86_TRAP_AC: return exc_alignment_check(regs, error_code);
> > +   case X86_TRAP_XF: return exc_simd_coprocessor_error(regs);
> > +
> > +#ifdef CONFIG_X86_MCE
> > +   case X86_TRAP_MC: return fred_exc_machine_check(regs);
> > +#endif
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +   case X86_TRAP_VE: return exc_virtualization_exception(regs);
> > +#endif
> > +#ifdef CONFIG_X86_KERNEL_IBT
> 
> CONFIG_X86_CET
> 
> Userspace can use CET even if the kernel isn't compiled with IBT, so this
> exception needs handling.

Absolutely correct!

> 
> > +   case X86_TRAP_CP: return exc_control_protection(regs, error_code);
> > +#endif
> > +   default: return fred_bad_type(regs, error_code);
> > +   }
> > +}
> > +
> > +__visible noinstr void fred_entry_from_user(struct pt_regs *regs) {
> > +   unsigned long error_code = regs->orig_ax;
> > +
> > +   /* Invalidate orig_ax so that syscall_get_nr() works correctly */
> > +   regs->orig_ax = -1;
> > +
> > +   switch (regs->fred_ss.type) {
> > +   case EVENT_TYPE_EXTINT:
> > +   return fred_extint(regs);
> > +   case EVENT_TYPE_NMI:
> > +   return fred_exc_nmi(regs);
> > +   case EVENT_TYPE_SWINT:
> > +   return fred_intx(regs);
> > +   case EVENT_TYPE_HWEXC:
> > +   case EVENT_TYP
