LLVM Bugs via llvm-bugs Mon, 18 Aug 2025 11:57:45 -0700

Issue	154181
Summary	[X86][Sched][AlderlakeP] Align vector load/use gap to 5c and retune FP ports/latencies (FADD/FMA/FLOGIC/FSHUFFLE/FLOAD)
Labels	new issue
Assignees
Reporter	ms178

    **Disclaimer:** I am an enthusiast user with absolutely no programming background but a long-standing interest in compilers and other open source projects, and I recently began tinkering with parts of the LLVM code base with the help of AI. I found some things that might be of interest to the LLVM community. From the "[Contributing to LLVM](https://llvm.org/docs/Contributing.html)" guideline, I take it that bug reports/issues are welcome even when AI did the heavy lifting (at least no explicit AI exclusion is mentioned there). As I have recently run into issues with some LLVM developers who were strictly opposed to being confronted with AI usage, I want to make clear that I will stop filing such issues if the LLVM community has no interest in them.


I am filing my findings as "Issues" and not as MRs for now, as someone with actual programming skills might be better suited to bring these over the finish line in code review.

**AI tool used:** GPT5-High

**Patch:**

```
From ac03e23de9ede73b705035c347e11e2b57c4183d Mon Sep 17 00:00:00 2001
From: ms178 <m.seyfa...@gmail.com>
Date: Sun, 17 Aug 2025 17:07:48 +0200
Subject: [PATCH] [X86][Sched][AlderlakeP] Align vector load/use gap to 5c and
 retune FP ports/latencies (FADD/FMA/FLOGIC/FSHUFFLE/FLOAD)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Golden Cove-class P-cores (Alder/Raptor Lake) have an L1D load-to-use gap of ~5 cycles and FP pipelines that service FP add/logic on the primary FP/ALU ports (00/01), with classic latencies (FADD=3c, FMA=4c). The prior AlderlakeP model overestimated vector load read-advance (6c) and routed some FP ops (e.g., FADD) through Port05, inflating latency/port pressure and skewing scheduling. This patch corrects those specifics to match public data (Intel Opt. Manual, Agner Fog, uops.info), improving model fidelity and scheduling decisions for AVX(2) workloads typical of client CPUs.

Changes
- ReadAdvance:
  - ReadAfterVecLd/ReadAfterVecXLd/ReadAfterVecYLd: 6 → 5 cycles.
- FP add:
  - WriteFAdd now issues on ADLPPort00_01 (was Port05), Latency = 3 (unchanged).
  - WriteFAddLd ports adjusted to [ADLPPort00_01, ADLPPort02_03_10], Latency 10 → 8.
  - WriteFAdd64{,X,Y} retuned to ADLPPort00_01 with consistent folded-load penalties (5c).
  - WriteFAdd{X,Y} similarly retuned to ADLPPort00_01 with 5c load penalties.
- FP load:
  - WriteFLoad{,X,Y} Latency: set uniformly to 5c (was 7/7/8).
- FP logic/blend:
  - WriteFLogic{,Y} folded-load penalties: 5c (was 7/8); ports remain [ADLPPort00_01_05].
  - WriteFBlend{,Y} folded-load penalties: 5c (was 7/8).
- FMA/MAX:
  - WriteFMA{,X,Y} and WriteFMAX{,Y} folded-load penalties: 5c (was 7/8); ports remain [ADLPPort00_01], core latency 4c unchanged.
- FP shuffle:
  - WriteFShuffle{,Y} and WriteFShuffle256 folded-load penalties: 5c (was 7/8).
- No EVEX/ZMM enablement; AVX-512 remains untouched for client parts.

Impact
- Schedules reflect a 5-cycle L1D load-to-use gap for vectors and correct FP pipeline usage (00/01) on Golden Cove, reducing artificial port pressure on Port05.
- More accurate instruction timing and uop placement for FP-heavy AVX(2) kernels (physics, audio, image processing) and better throughput predictions with llvm-mca.

Testing
- Compiling the kernel and Mesa with this change (and some others) led to consistent performance improvements in Total War: Troy on my 14700KF.

Signed-off-by: Marcus Seyfarth <m.seyfa...@gmail.com>
---
 llvm/lib/Target/X86/X86SchedAlderlakeP.td | 55 ++++++++++-------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/llvm/lib/Target/X86/X86SchedAlderlakeP.td b/llvm/lib/Target/X86/X86SchedAlderlakeP.td
index 564369804711a..0d4934879a24a 100644
--- a/llvm/lib/Target/X86/X86SchedAlderlakeP.td
+++ b/llvm/lib/Target/X86/X86SchedAlderlakeP.td
@@ -87,11 +87,10 @@ def ADLPPortAny : ProcResGroup<[ADLPPort00, ADLPPort01, ADLPPort02, ADLPPort03,
 // until 5 cycles after the memory operand.
 def : ReadAdvance<ReadAfterLd, 5>;
 
-// Vector loads are 6 cycles, so ReadAfterVec*Ld registers needn't be available
-// until 6 cycles after the memory operand.
-def : ReadAdvance<ReadAfterVecLd, 6>;
-def : ReadAdvance<ReadAfterVecXLd, 6>;
-def : ReadAdvance<ReadAfterVecYLd, 6>;
+// Dependent-use read-advance after vector loads: 5 cycles
+def : ReadAdvance<ReadAfterVecLd, 5>;
+def : ReadAdvance<ReadAfterVecXLd, 5>;
+def : ReadAdvance<ReadAfterVecYLd, 5>;
 
 def : ReadAdvance<ReadInt2Fpu, 0>;
 
@@ -208,19 +207,19 @@ defm : ADLPWriteResPair<WriteDiv64, [ADLPPort01], 18, [3], 3>;
 defm : X86WriteRes<WriteDiv8, [ADLPPort01], 17, [3], 3>;
 defm : X86WriteRes<WriteDiv8Ld, [ADLPPort01], 22, [3], 3>;
 defm : X86WriteRes<WriteEMMS, [ADLPPort00, ADLPPort00_05, ADLPPort00_06], 10, [1, 8, 1], 10>;
-def : WriteRes<WriteFAdd, [ADLPPort05]> {
+def : WriteRes<WriteFAdd, [ADLPPort00_01]> {
   let Latency = 3;
 }
-defm : X86WriteRes<WriteFAddLd, [ADLPPort01_05, ADLPPort02_03_10], 10, [1, 1], 2>;
-defm : ADLPWriteResPair<WriteFAdd64, [ADLPPort01_05], 3, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFAdd64X, [ADLPPort01_05], 3, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFAdd64Y, [ADLPPort01_05], 3, [1], 1, 8>;
+defm : X86WriteRes<WriteFAddLd, [ADLPPort00_01, ADLPPort02_03_10], 8, [1, 1], 2>;
+defm : ADLPWriteResPair<WriteFAdd64,  [ADLPPort00_01], 3, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFAdd64X, [ADLPPort00_01], 3, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFAdd64Y, [ADLPPort00_01], 3, [1], 1, 5>;
 defm : X86WriteResPairUnsupported<WriteFAdd64Z>;
-defm : ADLPWriteResPair<WriteFAddX, [ADLPPort01_05], 3, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFAddY, [ADLPPort01_05], 3, [1], 1, 8>;
+defm : ADLPWriteResPair<WriteFAddX, [ADLPPort00_01], 3, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFAddY, [ADLPPort00_01], 3, [1], 1, 5>;
 defm : X86WriteResPairUnsupported<WriteFAddZ>;
-defm : ADLPWriteResPair<WriteFBlend, [ADLPPort00_01_05], 1, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFBlendY, [ADLPPort00_01_05], 1, [1], 1, 8>;
+defm : ADLPWriteResPair<WriteFBlend,  [ADLPPort00_01_05], 1, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFBlendY, [ADLPPort00_01_05], 1, [1], 1, 5>;
 def : WriteRes<WriteFCMOV, [ADLPPort01]> {
   let Latency = 3;
 }
@@ -248,21 +247,15 @@ defm : ADLPWriteResPair<WriteFHAddY, [ADLPPort01_05, ADLPPort05], 5, [1, 2], 3,
 def : WriteRes<WriteFLD0, [ADLPPort00_05]>;
 defm : X86WriteRes<WriteFLD1, [ADLPPort00_05], 1, [2], 2>;
 defm : X86WriteRes<WriteFLDC, [ADLPPort00_05], 1, [2], 2>;
-def : WriteRes<WriteFLoad, [ADLPPort02_03_10]> {
-  let Latency = 7;
-}
-def : WriteRes<WriteFLoadX, [ADLPPort02_03_10]> {
-  let Latency = 7;
-}
-def : WriteRes<WriteFLoadY, [ADLPPort02_03_10]> {
-  let Latency = 8;
-}
-defm : ADLPWriteResPair<WriteFLogic, [ADLPPort00_01_05], 1, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFLogicY, [ADLPPort00_01_05], 1, [1], 1, 8>;
+def : WriteRes<WriteFLoad,  [ADLPPort02_03_10]> { let Latency = 5; }
+def : WriteRes<WriteFLoadX, [ADLPPort02_03_10]> { let Latency = 5; }
+def : WriteRes<WriteFLoadY, [ADLPPort02_03_10]> { let Latency = 5; }
+defm : ADLPWriteResPair<WriteFLogic,  [ADLPPort00_01_05], 1, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFLogicY, [ADLPPort00_01_05], 1, [1], 1, 5>;
 defm : X86WriteResPairUnsupported<WriteFLogicZ>;
-defm : ADLPWriteResPair<WriteFMA, [ADLPPort00_01], 4, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFMAX, [ADLPPort00_01], 4, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFMAY, [ADLPPort00_01], 4, [1], 1, 8>;
+defm : ADLPWriteResPair<WriteFMA,  [ADLPPort00_01], 4, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFMAX, [ADLPPort00_01], 4, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFMAY, [ADLPPort00_01], 4, [1], 1, 5>;
 defm : X86WriteResPairUnsupported<WriteFMAZ>;
 def : WriteRes<WriteFMOVMSK, [ADLPPort00]> {
   let Latency = 3;
@@ -295,9 +288,9 @@ defm : ADLPWriteResPair<WriteFRsqrt, [ADLPPort00], 4, [1], 1, 7>;
 defm : ADLPWriteResPair<WriteFRsqrtX, [ADLPPort00], 4, [1], 1, 7>;
 defm : ADLPWriteResPair<WriteFRsqrtY, [ADLPPort00], 4, [1], 1, 8>;
 defm : X86WriteResPairUnsupported<WriteFRsqrtZ>;
-defm : ADLPWriteResPair<WriteFShuffle, [ADLPPort05], 1, [1], 1, 7>;
-defm : ADLPWriteResPair<WriteFShuffle256, [ADLPPort05], 3, [1], 1, 8>;
-defm : ADLPWriteResPair<WriteFShuffleY, [ADLPPort05], 1, [1], 1, 8>;
+defm : ADLPWriteResPair<WriteFShuffle,    [ADLPPort05], 1, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFShuffle256, [ADLPPort05], 3, [1], 1, 5>;
+defm : ADLPWriteResPair<WriteFShuffleY,   [ADLPPort05], 1, [1], 1, 5>;
 defm : X86WriteResPairUnsupported<WriteFShuffleZ>;
 def : WriteRes<WriteFSign, [ADLPPort00]>;
 defm : ADLPWriteResPair<WriteFSqrt, [ADLPPort00], 12, [1], 1, 7>;
```
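
To make the latency arithmetic behind the patch easier to check, here is a minimal Python sketch. It is not LLVM code: `folded_latency` and `effective_latency` are hypothetical helper names, and the sketch rests on two assumptions about how the scheduling model is interpreted. First, that `ADLPWriteResPair` gives the folded-load form a latency equal to the register latency plus the trailing load-penalty argument (consistent with the explicit `WriteFAddLd` change from 10 to 8 above). Second, that a `ReadAdvance` of N cycles hides up to N cycles of the producer's latency for that operand, clamped at zero. The numbers are copied from the diff.

```python
# Illustrative model only; the helper names are hypothetical, not LLVM APIs.

# (register latency, old folded-load penalty, new folded-load penalty),
# copied from the ADLPWriteResPair lines in the diff above.
PAIRS = {
    "WriteFAdd64":      (3, 7, 5),
    "WriteFAdd64Y":     (3, 8, 5),
    "WriteFBlend":      (1, 7, 5),
    "WriteFLogicY":     (1, 8, 5),
    "WriteFMA":         (4, 7, 5),
    "WriteFMAY":        (4, 8, 5),
    "WriteFShuffle256": (3, 8, 5),
}


def folded_latency(reg_latency, load_penalty):
    """Assumed latency of the folded-load form: register latency + penalty."""
    return reg_latency + load_penalty


def effective_latency(write_latency, read_advance):
    """Assumed producer latency seen by an operand with a ReadAdvance.

    The advance hides part of the producer's latency and is clamped at zero.
    """
    return max(write_latency - read_advance, 0)


if __name__ == "__main__":
    for name, (lat, old_pen, new_pen) in PAIRS.items():
        old = folded_latency(lat, old_pen)
        new = folded_latency(lat, new_pen)
        print(f"{name}Ld: {old}c -> {new}c")

    # Register operand of a folded-load instruction fed by an 8-cycle producer:
    # ReadAfterVec*Ld = 6 before the patch, 5 after it.
    print("old:", effective_latency(8, 6), "new:", effective_latency(8, 5))
```

Under these assumptions, lowering the read-advance from 6 to 5 makes a register operand's producer look one cycle more expensive for every producer whose latency is above 5 cycles, while the folded-load results themselves become 2 to 3 cycles cheaper. Both effects should be confirmed against llvm-mca output rather than taken from this sketch.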

