https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148
--- Comment #3 from cuilili <lili.cui at intel dot com> ---
I reproduced the S1244 regression on znver3.

Source code:

for (int i = 0; i < LEN_1D-1; i++) {
    a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
    d[i] = a[i] + a[i+1];
}

--------------------------------------------------------------------
Base version:                        Base + commit version:

Assembler                            Assembler
Loop1:                               Loop1:
vmovsd 0x60c400(%rax),%xmm2          vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1          vmovsd 0x60c400(%rax),%xmm1
add $0x8,%rax                        add $0x8,%rax
--------------------------------------------------------------------
vaddsd %xmm1,%xmm2,%xmm0             vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2             vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1        vfmadd132sd %xmm1,%xmm2,%xmm1
--------------------------------------------------------------------
vaddsd %xmm1,%xmm0,%xmm0             vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)          vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0    vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)          vmovsd %xmm0,0x60aff8(%rax)
cmp $0x9f8,%rax                      cmp $0x9f8,%rax
jne Loop1                            jne Loop1

In the base version, the vmulsd result feeds the vfmadd132sd, so the
multiply and the FMA are dependent, which lengthens the critical
dependency chain. In the base + commit version the two FMAs are
independent of each other.

I have not found out why znver3 regresses. The same binary running on
ICX shows an 11% gain (with #define iterations 100000000).
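
To make the dependency difference easier to see, here is a rough C sketch
of the two evaluation orders as I read them from the middle section of the
two listings. The helper names eval_base and eval_commit are hypothetical
and do not come from GCC or TSVC; x and y just stand for the two loaded
values. The plain multiply-add expressions stand in for vfmadd132sd;
whether a compiler actually contracts them into FMAs depends on the target
and flags, so treat this as an illustration of the expression shape only.

```c
#include <stdio.h>

/* Hypothetical helper: association seen in the base loop body.
   y*y is computed first (vmulsd) and its result feeds the
   multiply-add, so the chain is mul -> fma -> add. */
static double eval_base(double x, double y)
{
    double sum = x + y;       /* vaddsd, independent of the chain below */
    double yy  = y * y;       /* vmulsd */
    double sq  = x * x + yy;  /* vfmadd132sd, must wait for yy */
    return sum + sq;          /* final vaddsd */
}

/* Hypothetical helper: association seen in the base + commit loop body.
   The two multiply-adds do not depend on each other, so the chain
   is fma -> add. */
static double eval_commit(double x, double y)
{
    double p = x * x + y;     /* first vfmadd132sd */
    double q = y * y + x;     /* second vfmadd132sd, independent of p */
    return p + q;             /* final vaddsd */
}
```

Both helpers compute x + y + x*x + y*y, only grouped differently, e.g.
printf("%f %f\n", eval_base(2.0, 3.0), eval_commit(2.0, 3.0)) prints the
same value twice for these exactly representable inputs. For general
inputs the two groupings may round differently.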