https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili <lili.cui at intel dot com> ---
I reproduced the S1244 regression on znver3.

Source code:

for (int i = 0; i < LEN_1D-1; i++)
  {
    a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
    d[i] = a[i] + a[i+1];
  }
--------------------------------------------------------
Base version:                     Base + commit version:            

Assembler                         Assembler                         
Loop1:                            Loop1:                            
vmovsd 0x60c400(%rax),%xmm2       vmovsd 0x60ba00(%rax),%xmm2       
vmovsd 0x60ba00(%rax),%xmm1       vmovsd 0x60c400(%rax),%xmm1       
add    $0x8,%rax                  add    $0x8,%rax                  
--------------------------------------------------------------------
vaddsd %xmm1,%xmm2,%xmm0          vmovsd %xmm2,%xmm2,%xmm0          
vmulsd %xmm2,%xmm2,%xmm2          vfmadd132sd %xmm2,%xmm1,%xmm0     
vfmadd132sd %xmm1,%xmm2,%xmm1     vfmadd132sd %xmm1,%xmm2,%xmm1     
--------------------------------------------------------------------
vaddsd %xmm1,%xmm0,%xmm0          vaddsd %xmm1,%xmm0,%xmm0          
vmovsd %xmm0,0x60cdf8(%rax)       vmovsd %xmm0,0x60cdf8(%rax)       
vaddsd 0x60ce00(%rax),%xmm0,%xmm0 vaddsd 0x60ce00(%rax),%xmm0,%xmm0 
vmovsd %xmm0,0x60aff8(%rax)       vmovsd %xmm0,0x60aff8(%rax)       
cmp    $0x9f8,%rax                cmp    $0x9f8,%rax                
jne    Loop1                      jne    Loop1


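In the Base version, the FMA depends on the result of the preceding multiply
(vmulsd feeds vfmadd132sd), which lengthens the critical dependency chain. In
the Base + commit version, the two FMAs are independent of each other. I have
not yet found out why znver3 regresses. The same binary running on ICX shows
an 11% gain (with #define iterations 100000000).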
