[Bug target/117562] [15 Regression] 40% slowdown of 482.sphinx3 on Zen4, Zen5 since r15-5120-g9a62c149589103

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 21 Nov 2024 05:53:36 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117562


--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
  34.37%        419183  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
  17.28%        212804  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0

it's the inlined copy of vector_gautbl_eval_logs3 where we now seem to
hit vectorized code for

        for (i = 0; i < veclen; i++) {
            diff1 = x[i] - m1[i];
            dval1 -= diff1 * diff1 * v1[i];
            diff2 = x[i] - m2[i];
            dval2 -= diff2 * diff2 * v2[i];
        }

We never hit the zmm version, spend few cycles in the ymm version and now
very many in the xmm version:

   772 │        vmovaps       -0xc0(%rbp),%xmm6
     4 │        vmulpd        %xmm13,%xmm13,%xmm13
       │        vmulpd        %xmm1,%xmm1,%xmm1
102664 │        vfnmadd132pd  %xmm14,%xmm3,%xmm13
     1 │        vcvtps2pd     %xmm6,%xmm3
       │      diff2 = x[i] - m2[i];
     3 │        vmovaps       -0xd0(%rbp),%xmm6
       │      dval1 -= diff1 * diff1 * v1[i];
 95472 │        vfnmadd132pd  %xmm1,%xmm13,%xmm3

We end up also applying basic-block
vectorization to the scalar loop via the stores after it:

        if (dval1 < gautbl->distfloor)
            dval1 = gautbl->distfloor;
        if (dval2 < gautbl->distfloor)
            dval2 = gautbl->distfloor;

        score[r] = (int32)(f * dval1);
        score[r+1] = (int32)(f * dval2);
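
For experimenting outside of SPEC, the loop and the stores after it can be
put together into a stand-alone function.  This is only a sketch: the function
name and the parameter list are made up, and the types (float inputs, double
accumulators) are an assumption inferred from the vcvtps2pd/vfnmadd132pd
sequence in the profile, not copied from the benchmark source.

```c
#include <stdint.h>

/* Hypothetical stand-alone version of the kernel; not the benchmark source.
   Assumption: x, m1, m2, v1, v2 are float and the accumulators are double.  */
void gau_eval_pair_sketch(const float *x,
                          const float *m1, const float *v1,
                          const float *m2, const float *v2,
                          int veclen,
                          double dval1, double dval2,
                          double f, double distfloor,
                          int32_t *score, int r)
{
    double diff1, diff2;
    int i;

    /* The scalar loop quoted above: two reductions over the same x[].  */
    for (i = 0; i < veclen; i++) {
        diff1 = x[i] - m1[i];
        dval1 -= diff1 * diff1 * v1[i];
        diff2 = x[i] - m2[i];
        dval2 -= diff2 * diff2 * v2[i];
    }

    /* Clamp both results to the floor.  */
    if (dval1 < distfloor)
        dval1 = distfloor;
    if (dval2 < distfloor)
        dval2 = distfloor;

    /* Two adjacent stores: these are what basic-block vectorization
       starts from, pulling dval1/dval2 into one two-lane vector.  */
    score[r] = (int32_t)(f * dval1);
    score[r + 1] = (int32_t)(f * dval2);
}
```

The adjacent stores to score[r] and score[r+1] are the reason dval1 and dval2
end up as the two lanes of a single V2DF accumulator.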

Interestingly, without LTO this specific instance doesn't behave that badly;
it is even faster with the three epilogues:

  12.63%        132184  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3
  11.55%        119447  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3

in turn a similar case elsewhere shows up:

  27.63%        285537  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] mgau_eval
  15.19%        158646  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] mgau_eval

that's basically exactly the same loop kernel.

What we can see here is

   130 │       vmovhps       %xmm4,0x60(%rsp)
     5 │       vmovhps       %xmm2,0x70(%rsp)
       │       vmovaps       0x70(%rsp),%xmm0
 10552 │       vcvtps2pd     %xmm4,%xmm6
  1204 │       vcvtps2pd     %xmm2,%xmm3
       │       vcvtps2pd     %xmm0,%xmm2
  6872 │       vmovaps       0x60(%rsp),%xmm0
  1598 │       vmulpd        %xmm3,%xmm3,%xmm3
     5 │       vmulpd        %xmm2,%xmm2,%xmm2
 15119 │       vfnmadd132pd  %xmm6,%xmm1,%xmm3
    38 │       vcvtps2pd     %xmm0,%xmm1
  5088 │       vfnmadd132pd  %xmm2,%xmm3,%xmm1
 33818 │       vunpckhpd     %xmm1,%xmm1,%xmm2
 39938 │       vaddpd        %xmm1,%xmm2,%xmm2
 37932 │       test          $0x3,%cl

There is an odd spill/reload at 0x60(%rsp) whose origin I can't see.

Could it be that V2SFmode is spilled as V2SFmode but reloaded as V4SFmode?!

#(insn:TI 1161 1175 1578 15 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
#            (parallel [
#                    (const_int 6 [0x6])
#                    (const_int 7 [0x7])
#                    (const_int 2 [0x2])
#                    (const_int 3 [0x3])
#                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
#     (expr_list:REG_DEAD (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566])
#        (nil)))
        vmovhps %xmm2, 112(%rsp)        # 1161  [c=4 l=8]  sse_movhlps/4
#(insn 1578 1161 1173 15 (set (reg:V4SF 20 xmm0 [853])
#        (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])) "cont_mgau.c":157:9 2354 {movv4sf_internal}
#     (nil))
        vmovaps 112(%rsp), %xmm0        # 1578  [c=10 l=8]  movv4sf_internal/3

Huh.  It looks like this is from a V4SF -> 2xV2DF extension via
vec_unpack_{hi,lo}_expr.
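
To make the V4SF -> 2xV2DF extension concrete, here is a small scalar model of
what the lo/hi unpack computes, with the hi half done in the two steps the RTL
shows (move the upper pair down, then widen the low pair).  The type and
function names are invented for illustration, and the element order assumes
the usual x86 little-endian lane numbering.

```c
/* Hypothetical scalar model; names are not from GCC.  */
typedef struct { float e[4]; } vec4f;
typedef struct { double e[2]; } vec2d;

/* cvtps2pd-like step: widen the low two float elements to double.  */
vec2d unpack_lo(vec4f v)
{
    vec2d r = { { (double)v.e[0], (double)v.e[1] } };
    return r;
}

/* The hi unpack as seen in the dump: first a movhlps-like step that
   moves elements 2,3 into positions 0,1 (the upper two result elements
   are don't-care here), then the same widening as for the lo half.  */
vec2d unpack_hi(vec4f v)
{
    vec4f shifted = { { v.e[2], v.e[3], v.e[2], v.e[3] } };
    return unpack_lo(shifted);
}
```

In this model the movhlps-like step is pure data movement, so when its
destination lives in memory the following widening has to reload it first,
which is the vmovhps/vmovaps pair seen above.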

Originally this is

(insn 1161 1160 1162 58 (set (reg:V4SF 853)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
                (reg:V4SF 566 [ vect__811.229 ]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
        (nil)))

but LRA chooses the memory alternative for the destination:

      Choosing alt 4 in insn 1161:  (0) m  (1) 0  (2) v {sse_movhlps}
         Considering alt=0 of insn 1162:   (0) =v  (1) v
          overall=6,losers=1,rld_nregs=1
      Choosing alt 0 in insn 1162:  (0) =v  (1) v {sse2_cvtps2pd}
      Creating newreg=998 from oldreg=853, assigning class ALL_SSE_REGS to r998
 1162: r568:V2DF=float_extend(vec_select(r998:V4SF,parallel))
    Inserting insn reload before:
 1578: r998:V4SF=r853:V4SF


This is the following pattern:

(define_insn "sse_movhlps"
  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
    (vec_select:V4SF
      (vec_concat:V8SF
        (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
        (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
      (parallel [(const_int 6)
             (const_int 7)
             (const_int 2)
             (const_int 3)])))]
  "TARGET_SSE && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   movhlps\t{%2, %0|%0, %2}
   vmovhlps\t{%2, %1, %0|%0, %1, %2}
   movlps\t{%H2, %0|%0, %H2}
   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
   %vmovhps\t{%2, %0|%q0, %2}"
  [(set_attr "isa" "noavx,avx,noavx,avx,*")
   (set_attr "type" "ssemov2")
   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])

Indeed the "mode" attr says V2SF for the memory (store) alternative, but
the operand is V4SFmode.  Also LRA doesn't seem to understand that
match_operand 1 (constraint "0") should be the same memory as operand 0.
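
A scalar model of the pattern may help to see why the memory alternative is
valid only when operand 1 is the very same memory as operand 0.  This is a
sketch with made-up names; it models the vec_select/vec_concat RTL and an
8-byte store of the upper half of operand 2, nothing more.

```c
#include <string.h>

/* Hypothetical model type; not a GCC type.  */
typedef struct { float e[4]; } v4sf_model;

/* Full semantics of the pattern:
   vec_select (vec_concat (op1, op2)) with indices [6 7 2 3].  */
v4sf_model movhlps_model(v4sf_model op1, v4sf_model op2)
{
    float cat[8];
    memcpy(cat, op1.e, sizeof op1.e);       /* elements 0..3 = op1 */
    memcpy(cat + 4, op2.e, sizeof op2.e);   /* elements 4..7 = op2 */
    v4sf_model r = { { cat[6], cat[7], cat[2], cat[3] } };
    return r;
}

/* Memory alternative (the "m, 0, v" column): a movhps-style store that
   writes only 8 bytes, the upper half of op2, to the low half of the
   destination.  The upper half of the destination is left untouched,
   which is correct only because op1 is tied to the destination.  */
void movhps_store_model(v4sf_model *mem, v4sf_model op2)
{
    memcpy(mem->e, &op2.e[2], 2 * sizeof(float));
}
```

With mem initialized to op1, both functions yield the same four elements.
If the slot has never been written with op1, its upper half is whatever
happened to be there.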

(insn 1161 1160 1578 59 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (nil))

it's also odd that I don't see the spill to
(mem/c:V4SF (plus:DI (reg/f:DI 7 sp) (const_int 112 [0x70]))
that LRA would need to generate for the input operand.

Clearly something is odd here and clearly this alternative is bad.
