[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 08 Aug 2024 02:08:03 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sayle at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.c:2:35: note: Cost model analysis: 
_1 + _2 1 times scalar_stmt costs 4 in body
a.x 1 times scalar_load costs 12 in body
a.y 1 times scalar_load costs 12 in body 
a.x 1 times unaligned_load (misalign -1) costs 12 in body
_1 + _2 1 times vector_stmt costs 4 in body
_1 + _2 1 times vec_perm costs 4 in body 
_1 + _2 1 times vec_to_scalar costs 4 in body
_1 + _2 0 times scalar_stmt costs 0 in body
t.c:2:35: note: Cost model analysis for part in loop 0:
  Vector cost: 24
  Scalar cost: 28
t.c:2:35: note: Basic block will be vectorized using SLP

This is the vectorizer costing not knowing that a.x and a.y are readily
available in registers, so the cost of 24 for the two scalar loads doesn't
actually exist.

On the vector side there's the issue that we spill.  We are expanding from

  vect__1.5_5 = MEM <vector(2) long int> [(long int *)&a];
  _6 = VIEW_CONVERT_EXPR<vector(2) unsigned long>(vect__1.5_5);
  _7 = .REDUC_PLUS (_6); [tail call]
  _8 = (long int) _7;
  return _8;

;; _7 = .REDUC_PLUS (_6); [tail call]

(insn 10 9 11 (set (reg:V1TI 108)
        (lshiftrt:V1TI (subreg:V1TI (reg/v:TI 102 [ a ]) 0)
            (const_int 64 [0x40]))) -1
     (nil))

(insn 11 10 12 (set (reg:V2DI 107)
        (subreg:V2DI (reg:V1TI 108) 0)) -1
     (nil))

(insn 12 11 13 (set (reg:V2DI 106)
        (plus:V2DI (reg:V2DI 107)
            (subreg:V2DI (reg/v:TI 102 [ a ]) 0))) -1
     (nil))

(insn 13 12 0 (set (reg:DI 100 [ _7 ])
        (vec_select:DI (reg:V2DI 106)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

that's not unreasonable.  Note we set up TI 102 like

(insn 2 8 3 2 (set (reg:DI 104)
        (reg:DI 5 di [ a ])) "t.c":2:23 -1
     (nil))
(insn 3 2 4 2 (set (reg:DI 105)
        (reg:DI 4 si [ a+8 ])) "t.c":2:23 -1
     (nil))
(insn 4 3 5 2 (set (reg:TI 103)
        (zero_extend:TI (reg:DI 104))) "t.c":2:23 -1
     (nil))
(insn 5 4 6 2 (set (reg:TI 103)
        (ior:TI (and:TI (reg:TI 103)
                (const_wide_int 0x0ffffffffffffffff))
            (ashift:TI (zero_extend:TI (reg:DI 105))
                (const_int 64 [0x40])))) "t.c":2:23 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 102 [ a ])
        (reg:TI 103)) "t.c":2:23 -1
     (nil))

and the task is to "recover" from the back-and-forth.  Unfortunately
combine fails:

Trying 5, 10 -> 12:
    5: r103:TI=zero_extend(r111:DI)<<0x40|zero_extend(r110:DI)
      REG_DEAD r111:DI
      REG_DEAD r110:DI
   10: r108:V1TI=r103:TI#0 0>>0x40
   12: r106:V2DI=r108:V1TI#0+r103:TI#0
      REG_DEAD r108:V1TI
      REG_DEAD r103:TI
Failed to match this instruction: 
(set (reg:V2DI 106)
    (plus:V2DI (subreg:V2DI (lshiftrt:V1TI (subreg:V1TI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                            (const_int 64 [0x40]))
                        (zero_extend:TI (reg:DI 110))) 0)
                (const_int 64 [0x40])) 0)
        (subreg:V2DI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                    (const_int 64 [0x40]))
                (zero_extend:TI (reg:DI 110))) 0)))

It isn't clear why we end up spilling, why STV2 doesn't help in the end, or
what exactly the reason is that neither combine nor late-combine nor forwprop
helps.

Of course the vectorizer costing is off here: load/store cost generally
dominates it, and I've mentioned decreasing the load/store costing relative
to the arithmetic stmt costing.

Still, I would expect RTL optimizations to recover from this failure and
resurrect the scalar add of the incoming register arguments.

Roger is very good at analyzing this stuff, so CCing him.

The regression is because the target now exposes the two-lane V2DImode
reduc_plus pattern (if that were fed by a much larger sequence of
vectorizable arithmetic it should be a win).

[Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition