https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sayle at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.c:2:35: note: Cost model analysis:
_1 + _2 1 times scalar_stmt costs 4 in body
a.x 1 times scalar_load costs 12 in body
a.y 1 times scalar_load costs 12 in body
a.x 1 times unaligned_load (misalign -1) costs 12 in body
_1 + _2 1 times vector_stmt costs 4 in body
_1 + _2 1 times vec_perm costs 4 in body
_1 + _2 1 times vec_to_scalar costs 4 in body
_1 + _2 0 times scalar_stmt costs 0 in body
t.c:2:35: note: Cost model analysis for part in loop 0:
  Vector cost: 24
  Scalar cost: 28
t.c:2:35: note: Basic block will be vectorized using SLP

This is the vectorizer costing not knowing that a.y and a.x are readily
available in registers, so the cost of 24 for the two scalar loads doesn't
actually exist. On the vector side there's the issue that we spill.

We are expanding from

  vect__1.5_5 = MEM <vector(2) long int> [(long int *)&a];
  _6 = VIEW_CONVERT_EXPR<vector(2) unsigned long>(vect__1.5_5);
  _7 = .REDUC_PLUS (_6); [tail call]
  _8 = (long int) _7;
  return _8;

;; _7 = .REDUC_PLUS (_6); [tail call]

(insn 10 9 11 (set (reg:V1TI 108)
        (lshiftrt:V1TI (subreg:V1TI (reg/v:TI 102 [ a ]) 0)
            (const_int 64 [0x40]))) -1
     (nil))

(insn 11 10 12 (set (reg:V2DI 107)
        (subreg:V2DI (reg:V1TI 108) 0)) -1
     (nil))

(insn 12 11 13 (set (reg:V2DI 106)
        (plus:V2DI (reg:V2DI 107)
            (subreg:V2DI (reg/v:TI 102 [ a ]) 0))) -1
     (nil))

(insn 13 12 0 (set (reg:DI 100 [ _7 ])
        (vec_select:DI (reg:V2DI 106)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

that's not unreasonable.
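
The totals in the cost-model dump above can be re-added by hand. The sketch below only copies the per-statement costs from the dump; the function names are hypothetical and not part of GCC. It also shows what the scalar side would cost if the two scalar loads were treated as free, which is the situation described for arguments that already sit in registers.

```c
/* Re-adding the per-statement costs from the dump. Only the numbers
   come from the dump; the function names are hypothetical. */

/* Scalar side: one add plus the loads of a.x and a.y.
   Pass loads_are_free != 0 to model arguments already in registers. */
int scalar_cost(int loads_are_free)
{
    int add_cost = 4;                          /* scalar_stmt */
    int load_cost = loads_are_free ? 0 : 12;   /* scalar_load, per load */
    return add_cost + 2 * load_cost;
}

/* Vector side: unaligned_load + vector_stmt + vec_perm + vec_to_scalar. */
int vector_cost(void)
{
    return 12 + 4 + 4 + 4;
}
```

With the loads counted, scalar_cost(0) is 28 against vector_cost() at 24, so SLP vectorization is chosen. With the loads treated as free, scalar_cost(1) drops to 4, well below 24.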
Note we set up TI 102 like

(insn 2 8 3 2 (set (reg:DI 104)
        (reg:DI 5 di [ a ])) "t.c":2:23 -1
     (nil))
(insn 3 2 4 2 (set (reg:DI 105)
        (reg:DI 4 si [ a+8 ])) "t.c":2:23 -1
     (nil))
(insn 4 3 5 2 (set (reg:TI 103)
        (zero_extend:TI (reg:DI 104))) "t.c":2:23 -1
     (nil))
(insn 5 4 6 2 (set (reg:TI 103)
        (ior:TI (and:TI (reg:TI 103)
                (const_wide_int 0x0ffffffffffffffff))
            (ashift:TI (zero_extend:TI (reg:DI 105))
                (const_int 64 [0x40])))) "t.c":2:23 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 102 [ a ])
        (reg:TI 103)) "t.c":2:23 -1
     (nil))

and the task is to "recover" from the back-and-forth. Unfortunately combine
fails:

Trying 5, 10 -> 12:
    5: r103:TI=zero_extend(r111:DI)<<0x40|zero_extend(r110:DI)
      REG_DEAD r111:DI
      REG_DEAD r110:DI
   10: r108:V1TI=r103:TI#0 0>>0x40
   12: r106:V2DI=r108:V1TI#0+r103:TI#0
      REG_DEAD r108:V1TI
      REG_DEAD r103:TI
Failed to match this instruction:
(set (reg:V2DI 106)
    (plus:V2DI (subreg:V2DI (lshiftrt:V1TI (subreg:V1TI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                            (const_int 64 [0x40]))
                        (zero_extend:TI (reg:DI 110))) 0)
                (const_int 64 [0x40])) 0)
        (subreg:V2DI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                    (const_int 64 [0x40]))
                (zero_extend:TI (reg:DI 110))) 0)))

It isn't clear why we end up spilling, why STV2 doesn't help in the end, or
what exactly the reason is that neither combine nor late-combine nor forwprop
helps.

Of course the vectorizer costing is off here - load/store cost is dominating
it in general and I've mentioned decreasing the load/store costing compared
to the arithmetic stmt costing. Still I would expect RTL optimizations to
recover from this failure and resurrect the scalar add of the incoming
register arguments.

Roger is very good at analyzing this stuff, so CCing him.

The regression is because the target now exposes the two-lane V2DImode
reduc_plus pattern (if that were fed by a much larger sequence of
vectorizable arithmetic it should be a win).
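
As a rough illustration of the "back-and-forth" described above, the RTL can be modelled in plain C. This is a sketch under stated assumptions, not the actual test case: `struct pair`, `scalar_sum`, `build_ti`, `reduc_plus_model` and `vector_path_sum` are hypothetical names, the shape of t.c is inferred from the a.x / a.y references in the dump, and the code assumes an LP64 target with GCC's `unsigned __int128` extension. It shows that building the TImode value from the two incoming registers and then reducing it yields the same result as the plain scalar add.

```c
#include <stdint.h>

/* Hypothetical stand-in for the argument type in t.c: two 64-bit members
   passed by value (assumed to arrive in two integer registers). */
struct pair { long x, y; };

/* What the RTL optimizers are expected to recover: one scalar add. */
long scalar_sum(struct pair a)
{
    return a.x + a.y;
}

/* Model of insns 4-5: zero-extend the low half, then OR in the high
   half shifted left by 64 bits. */
static unsigned __int128 build_ti(uint64_t lo, uint64_t hi)
{
    unsigned __int128 t = lo;                              /* insn 4 */
    t = (t & 0xffffffffffffffffULL)
        | ((unsigned __int128)hi << 64);                   /* insn 5 */
    return t;
}

/* Model of insns 10-13: shift the whole 128-bit value right by 64 so the
   high lane lands in lane 0, add lane-wise, then select lane 0. */
static uint64_t reduc_plus_model(unsigned __int128 a)
{
    unsigned __int128 shifted = a >> 64;                   /* insn 10 */
    uint64_t lane0 = (uint64_t)shifted + (uint64_t)a;      /* insn 12, lane 0 */
    return lane0;                                          /* insn 13 */
}

/* The vectorized path end to end, starting from the incoming registers. */
long vector_path_sum(struct pair a)
{
    return (long)reduc_plus_model(build_ti((uint64_t)a.x, (uint64_t)a.y));
}
```

For inputs that do not overflow, `vector_path_sum` and `scalar_sum` agree, which is why one would hope combine or a later pass could collapse the sequence back to a single add of the two argument registers.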