https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118174
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I think

int gsum;
int foo (signed char *p1, signed char *p2)
{
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += __builtin_abs (p1[i] - p2[i]);
  gsum = sum;
}

is handled correctly(?) (btw, I see signed ops):

foo:
.LFB0:
        .cfi_startproc
        mov     x2, x0
        adrp    x3, .LANCHOR0
        ldp     q0, q31, [x1]
        ldp     q1, q28, [x2]
        sabdl2  v29.8h, v1.16b, v0.16b
        sabdl2  v30.8h, v28.16b, v31.16b
        sabal   v29.8h, v1.8b, v0.8b
        sabal   v30.8h, v28.8b, v31.8b
        saddlp  v29.4s, v29.8h
        sadalp  v29.4s, v30.8h
        addv    s29, v29.4s
        str     s29, [x3, #:lo12:.LANCHOR0]

in fact the expand dump shows

;; _88 = .REDUC_PLUS (vect_patt_26.11_80); [tail call]

(insn 14 13 15 (set (reg:V16QI 115 [ vect__3.6_72 ])
        (mem:V16QI (plus:DI (reg/v/f:DI 108 [ p1 ])
                (const_int 16 [0x10])) [0 MEM <vector(16) signed char> [(signed char *)p1_13(D) + 16B]+0 S16 A8])) "t.c":7:29 -1
     (nil))
...

;; return _88;

(insn 28 27 29 (set (reg:V16QI 125 [ vect__3.6_72 ])
        (mem:V16QI (plus:DI (reg/v/f:DI 108 [ p1 ])
                (const_int 16 [0x10])) [0 MEM <vector(16) signed char> [(signed char *)p1_13(D) + 16B]+0 S16 A8])) "t.c":7:29 -1
     (nil))
...

so we're indeed expanding the chain twice somehow.  One obvious issue is
that we're failing to skip expanding the call itself because we're pre-empted
by tail-call handling (of course it isn't a "tailcall", but still).

So not applying TER to tail-call direct internal calls fixes this (the
alternative to not tail-call internal functions is more invasive at this
point).

Testing the obvious patch.
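
For a runtime sanity check of the loop above, here is a minimal sketch. It is not part of the patch or the testsuite. The names sad32, sad32_ref and sad32_selfcheck are hypothetical, and the added return value is an assumption made only so that a caller can compare results. The reference loop uses a volatile accumulator in the hope of keeping it out of the reduction vectorizer.

```c
#include <assert.h>
#include <stdlib.h>

int gsum;

/* Same loop as the testcase; returning the sum is an assumption added
   here so the result can be checked by a caller.  */
__attribute__ ((noinline))
int sad32 (const signed char *p1, const signed char *p2)
{
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += abs (p1[i] - p2[i]);
  gsum = sum;
  return sum;
}

/* Hypothetical scalar reference.  The volatile accumulator forces one
   load and store per iteration, so this is not a vectorizable reduction.  */
int sad32_ref (const signed char *p1, const signed char *p2)
{
  volatile int sum = 0;
  for (int i = 0; i < 32; i++)
    {
      int d = p1[i] - p2[i];
      sum += d < 0 ? -d : d;
    }
  return sum;
}

/* Feed extreme signed char values to both versions and compare.
   Each |a[i] - b[i]| is 255, so the total is 32 * 255.  */
int sad32_selfcheck (void)
{
  signed char a[32], b[32];
  for (int i = 0; i < 32; i++)
    {
      a[i] = (i & 1) ? 127 : -128;
      b[i] = (i & 1) ? -128 : 127;
    }
  int got = sad32 (a, b);
  assert (got == sad32_ref (a, b));
  return got;
}
```

Calling sad32_selfcheck () from a small main at -O2 or -O3 should exercise the vectorized path on targets that have one. A wrong result from a miscompiled expansion would show up as an assertion failure.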