https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118174

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I think

int gsum;
int
foo (signed char *p1, signed char *p2)
{
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += __builtin_abs (p1[i] - p2[i]);
  gsum = sum;
}

is handled correctly(?) (btw, I see signed ops):

foo:
.LFB0:
        .cfi_startproc
        mov     x2, x0
        adrp    x3, .LANCHOR0
        ldp     q0, q31, [x1]
        ldp     q1, q28, [x2]
        sabdl2  v29.8h, v1.16b, v0.16b
        sabdl2  v30.8h, v28.16b, v31.16b
        sabal   v29.8h, v1.8b, v0.8b
        sabal   v30.8h, v28.8b, v31.8b
        saddlp  v29.4s, v29.8h
        sadalp  v29.4s, v30.8h
        addv    s29, v29.4s
        str     s29, [x3, #:lo12:.LANCHOR0]

in fact the expand dump shows

;; _88 = .REDUC_PLUS (vect_patt_26.11_80); [tail call]

(insn 14 13 15 (set (reg:V16QI 115 [ vect__3.6_72 ])
        (mem:V16QI (plus:DI (reg/v/f:DI 108 [ p1 ])
                (const_int 16 [0x10])) [0 MEM <vector(16) signed char> [(signed char *)p1_13(D) + 16B]+0 S16 A8])) "t.c":7:29 -1
     (nil))
...

;; return _88;

(insn 28 27 29 (set (reg:V16QI 125 [ vect__3.6_72 ])
        (mem:V16QI (plus:DI (reg/v/f:DI 108 [ p1 ])
                (const_int 16 [0x10])) [0 MEM <vector(16) signed char> [(signed char *)p1_13(D) + 16B]+0 S16 A8])) "t.c":7:29 -1
     (nil))

...

so we're indeed expanding the chain twice somehow.  One obvious issue is that
we're failing to skip expanding the call itself because we're pre-empted by
tail-call handling (of course it isn't a "tailcall", but still).
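
For anyone wanting to poke at this, here is a rough sketch of a variant whose
reduction result feeds the return directly, which is the shape the
"return _88" in the dump suggests.  The names foo_ret, ref_sad and check_sad
are made up for illustration and are not part of the original testcase; the
scalar reference is only there to sanity-check the computed value.

```c
#include <stdlib.h>

/* Hypothetical variant of the testcase: the sum is returned rather than
   stored to a global, so the reduction result is the return value.  */
__attribute__ ((noinline)) int
foo_ret (signed char *p1, signed char *p2)
{
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += __builtin_abs (p1[i] - p2[i]);
  return sum;
}

/* Plain scalar reference for the same sum of absolute differences.  */
static int
ref_sad (const signed char *p1, const signed char *p2, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs ((int) p1[i] - (int) p2[i]);
  return sum;
}

/* Returns 0 when foo_ret agrees with the scalar reference.  */
int
check_sad (void)
{
  signed char a[32], b[32];
  for (int i = 0; i < 32; i++)
    {
      a[i] = (signed char) (i * 7 - 100);   /* stays within [-100, 117] */
      b[i] = (signed char) (50 - i * 5);    /* stays within [-105, 50] */
    }
  return foo_ret (a, b) == ref_sad (a, b, 32) ? 0 : 1;
}
```

Calling check_sad () from a small main and building with optimization on
aarch64, together with -fdump-rtl-expand, should let one compare the expand
dump against the one quoted above; whether this exact variant reproduces the
duplicated chain is an assumption, not something verified here.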

So not applying TER to tail-call direct internal calls fixes this (the
alternative of not tail-calling internal functions is more invasive at this
point).

Testing the obvious patch.
