Hi,
As mentioned in the PR, for the following test-case:

typedef unsigned char uint8_t;

static inline uint8_t
x264_clip_uint8(uint8_t x)
{
  uint8_t t = -x;
  uint8_t t1 = x & ~63;
  return (t1 != 0) ? t : x;
}

void
mc_weight(uint8_t *restrict dst, uint8_t *restrict src, int n)
{
  for (int x = 0; x < n*16; x++)
    dst[x] = x264_clip_uint8(src[x]);
}

-O3 -mcpu=generic+sve generates the following code for the inner loop:

.L3:
        ld1b    z0.b, p0/z, [x1, x2]
        movprfx z2, z0
        and     z2.b, z2.b, #0xc0
        movprfx z1, z0
        neg     z1.b, p1/m, z0.b
        cmpeq   p2.b, p1/z, z2.b, #0
        sel     z0.b, p2, z0.b, z1.b
        st1b    z0.b, p0, [x0, x2]
        add     x2, x2, x4
        whilelo p0.b, w2, w3
        b.any   .L3

The sel is redundant, since we could instead conditionally negate z0
under the predicate produced by comparing z2 with 0.
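
To make the redundancy concrete at the source level, here is a minimal
scalar sketch (not part of the patch). The names clip_select,
clip_cond_neg and count_mismatches are made up for illustration; the
first function is the test-case as written, the second negates only
when the predicate holds, and the third counts the 8-bit inputs on
which they differ.

```c
#include <stdint.h>

/* Form from the test-case: negate unconditionally, then select. */
static inline uint8_t clip_select(uint8_t x)
{
  uint8_t t = -x;
  uint8_t t1 = x & ~63;
  return (t1 != 0) ? t : x;
}

/* Hypothetical equivalent: negate only when the predicate holds,
   so no separate select is needed. */
static inline uint8_t clip_cond_neg(uint8_t x)
{
  if ((x & ~63) != 0)
    x = -x;
  return x;
}

/* Number of 8-bit inputs on which the two forms disagree. */
static int count_mismatches(void)
{
  int bad = 0;
  for (int i = 0; i < 256; i++)
    bad += clip_select((uint8_t) i) != clip_cond_neg((uint8_t) i);
  return bad;
}
```

Calling count_mismatches() from a small main should report no
disagreement, which is what allows the vector code to drop the sel.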

As suggested in the PR, the attached patch introduces a new
conditional internal function, .COND_NEG, and in gimple-isel replaces
the following sequence:
   op2 = -op1
   op0 = A cmp B
   lhs = op0 ? op1 : op2

with:
   op0 = A inverted_cmp B
   lhs = .COND_NEG (op0, op1, op1).

lhs = .COND_NEG (op0, op1, op1)
means
lhs = -op1 if op0 is true, and lhs = op1 (the fallback) if op0 is false.
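
As a sanity check of the rewrite itself, here is a scalar model of one
lane, under a few assumptions: cond_neg, before_isel, after_isel and
count_rewrite_mismatches are hypothetical names, "A cmp B" is taken to
be an unsigned a < b (so the inverted comparison is a >= b), and the
operands are 8-bit integers. It only models the semantics described
above; it is not how the patch implements them.

```c
#include <stdbool.h>
#include <stdint.h>

/* One lane of .COND_NEG (cond, op, else_val) as described above:
   negate op when cond is true, otherwise return else_val. */
static inline uint8_t cond_neg(bool cond, uint8_t op, uint8_t else_val)
{
  return cond ? (uint8_t) -op : else_val;
}

/* Sequence before the rewrite, with "A cmp B" assumed to be a < b. */
static inline uint8_t before_isel(uint8_t a, uint8_t b, uint8_t op1)
{
  uint8_t op2 = -op1;
  bool op0 = a < b;
  return op0 ? op1 : op2;
}

/* Sequence after the rewrite: inverted comparison feeding .COND_NEG. */
static inline uint8_t after_isel(uint8_t a, uint8_t b, uint8_t op1)
{
  bool op0 = a >= b;
  return cond_neg(op0, op1, op1);
}

/* Number of (a, b, op1) triples on which the two sequences disagree,
   for every a and op1 and a few representative values of b. */
static long count_rewrite_mismatches(void)
{
  static const uint8_t bs[] = { 0, 100, 255 };
  long bad = 0;
  for (int a = 0; a < 256; a++)
    for (int k = 0; k < 3; k++)
      for (int op1 = 0; op1 < 256; op1++)
        bad += before_isel((uint8_t) a, bs[k], (uint8_t) op1)
               != after_isel((uint8_t) a, bs[k], (uint8_t) op1);
  return bad;
}
```

The inversion is exact here because the operands are integers; a
floating-point comparison would need more care around NaNs.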

With the patch, the inner loop becomes:
.L3:
        ld1b    z0.b, p0/z, [x1, x2]
        movprfx z1, z0
        and     z1.b, z1.b, #0xc0
        cmpne   p1.b, p2/z, z1.b, #0
        neg     z0.b, p1/m, z0.b
        st1b    z0.b, p0, [x0, x2]
        add     x2, x2, x4
        whilelo p0.b, w2, w3
        b.any   .L3

While it seems to work for this test-case, I am not entirely sure that
the patch is correct. Does it look like a step in the right direction?

Thanks,
Prathamesh

Attachment: pr93183-1.diff