Hi,
As mentioned in PR, for the following test-case:

    typedef unsigned char uint8_t;

    static inline uint8_t x264_clip_uint8(uint8_t x)
    {
      uint8_t t = -x;
      uint8_t t1 = x & ~63;
      return (t1 != 0) ? t : x;
    }

    void mc_weight(uint8_t *restrict dst, uint8_t *restrict src, int n)
    {
      for (int x = 0; x < n*16; x++)
        dst[x] = x264_clip_uint8(src[x]);
    }

-O3 -mcpu=generic+sve generates the following code for the inner loop:

    .L3:
            ld1b    z0.b, p0/z, [x1, x2]
            movprfx z2, z0
            and     z2.b, z2.b, #0xc0
            movprfx z1, z0
            neg     z1.b, p1/m, z0.b
            cmpeq   p2.b, p1/z, z2.b, #0
            sel     z0.b, p2, z0.b, z1.b
            st1b    z0.b, p0, [x0, x2]
            add     x2, x2, x4
            whilelo p0.b, w2, w3
            b.any   .L3

The sel is redundant, since we could conditionally negate z0 under the
predicate that results from comparing z2 with 0.

As suggested in the PR, the attached patch introduces a new conditional
internal function, .COND_NEG, and in gimple-isel replaces the following
sequence:

    op2 = -op1
    op0 = A cmp B
    lhs = op0 ? op1 : op2

with:

    op0 = A inverted_cmp B
    lhs = .COND_NEG (op0, op1, op1)

Here lhs = .COND_NEG (op0, op1, op1) means that lhs = neg (op1) if the
condition is true, and lhs falls back to op1 if the condition is false.

With the patch, the following code is generated:

    .L3:
            ld1b    z0.b, p0/z, [x1, x2]
            movprfx z1, z0
            and     z1.b, z1.b, #0xc0
            cmpne   p1.b, p2/z, z1.b, #0
            neg     z0.b, p1/m, z0.b
            st1b    z0.b, p0, [x0, x2]
            add     x2, x2, x4
            whilelo p0.b, w2, w3
            b.any   .L3

While it seems to work for this test-case, I am not entirely sure whether
the patch is correct. Does it look like a step in the right direction?

Thanks,
Prathamesh
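
P.S. To make the intended semantics concrete, here is a minimal scalar sketch in C of the rewrite described above. It is only an illustration and not code from the patch: cond_neg, clip_select, clip_cond_neg and forms_agree are hypothetical names, and cond_neg models .COND_NEG (cond, a, fallback) on a single uint8_t lane under the assumption that the result is -a when cond is true and fallback otherwise.

```c
#include <stdint.h>

/* Hypothetical scalar model of .COND_NEG (cond, a, fallback):
   negate A when COND is true, otherwise return FALLBACK.
   This is an assumption-based illustration, not GCC code.  */
static inline uint8_t cond_neg(int cond, uint8_t a, uint8_t fallback)
{
  return cond ? (uint8_t) -a : fallback;
}

/* Original form: compute the negation unconditionally, then select.  */
static inline uint8_t clip_select(uint8_t x)
{
  uint8_t t = -x;
  uint8_t t1 = x & ~63;
  return (t1 == 0) ? x : t;   /* lhs = op0 ? op1 : op2, with op2 = -op1 */
}

/* Rewritten form: invert the comparison and negate conditionally.  */
static inline uint8_t clip_cond_neg(uint8_t x)
{
  uint8_t t1 = x & ~63;
  return cond_neg(t1 != 0, x, x);   /* lhs = .COND_NEG (op0, op1, op1) */
}

/* Returns 1 if both forms agree on every possible uint8_t input.  */
static int forms_agree(void)
{
  for (int i = 0; i < 256; i++)
    if (clip_select((uint8_t) i) != clip_cond_neg((uint8_t) i))
      return 0;
  return 1;
}
```

Calling forms_agree () from a small test driver exhaustively checks that the two scalar forms agree for this test-case. It says nothing about whether the vector lowering in the patch is correct.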
pr93183-1.diff