Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> Here’s a respin of the patch.
>
> The following example
>
> void f5(float * restrict z0, float * restrict z1, float *restrict x,
>       float * restrict y, float c, int n)
> {
>     for (int i = 0; i < n; i++) {
>         float a = x[i];
>         float b = y[i];
>         if (a > b) {
>             z0[i] = a + b;
>             if (a > c) {
>                 z1[i] = a - b;
>             }
>         }
>     }
> }
>
> generates currently:
>
>         ptrue   p3.b, all
>         ld1w    z1.s, p1/z, [x2, x5, lsl 2]
>         ld1w    z2.s, p1/z, [x3, x5, lsl 2]
>         fcmgt   p0.s, p3/z, z1.s, z0.s
>         fcmgt   p2.s, p1/z, z1.s, z2.s
>         fcmgt   p0.s, p0/z, z1.s, z2.s
>         and     p0.b, p0/z, p1.b, p1.b
>
> The conditions for a > b and a > c become separate comparisons.
>
> After this patch we generate:
>
>         ld1w    z1.s, p0/z, [x2, x5, lsl 2]
>         ld1w    z2.s, p0/z, [x3, x5, lsl 2]
>         fcmgt   p1.s, p0/z, z1.s, z2.s
>         fcmgt   p1.s, p1/z, z1.s, z0.s
>
> Here the conditions a > b && a > c are folded by reusing the predicate result
> of the previous compare, which allows one of the compares to be removed.
>
> Whenever a mask is generated from a BIT_AND we mask the operands of the AND
> instead and then just AND the result.
>
> This allows us to CSE the masks and generate the right combination.  However,
> because re-assoc will try to re-order the masks in the &, we now have to
> perform a small local CSE on the vectorized loop if vectorization is
> successful.
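
As a quick sanity check of the lane algebra this relies on, here is a tiny
scalar model (not vectoriser code; each bit stands for one lane and the mask
values are made up): ANDing one operand of the combined condition with the
loop mask first selects the same lanes as masking the combined condition
afterwards, and the shared "a > b" & loop_mask term is exactly the mask the
z0[] store already needs, so it can be CSE'd.

  /* Scalar model only: bits = lanes, values are arbitrary examples.  */
  #include <cassert>
  #include <cstdint>

  int
  main ()
  {
    const uint8_t loop_mask = 0x3f;   /* lanes still in range */
    const uint8_t cmp_ab    = 0x5a;   /* lanes where a > b    */
    const uint8_t cmp_ac    = 0x33;   /* lanes where a > c    */

    /* Mask one operand of the AND first ...  */
    uint8_t masked_ab = cmp_ab & loop_mask;  /* also the mask for the z0[] store */
    uint8_t combined  = masked_ab & cmp_ac;  /* mask for the z1[] store          */

    /* ... which selects the same lanes as masking the combined condition,  */
    assert (combined == ((cmp_ab & cmp_ac) & loop_mask));
    /* and the result is already ANDed with the loop mask.  */
    assert ((combined & loop_mask) == combined);
    return 0;
  }
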
>
> Note: This patch series works towards generating the most efficient code for
>       this and other loops in small, incremental steps.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and x86_64-linux-gnu with
> no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       * tree-vect-stmts.c (vectorizable_operation): When combining two masks,
>       mask the operands instead of the combined operation.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/sve/pred-combine-and.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c b/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ed7fb591ec69dbdafe27fc9aa08a0b0910c94003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pred-combine-and.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --save-temps" } */
> +
> +void f5(float * restrict z0, float * restrict z1, float *restrict x, float * restrict y, float c, int n)
> +{
> +    for (int i = 0; i < n; i++) {
> +        float a = x[i];
> +        float b = y[i];
> +        if (a > b) {
> +            z0[i] = a + b;
> +            if (a > c) {
> +                z1[i] = a - b;
> +            }
> +        }
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tfcmgt\tp[0-9]+\.s, p[0-9]+/z, z[0-9]+\.s, z[0-9]+\.s} 2 } } */
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 1f56e10709e8f27d768c04f7ef914e2cd9347c36..27ee48aea429810a37777d907435a92b8fd1817d 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -6302,10 +6302,39 @@ vectorizable_operation (vec_info *vinfo,
>       }
>        else
>       {
> +       /* When combining two masks check if either of them has already been
> +          combined with a loop mask, if that's the case we can mark that the

I think something like:
   s/has already been combined with/is elsewhere combined with/
might be clearer.  The other statement that ANDs with the loop mask might
come before or after this one.  In the latter case it won't have been
vectorised yet.

> +          new combined mask doesn't need to be combined with a loop mask.  */

The way I imagined this last part working is that, after vectorising
a BIT_AND_EXPR like this:

  vop0' = vop & loop_mask_N
  vres = vop0' & vop1

we should record that vres is effectively already ANDed with loop_mask_N,
and so there's no need to create:

  vres' = vres & loop_mask_N

when vectorising the innermost IFN_MASK_STORE.

This could be a new hash_set in the loop_vec_info, containing
(condition, loop_mask) pairs for which condition & loop_mask == condition.
prepare_load_store_mask could check whether (vec_mask, loop_mask) is in
this set and return vec_mask unaltered if so.
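
To make that concrete, here is a rough standalone sketch of the idea
(stand-in types and names only: loop_vinfo_model, cond_known_masked and
prepare_mask are hypothetical and not the real GCC API).  The loop_vec_info
records (condition, loop mask) pairs for which condition & loop_mask ==
condition, and the prepare_load_store_mask-style helper returns the mask
unaltered when the pair is already in the set.

  // Illustrative model only: stand-in types, not the real GCC structures.
  #include <cassert>
  #include <set>
  #include <string>
  #include <utility>

  typedef std::string ssa_name;  // stands in for a gimple SSA name ("tree")
  // (condition, index of the loop mask it is known to be subsumed by)
  typedef std::pair<ssa_name, int> masked_pair;

  struct loop_vinfo_model
  {
    // Hypothetical analogue of the proposed hash_set: pairs for which
    // condition & loop_mask == condition already holds.
    std::set<masked_pair> cond_known_masked;
  };

  // Model of (a renamed) prepare_load_store_mask: return VEC_MASK unaltered
  // if it is already subsumed by loop mask LOOP_MASK_IDX, otherwise "emit"
  // VEC_MASK & loop_mask.
  static ssa_name
  prepare_mask (const loop_vinfo_model &lv, const ssa_name &vec_mask,
                int loop_mask_idx)
  {
    if (lv.cond_known_masked.count (masked_pair (vec_mask, loop_mask_idx)))
      return vec_mask;
    return "(" + vec_mask + " & loop_mask_" + std::to_string (loop_mask_idx) + ")";
  }

  int
  main ()
  {
    loop_vinfo_model lv;

    // After vectorising the BIT_AND_EXPR as
    //   vop0' = vop0 & loop_mask_0
    //   vres  = vop0' & vop1
    // we know vres & loop_mask_0 == vres, so record that fact.
    lv.cond_known_masked.insert (masked_pair ("vres", 0));

    // The innermost IFN_MASK_STORE can then use vres directly ...
    assert (prepare_mask (lv, "vres", 0) == "vres");
    // ... while an unrelated mask still gets ANDed with the loop mask.
    assert (prepare_mask (lv, "vcond", 0) == "(vcond & loop_mask_0)");
    return 0;
  }

In the real implementation the key would presumably hash the tree together
with the mask's (nvectors, index) pair, much as scalar_cond_masked_key does
for its operands.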

This new set would in a sense have the opposite effect of
scalar_cond_masked_key.  scalar_cond_masked_key encourages statements
to create ANDs with the loop mask in cases where they might not have
done otherwise.  The new set would instead tell statements (and
specifically prepare_load_store_mask) that such ANDs aren't necessary.

I guess we should also pass other ANDs with the loop mask through
prepare_load_store_mask (and rename it), so that they get the benefit too.

It looks like the patch tried to do that by entering scalar_dest
into the scalar_cond_masked_set.  I think that's likely to be
counter-productive though, since it would tell other statements
in the loop that they should AND the scalar dest with the loop mask
even if they wouldn't have done normally.

> +       if (masked_loop_p && code == BIT_AND_EXPR)
> +         {
> +           scalar_cond_masked_key cond1 (op0, ncopies);

Nit, but: calling them cond0 and cond1 might be better, to avoid
having cond1 describe op0 rather than op1.

The patch looks good otherwise.  I think we're there in terms of
deciding when to generate the ANDs.  It's just a case of what
to do with the result.

Thanks,
Richard

> +           if (loop_vinfo->scalar_cond_masked_set.contains (cond1))
> +             {
> +               tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
> +                                               vectype, i);
> +
> +               vop0 = prepare_load_store_mask (TREE_TYPE (mask), mask, vop0, gsi);
> +               scalar_cond_masked_key cond (scalar_dest, ncopies);
> +               loop_vinfo->scalar_cond_masked_set.add (cond);
> +             }
> +
> +           scalar_cond_masked_key cond2 (op1, ncopies);
> +           if (loop_vinfo->scalar_cond_masked_set.contains (cond2))
> +             {
> +               tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
> +                                               vectype, i);
> +
> +               vop1 = prepare_load_store_mask (TREE_TYPE (mask), mask, vop1, gsi);
> +               scalar_cond_masked_key cond (scalar_dest, ncopies);
> +               loop_vinfo->scalar_cond_masked_set.add (cond);
> +             }
> +         }
> +
>         new_stmt = gimple_build_assign (vec_dest, code, vop0, vop1, vop2);
>         new_temp = make_ssa_name (vec_dest, new_stmt);
>         gimple_assign_set_lhs (new_stmt, new_temp);
>         vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
>         if (vec_cvt_dest)
>           {
>             new_temp = build1 (VIEW_CONVERT_EXPR, vectype_out, new_temp);
