https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119351
--- Comment #24 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-14 branch has been updated by Tamar Christina
<tnfch...@gcc.gnu.org>:

https://gcc.gnu.org/g:8f4df0d836f2618933f2a3e0f14a478af52aec37

commit r14-11690-g8f4df0d836f2618933f2a3e0f14a478af52aec37
Author: Tamar Christina <tamar.christ...@arm.com>
Date:   Mon Apr 28 12:59:54 2025 +0100

    middle-end: fix masking for partial vectors and early break [PR119351]

    The following testcase shows an incorrect masked codegen:

    #define N 512
    #define START 1
    #define END 505

    int x[N] __attribute__((aligned(32)));

    int __attribute__((noipa))
    foo (void)
    {
      int z = 0;
      for (unsigned int i = START; i < END; ++i)
        {
          z++;
          if (x[i] > 0)
            continue;
          return z;
        }
      return -1;
    }

    Notice how there is a continue there instead of a break.  This means we
    generate a control flow where success stays within the loop iteration:

      mask_patt_9.12_46 = vect__1.11_45 > { 0, 0, 0, 0 };
      vec_mask_and_47 = mask_patt_9.12_46 & loop_mask_41;
      if (vec_mask_and_47 == { -1, -1, -1, -1 })
        goto <bb 4>; [41.48%]
      else
        goto <bb 15>; [58.52%]

    However, when loop_mask_41 is a partial mask this comparison can lead to
    an incorrect match.  In this case the mask is:

      # loop_mask_41 = PHI <next_mask_63(6), { 0, -1, -1, -1 }(2)>

    due to peeling for alignment with masking and compiling with
    -msve-vector-bits=128.

    At codegen time we generate:

            ptrue   p15.s, vl4
            ptrue   p7.b, vl1
            not     p7.b, p15/z, p7.b
    .L5:
            ld1w    z29.s, p7/z, [x1, x0, lsl 2]
            cmpgt   p7.s, p7/z, z29.s, #0
            not     p7.b, p15/z, p7.b
            ptest   p15, p7.b
            b.none  .L2
            ...<early exit>...

    Here the basic blocks are rotated and a not is generated.  But the
    generated not is unmasked (or, in this case, predicated over an all-true
    mask).  This has the unintended side effect of flipping the results of
    the inactive lanes (which were zeroed by the cmpgt) into -1, which then
    incorrectly causes us to not take the branch to .L2.

    This happens because we are not comparing against the right value for
    the forall case.
    This patch gets rid of the forall case by rewriting if (all(mask)) into
    if (!all(mask)), which is the same as if (any(~mask)), by negating the
    masks and flipping the branches.

    1. For unmasked loops we simply reduce the ~mask.
    2. For masked loops we reduce (~mask & loop_mask), which is the same as
       doing (mask & loop_mask) ^ loop_mask.

    For the above we now generate:

    .L5:
            ld1w    z28.s, p7/z, [x1, x0, lsl 2]
            cmple   p7.s, p7/z, z28.s, #0
            ptest   p15, p7.b
            b.none  .L2

    This fixes gromacs with > 1 OpenMP threads and improves performance.

    gcc/ChangeLog:

            PR tree-optimization/119351
            * tree-vect-stmts.cc (vectorizable_early_exit): Mask both
            operands of the gcond for partial masking support.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/119351
            * gcc.target/aarch64/sve/pr119351.c: New test.
            * gcc.target/aarch64/sve/pr119351_run.c: New test.

(cherry picked from commit 7cf5503e0af52f5b726da4274a148590c57a458a)