https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109603

            Bug ID: 109603
           Summary: Vectorization failure for a small loop containing a
                    simple branch
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the following small case,

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NANOSECS        1000000000L

int main(int argc, char * argv[])
{
  long long i, even, odd, c;
  char *eptr;
  struct timespec ts0, ts1;

  c = strtoll(argv[1], &eptr, 10);

  printf("c = %lld \n", c);

  even = odd = 0;

  clock_gettime(CLOCK_MONOTONIC, &ts0);

  for (i = 0; i < c; i++)
  {
    if (i % 2) 
      even++;
    else
      odd++;
  }

  clock_gettime(CLOCK_MONOTONIC, &ts1);

  printf("even = %lld odd = %lld\n", even, odd);
  printf("elapsed %ld\n",
         (ts1.tv_sec - ts0.tv_sec) * NANOSECS + (ts1.tv_nsec - ts0.tv_nsec));

  return 0;
}

With "-mcpu=neoverse-n1", gcc fails to vectorize the loop, whereas with
"-mcpu=neoverse-n1 -mtune=generic", or with neither -mcpu nor -mtune, gcc
vectorizes it successfully.

============

The scalar version of the loop looks like this:

  400660:       36000381        tbz     w1, #0, 4006d0 <main+0xd0>
  400664:       91000694        add     x20, x20, #0x1
  400668:       91000421        add     x1, x1, #0x1
  40066c:       eb01027f        cmp     x19, x1
  400670:       54ffff81        b.ne    400660 <main+0x60>  // b.any
  ...
  4006d0:       910006b5        add     x21, x21, #0x1
  4006d4:       17ffffe5        b       400668 <main+0x68>

The vectorized version looks like this (factor=2), and it is much faster on
neoverse-n1.

  400670:       91000421        add     x1, x1, #0x1
  400674:       4e241c20        and     v0.16b, v1.16b, v4.16b
  400678:       4ee48421        add     v1.2d, v1.2d, v4.2d
  40067c:       4ee09800        cmeq    v0.2d, v0.2d, #0
  400680:       6e631ca0        bsl     v0.16b, v5.16b, v3.16b
  400684:       4ee08442        add     v2.2d, v2.2d, v0.2d
  400688:       eb13003f        cmp     x1, x19
  40068c:       54ffff21        b.ne    400670 <main+0x70>  // b.any

============

It seems the neoverse-n1 vector cost model is inaccurate and does not work
well for this small case.

(1) For the -mcpu=neoverse-n1 version, the vectorization cost model result is

Vector inside of loop cost: 12
Scalar iteration cost: 5

12 > 5*2, so gcc concludes that vectorizing with factor=2 is not worthwhile.

(2) For the version without -mcpu, the vectorization cost model result is

Vector inside of loop cost: 4
Scalar iteration cost: 5

In fact, a loop body cost of 4 for the vectorized version is too small and
looks incorrect as well, even though in reality the vectorized version is
faster than the scalar version. In contrast, the 12 for -mcpu=neoverse-n1
looks more reasonable, although it blocks the vectorization.
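
The comparison used in (1) and (2) can be written out as a small C sketch.
It covers only the inequality quoted above: the vector loop body cost
against the scalar iteration cost times the factor. gcc's real cost model
weighs more than this (the vectorizer dump also reports prologue and
epilogue costs), and the function name vector_body_profitable is
hypothetical.

```c
#include <stdbool.h>

/* Simplified per-iteration check: one vector iteration replaces `factor`
   scalar iterations, so it pays off only if it costs less than they do. */
bool vector_body_profitable(int vector_inside_cost,
                            int scalar_iteration_cost,
                            int factor)
{
  return vector_inside_cost < scalar_iteration_cost * factor;
}
```

With the numbers from the report, vector_body_profitable(12, 5, 2) is false
(12 > 10) and vector_body_profitable(4, 5, 2) is true (4 < 10), matching the
two outcomes described above.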