https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109603
            Bug ID: 109603
           Summary: Vectorization failure for a small loop containing a
                    simple branch
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the following small case,

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NANOSECS 1000000000L

int main(int argc, char * argv[])
{
    long long i, even, odd, c;
    char *eptr;
    struct timespec ts0, ts1;

    c = strtoll(argv[1], &eptr, 10);
    printf("c = %lld \n", c);
    even = odd = 0;

    clock_gettime(CLOCK_MONOTONIC, &ts0);
    for (i = 0; i < c; i++) {
        if (i % 2)
            even++;
        else
            odd++;
    }
    clock_gettime(CLOCK_MONOTONIC, &ts1);

    printf("even = %lld odd = %lld\n", even, odd);
    printf("elapsed %ld\n",
           (ts1.tv_sec - ts0.tv_sec) * NANOSECS + (ts1.tv_nsec - ts0.tv_nsec));
    return 0;
}

Using "-mcpu=neoverse-n1", gcc fails to vectorize the loop, while using
"-mcpu=neoverse-n1 -mtune=generic", or neither -mcpu nor -mtune, gcc
vectorizes it successfully.

============

The scalar version of the loop looks like this:

  400660: 36000381 tbz w1, #0, 4006d0 <main+0xd0>
  400664: 91000694 add x20, x20, #0x1
  400668: 91000421 add x1, x1, #0x1
  40066c: eb01027f cmp x19, x1
  400670: 54ffff81 b.ne 400660 <main+0x60> // b.any
  ...
  4006d0: 910006b5 add x21, x21, #0x1
  4006d4: 17ffffe5 b 400668 <main+0x68>

The vectorized version looks like this (factor=2), and it is much faster on
neoverse-n1:

  400670: 91000421 add x1, x1, #0x1
  400674: 4e241c20 and v0.16b, v1.16b, v4.16b
  400678: 4ee48421 add v1.2d, v1.2d, v4.2d
  40067c: 4ee09800 cmeq v0.2d, v0.2d, #0
  400680: 6e631ca0 bsl v0.16b, v5.16b, v3.16b
  400684: 4ee08442 add v2.2d, v2.2d, v0.2d
  400688: eb13003f cmp x1, x19
  40068c: 54ffff21 b.ne 400670 <main+0x70> // b.any

============

It seems the neoverse-n1 vector cost model is inaccurate and does not work
well for this small case.

(1) For the -mcpu=neoverse-n1 version, the vectorization cost model result is

  Vector inside of loop cost: 12
  Scalar iteration cost: 5

12 > 5*2, so gcc doesn't think vectorization is worthwhile for factor=2.

(2) For the version without -mcpu, the vectorization cost model result is

  Vector inside of loop cost: 4
  Scalar iteration cost: 5

Here the loop body cost for the vectorized version is 4, which is too small
and looks incorrect as well, although in reality the vectorized version is
faster than the scalar version. In contrast, the 12 for -mcpu=neoverse-n1
looks more reasonable, although it blocked the vectorization.
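
The comparison quoted in (1) and (2) can be sketched in a few lines of C. This is only an illustration of the arithmetic, under the assumption that the decision reduces to comparing the vector loop body cost against the scalar iteration cost multiplied by the vectorization factor. GCC's real profitability check also accounts for prologue, epilogue, and other overheads that are not shown in this report. vector_body_is_cheaper is a hypothetical helper name, not a GCC function.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of GCC): returns true when one vector
 * iteration costs less than the vf scalar iterations it replaces.
 * Simplification: prologue/epilogue and other overheads are ignored.
 */
static bool vector_body_is_cheaper(int vector_inside_cost,
                                   int scalar_iter_cost,
                                   int vf)
{
    return vector_inside_cost < scalar_iter_cost * vf;
}
```

With the numbers from this report and factor=2, vector_body_is_cheaper(12, 5, 2) is false (12 is not below 10), which matches the -mcpu=neoverse-n1 outcome, and vector_body_is_cheaper(4, 5, 2) is true (4 is below 10), which matches the outcome without -mcpu.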