https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Sorry for the slow response, had a few days off.

The regression here can be reproduced through this example loop:
https://godbolt.org/z/jnGe5x4P7

For the current loop in snappy what you want is -UALIGNED_DATA -UALIGNED_LOAD -ULONG_CTR,

and indeed if you benchmark this you'll see that it's slower than the version
with -fno-tree-vectorize.
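
For reference, here is a minimal sketch of the kind of early-break compare loop the reproducer contains (the real code is in the godbolt link above; the function name and exact shape are my assumptions, only the macro knobs correspond):

/* Minimal sketch of the reproducer's early-break compare loop.
   ALIGNED_DATA would control how the caller aligns the buffers
   themselves; it does not appear in the loop body.  */
#include <stdint.h>
#include <string.h>

#ifdef LONG_CTR
typedef long ctr_t;   /* wide counter: keeps the IV SLP instance at the same VF */
#else
typedef int ctr_t;    /* int counter: creates the VF-4 IV instance and the unroll */
#endif

/* Count how many leading bytes of a and b match, comparing 8 bytes at a
   time and breaking out early on the first mismatching chunk.  */
long
match_length (const char *a, const char *b, long n)
{
  ctr_t i = 0;
  while (i + 8 <= n)
    {
      uint64_t va, vb;
#ifdef ALIGNED_LOAD
      va = *(const uint64_t *) (a + i);   /* assumes suitably aligned pointers */
      vb = *(const uint64_t *) (b + i);
#else
      memcpy (&va, a + i, 8);             /* unaligned-safe 8-byte load */
      memcpy (&vb, b + i, 8);
#endif
      if (va != vb)
        break;                            /* the early break being vectorized */
      i += 8;
    }
  while (i < n && a[i] == b[i])
    i++;
  return i;
}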

This is because the vector loop can't be entered: the source arrays a and b are
misaligned with respect to the vector alignment.  It seems that when we version
loops we don't enforce the vector alignment, which would have helped in this case
as the arrays are local data.

Now by playing with the value of ALIGNED_DATA and ALIGNED_LOAD you can generate
various forms of this loop.

Any loop with ALIGNED_DATA on will enter the vector loop. 

For instance I'm comparing

gcc snappy-rep.c -O3 -fno-tree-vectorize -o snappy-rep-baseline-aligned.exe -DALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -fno-tree-vectorize -o snappy-rep-baseline-unaligned.exe -UALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-aligned-load-aligned-data.exe -DALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-aligned-load-unaligned-data.exe -DALIGNED_LOAD -UALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-unaligned-load-data.exe -UALIGNED_LOAD -UALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-unaligned-load-aligned-data.exe -UALIGNED_LOAD -DALIGNED_DATA

benchmark snappy-rep-baseline-aligned.exe snappy-rep-baseline-unaligned.exe snappy-rep-unaligned-load-data.exe snappy-rep-aligned-load-aligned-data.exe snappy-rep-unaligned-load-aligned-data.exe snappy-rep-aligned-load-unaligned-data.exe

and from this we can see that the vector code would have been faster than a byte
loop, but doesn't beat the vectorized unaligned loop.
There are two reasons for this.  The first is that we force an unroll due to the
unfortunate side effect of us building an SLP tree with the loop IVs:

note:   Final SLP tree for instance 0x1236bce0:
note:   node 0x1225bbb8 (max_nunits=4, refcnt=2) vector(4) int
note:   op template: i_12 = i_21 + 8;
note:           stmt 0 i_12 = i_21 + 8;
note:           children 0x1225bc50 0x1225bce8
note:   node 0x1225bc50 (max_nunits=4, refcnt=2) vector(4) int
note:   op template: i_21 = PHI <i_12(7), 0(6)>
note:           [l] stmt 0 i_21 = PHI <i_12(7), 0(6)>
note:           children (nil) (nil)
note:   node (constant) 0x1225bce8 (max_nunits=1, refcnt=1)
note:           { 8 }

and because of the type being integer this ends up picking a higher VF than the
other instances:

note:   Final SLP tree for instance 0x1236a4b0:
note:   node 0x1225b9f0 (max_nunits=2, refcnt=2) vector(2) long unsigned int
note:   op template: b_14 = b_20 + 8;
note:           stmt 0 b_14 = b_20 + 8;
note:           children 0x1225ba88 0x1225bb20
note:   node 0x1225ba88 (max_nunits=2, refcnt=2) vector(2) long unsigned int
note:   op template: b_20 = PHI <b_14(7), b_9(D)(6)>
note:           [l] stmt 0 b_20 = PHI <b_14(7), b_9(D)(6)>
note:           children (nil) (nil)
note:   node (constant) 0x1225bb20 (max_nunits=1, refcnt=1)
note:           { 8 }

Which means the DI mode loop has to be unrolled once and so we generate:

.L6:
        ldr     q31, [x5, x3]
        ldr     q25, [x4, x3]
        ldr     q30, [x1, x3]
        ldr     q26, [x0, x3]
        add     x3, x3, 32
        cmeq    v31.2d, v31.2d, v25.2d
        cmeq    v30.2d, v30.2d, v26.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x9, d31
        cbz     x9, .L4

Which means there's both a load throughput bottleneck here and an fmov
bottleneck.

The cost model at the moment doesn't model throughput bottlenecks for early break.
This means that during costing, while we know the vector code has a load bottleneck,
we still think we can issue all the vector operations in ~2 cycles:

note:  Original vector body cost = 34
note:  Scalar issue estimate:
note:    load operations = 2
note:    store operations = 0
note:    general operations = 4
note:    reduction latency = 0
note:    estimated min cycles per iteration = 1.000000
note:    estimated cycles per vector iteration (for VF 4) = 4.000000
note:  Vector issue estimate:
note:    load operations = 4
note:    store operations = 0
note:    general operations = 9
note:    reduction latency = 0
note:    estimated min cycles per iteration = 2.250000
note:  Cost model analysis: 

On paper the vector loop (2.25 cycles per iteration) therefore looks like a clear
win over the scalar estimate (1 cycle per iteration, i.e. 4 cycles for a VF-4's
worth of work).  That's why, even if we enter the vector loop, it would still be
slower than the uint64_t scalar code, which doesn't have a throughput bottleneck.

Unfortunately I don't know what I can do here for GCC 15.  We have plans to
significantly rework cost modelling next year to model throughput more
extensively, but that's not a stage-4 patch.

I could maybe increase the cost of early break loops with unroll factors to
reflect the increased bottleneck.  This could work and be theoretically sound.

I'll give it a try...

But the question remains: can early-break vectorized code ever beat such
hand-written, optimized scalar code?

The first thing is that we should remove the unroll.  This can be done by using a
long for the counter rather than an int.  In the vectorizer we should also be able
to use V2SI here instead of V4SI; that's another thing that would be good to add.
It seems reasonable for early break to keep the unroll factor low, because early
breaks stop the rest of the vector code from running until you know whether you
are continuing or not.

Using -UALIGNED_DATA -UALIGNED_LOAD -DLONG_CTR (https://godbolt.org/z/v5qbPr18E)
we get better code:

.L6:
        ldr     q31, [x1, x2]
        add     x3, x3, 1
        ldr     q26, [x0, x2]
        add     x2, x2, 16
        cmeq    v31.2d, v31.2d, v26.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x7, d31
        cbz     x7, .L4

This is unfortunately about even with the scalar code.  However, if we replace
the Adv. SIMD sequence with an SVE compare and make sure we eliminate the ptest,
we end up with:

.L6:
        ldr     q26, [x1, x2]
        add     x3, x3, 1
        ldr     q27, [x0, x2]
        cmpne   p15.d, p7/z, z26.d, z27.d
        b.none  .L4

Which is indeed faster:

The columns are, left to right:
  (1) vs snappy-rep-baseline-unaligned3.exe
  (2) vs snappy-rep-aligned-load-aligned-data3.exe
  (3) vs snappy-rep-unaligned-load-aligned-data3.exe

+-------------+---------+---------+---------+
| name        | (1)     | (2)     | (3)     |
+=============+=========+=========+=========+
| cortex-a510 | -78.57% | -85.64% | -89.0%  |
+-------------+---------+---------+---------+
| cortex-a520 | -70.23% | -79.01% | -83.15% |
+-------------+---------+---------+---------+
| cortex-x925 | -69.58% | -77.72% | -80.1%  |
+-------------+---------+---------+---------+
| neoverse v3 | -69.26% | -76.33% | -78.54% |
+-------------+---------+---------+---------+
| cortex-x3   | -69.21% | -72.9%  | -78.64% |
+-------------+---------+---------+---------+
| cortex-x4   | -68.96% | -76.69% | -79.16% |
+-------------+---------+---------+---------+
| neoverse v2 | -68.52% | -72.84% | -78.34% |
+-------------+---------+---------+---------+
| cortex-x2   | -66.6%  | -70.16% | -75.48% |
+-------------+---------+---------+---------+
| cortex-a710 | -66.23% | -69.19% | -75.07% |
+-------------+---------+---------+---------+
| neoverse n2 | -65.67% | -68.77% | -74.58% |
+-------------+---------+---------+---------+
| neoverse v1 | -60.97% | -63.47% | -68.74% |
+-------------+---------+---------+---------+
| neoverse n3 | -52.01% | -57.57% | -59.48% |
+-------------+---------+---------+---------+

The comparisons here are against the naive scalar loop with bytes, e.g.

-O3 -fno-tree-vectorize -o snappy-rep-baseline-aligned.exe -DALIGNED_LOAD -DALIGNED_DATA

So here we see that vectorization can beat both the naive and the current
snappy code, by on average ~10%.

So how do we get there?

I have a patch to use the SVE compare for Adv. SIMD for GCC 16.  I'll work on
one limiting the unroll factor next.
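
For illustration, the effect of the SVE-compare idea can be sketched with ACLE intrinsics (this is my sketch under stated assumptions, not the actual patch): a predicated cmpne puts the mismatch result straight into a predicate register, so the exit test becomes a flag-setting compare plus b.none instead of cmeq/umaxp/fmov/cbz.  It assumes an SVE-enabled target and a hypothetical helper name:

/* Sketch only: use an SVE predicated compare on 128-bit chunks so the
   mismatch test lands in a predicate register.  Requires SVE
   (e.g. -march=armv8.2-a+sve).  */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index (in 64-bit elements) of the first mismatching
   128-bit chunk, or n if all n elements (n even) compare equal.  */
size_t
first_mismatch_chunk (const uint64_t *a, const uint64_t *b, size_t n)
{
  svbool_t p2 = svptrue_pat_b64 (SV_VL2);       /* exactly two 64-bit lanes */
  for (size_t i = 0; i < n; i += 2)
    {
      svuint64_t va = svld1_u64 (p2, a + i);
      svuint64_t vb = svld1_u64 (p2, b + i);
      svbool_t ne = svcmpne_u64 (p2, va, vb);   /* cmpne already sets the flags */
      if (svptest_any (p2, ne))                 /* ideally folds to b.any/b.none */
        return i;
    }
  return n;
}

Whether the explicit ptest actually disappears depends on the compiler recognising that the governing predicate of the compare and of the test are the same, which is the elimination mentioned above.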

Then there is the fact that we don't enter the vector loop.  We can address that
in two ways:

1. We should implement peeling for alignment on mutually misaligned buffers, which would catch some cases like this one.
2. We should implement first-faulting loads support in the vectorizer; this loop is better done with SVE (see the sketch below).
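
As a rough illustration of point 2 (again my sketch, not proposed vectorizer output), a first-faulting SVE loop can walk the buffers without any alignment setup or versioning, since lanes that would fault are simply dropped from the predicate of loaded lanes:

/* Sketch only: compare two byte buffers with SVE first-faulting loads.
   Requires SVE.  Returns the index of the first mismatching byte, or n
   if there is none.  */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

size_t
ff_mismatch (const uint8_t *a, const uint8_t *b, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      svbool_t pg = svwhilelt_b8 ((uint64_t) i, (uint64_t) n);
      svsetffr ();                           /* reset the first-fault register */
      svuint8_t va = svldff1_u8 (pg, a + i);
      svuint8_t vb = svldff1_u8 (pg, b + i);
      svbool_t loaded = svrdffr_z (pg);      /* lanes both loads completed */
      svbool_t ne = svcmpne_u8 (loaded, va, vb);
      if (svptest_any (loaded, ne))          /* early break on first mismatch */
        return i + svcntp_b8 (loaded, svbrkb_z (loaded, ne));
      i += svcntp_b8 (pg, loaded);           /* advance by lanes actually loaded */
    }
  return n;
}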

So for now, I'll try to have the cost model reject such loops where the scalar
is hand-optimized and the vector is unrolled.

The rest is for GCC 16.
