https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108
--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Sorry for the slow response, I had a few days off.

The regression here can be reproduced with this example loop:
https://godbolt.org/z/jnGe5x4P7

For the current loop in snappy what you want is -UALIGNED_DATA -UALIGNED_LOAD
-ULONG_CTR, and indeed if you benchmark this you'll see that it's slower than
the version built with -fno-tree-vectorize.  This is because the vector loop
can't be entered, as the source arrays a and b are misaligned with respect to
the vector alignment.  It seems that when we version loops we don't enforce
the vector alignment, which would have helped in this case since the arrays
are local data.

Now, by playing with the values of ALIGNED_DATA and ALIGNED_LOAD you can
generate various forms of this loop.  Any variant with ALIGNED_DATA defined
will enter the vector loop.  For instance I'm comparing:

gcc snappy-rep.c -O3 -fno-tree-vectorize -o snappy-rep-baseline-aligned.exe -DALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -fno-tree-vectorize -o snappy-rep-baseline-unaligned.exe -UALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-aligned-load-aligned-data.exe -DALIGNED_LOAD -DALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-aligned-load-unaligned-data.exe -DALIGNED_LOAD -UALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-unaligned-load-data.exe -UALIGNED_LOAD -UALIGNED_DATA
gcc snappy-rep.c -O3 -o snappy-rep-unaligned-load-aligned-data.exe -UALIGNED_LOAD -DALIGNED_DATA

benchmark snappy-rep-baseline-aligned.exe snappy-rep-baseline-unaligned.exe snappy-rep-unaligned-load-data.exe snappy-rep-aligned-load-aligned-data.exe snappy-rep-unaligned-load-aligned-data.exe snappy-rep-aligned-load-unaligned-data.exe

From this we can see that the vector code would have been faster than a byte
loop, but it doesn't beat the vectorized unaligned loop.  There are two
reasons for this.

First, we force an unroll due to the unfortunate side effect of us building an
SLP tree with the loop IVs:

note:   Final SLP tree for instance 0x1236bce0:
note:   node 0x1225bbb8 (max_nunits=4, refcnt=2) vector(4) int
note:   op template: i_12 = i_21 + 8;
note:     stmt 0 i_12 = i_21 + 8;
note:     children 0x1225bc50 0x1225bce8
note:   node 0x1225bc50 (max_nunits=4, refcnt=2) vector(4) int
note:   op template: i_21 = PHI <i_12(7), 0(6)>
note:     [l] stmt 0 i_21 = PHI <i_12(7), 0(6)>
note:     children (nil) (nil)
note:   node (constant) 0x1225bce8 (max_nunits=1, refcnt=1)
note:   { 8 }

and because its type is integer, this instance ends up picking a higher VF
than the other instances, e.g.:

note:   Final SLP tree for instance 0x1236a4b0:
note:   node 0x1225b9f0 (max_nunits=2, refcnt=2) vector(2) long unsigned int
note:   op template: b_14 = b_20 + 8;
note:     stmt 0 b_14 = b_20 + 8;
note:     children 0x1225ba88 0x1225bb20
note:   node 0x1225ba88 (max_nunits=2, refcnt=2) vector(2) long unsigned int
note:   op template: b_20 = PHI <b_14(7), b_9(D)(6)>
note:     [l] stmt 0 b_20 = PHI <b_14(7), b_9(D)(6)>
note:     children (nil) (nil)
note:   node (constant) 0x1225bb20 (max_nunits=1, refcnt=1)
note:   { 8 }

This means the DImode loop has to be unrolled once, and so we generate:

.L6:
        ldr     q31, [x5, x3]
        ldr     q25, [x4, x3]
        ldr     q30, [x1, x3]
        ldr     q26, [x0, x3]
        add     x3, x3, 32
        cmeq    v31.2d, v31.2d, v25.2d
        cmeq    v30.2d, v30.2d, v26.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x9, d31
        cbz     x9, .L4

This means there is both a load throughput bottleneck and an fmov bottleneck
here.  The cost model at the moment doesn't model throughput bottlenecks for
early break loops.
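Before going into the costing, for reference here is a minimal sketch of the
shape of loop being discussed.  This is NOT the exact snappy-rep.c behind the
godbolt links: the array size, the memcpy-based unaligned load and the
function name are my own guesses, and only the ALIGNED_DATA, ALIGNED_LOAD and
LONG_CTR knobs mirror the ones described above.

/* Rough sketch of the early-break comparison loop (assumptions as above).  */
#include <stdint.h>
#include <string.h>

#ifdef ALIGNED_DATA
#define DATA_ALIGN __attribute__ ((aligned (16)))
#else
#define DATA_ALIGN
#endif

static unsigned char a[4096] DATA_ALIGN;
static unsigned char b[4096] DATA_ALIGN;

static inline uint64_t
load_u64 (const unsigned char *p)
{
#ifdef ALIGNED_LOAD
  return *(const uint64_t *) p;   /* direct load, assumes the pointer is aligned */
#else
  uint64_t v;
  memcpy (&v, p, sizeof v);       /* unaligned-safe load via memcpy */
  return v;
#endif
}

int
first_mismatch (void)
{
#ifdef LONG_CTR
  long i;
#else
  int i;
#endif
  /* Early-break loop: compare 8 bytes at a time and bail out at the first
     differing chunk.  This is the shape the vectorizer turns into the
     Adv. SIMD sequences shown in this comment.  */
  for (i = 0; i + 8 <= (long) sizeof a; i += 8)
    if (load_u64 (a + i) != load_u64 (b + i))
      return (int) i;
  return -1;
}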
Coming back to the costing: because of this lack of throughput modelling,
even though we know the vector code has a load bottleneck, we still think we
can issue all the vector operations in just over 2 cycles:

note:  Original vector body cost = 34
note:  Scalar issue estimate:
note:    load operations = 2
note:    store operations = 0
note:    general operations = 4
note:    reduction latency = 0
note:    estimated min cycles per iteration = 1.000000
note:    estimated cycles per vector iteration (for VF 4) = 4.000000
note:  Vector issue estimate:
note:    load operations = 4
note:    store operations = 0
note:    general operations = 9
note:    reduction latency = 0
note:    estimated min cycles per iteration = 2.250000
note:  Cost model analysis:

That's why, even if we enter the vector loop, it ends up slower than the
uint64_t scalar code, which doesn't have a throughput bottleneck.

Unfortunately I don't know what I can do here for GCC 15.  We have plans to
significantly rework cost modelling next year to model throughput more
extensively, but that's not a stage-4 patch.  I could maybe increase the cost
of early break loops with unroll factors to reflect the increased bottleneck.
That could work and be theoretically sound; I'll give it a try.

But the question remains: can early break vectorized code ever beat such
hand-written, optimized scalar code?

The first thing is that we should remove the unroll.  This can be done by
using a long for the counter rather than an int.  In the vectorizer we should
also be able to use V2SI here instead of V4SI; that's another thing that
would be good to add.  It seems reasonable for early break loops to keep the
unroll factor low, because the early break stops the rest of the vector code
from running until you know whether you are continuing or not.

Using -UALIGNED_DATA -UALIGNED_LOAD -DLONG_CTR
(https://godbolt.org/z/v5qbPr18E) we get better code:

.L6:
        ldr     q31, [x1, x2]
        add     x3, x3, 1
        ldr     q26, [x0, x2]
        add     x2, x2, 16
        cmeq    v31.2d, v31.2d, v26.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x7, d31
        cbz     x7, .L4

This is unfortunately about even with the scalar code.
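To make the connection to the SLP dumps explicit, this variant is the same
early-break loop written with a 64-bit counter, which is what -DLONG_CTR
selects.  A sketch of that variant, reusing the hypothetical load_u64 helper
and arrays from the sketch earlier in this comment:

int
first_mismatch_long_ctr (void)
{
  /* 64-bit induction variable: its SLP node is vector(2) long, matching the
     V2DI compares, so the VF stays at 2 and the DImode compares are no
     longer unrolled.  An 'int i' here gives a vector(4) int node instead,
     which forces VF 4 and the extra unroll shown earlier.  */
  for (long i = 0; i + 8 <= (long) sizeof a; i += 8)
    if (load_u64 (a + i) != load_u64 (b + i))
      return (int) i;
  return -1;
}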
However, if we replace the Adv. SIMD sequence with an SVE compare, and make
sure we eliminate the ptest, we end up with:

.L6:
        ldr     q26, [x1, x2]
        add     x3, x3, 1
        ldr     q27, [x0, x2]
        cmpne   p15.d, p7/z, z26.d, z27.d
        b.none  .L4

which is indeed faster:

+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| name        | vs snappy-rep-baseline-unaligned3.exe | vs snappy-rep-aligned-load-aligned-data3.exe | vs snappy-rep-unaligned-load-aligned-data3.exe |
+=============+=======================================+==============================================+================================================+
| cortex-a510 | -78.57%                               | -85.64%                                      | -89.0%                                         |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-a520 | -70.23%                               | -79.01%                                      | -83.15%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-x925 | -69.58%                               | -77.72%                                      | -80.1%                                         |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| neoverse v3 | -69.26%                               | -76.33%                                      | -78.54%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-x3   | -69.21%                               | -72.9%                                       | -78.64%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-x4   | -68.96%                               | -76.69%                                      | -79.16%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| neoverse v2 | -68.52%                               | -72.84%                                      | -78.34%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-x2   | -66.6%                                | -70.16%                                      | -75.48%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| cortex-a710 | -66.23%                               | -69.19%                                      | -75.07%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| neoverse n2 | -65.67%                               | -68.77%                                      | -74.58%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| neoverse v1 | -60.97%                               | -63.47%                                      | -68.74%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+
| neoverse n3 | -52.01%                               | -57.57%                                      | -59.48%                                        |
+-------------+---------------------------------------+----------------------------------------------+------------------------------------------------+

The comparisons here are against the naive scalar byte loop, e.g.

  -O3 -fno-tree-vectorize -o snappy-rep-baseline-aligned.exe -DALIGNED_LOAD -DALIGNED_DATA

So here we see that vectorization can beat both the naive loop and the
current snappy code, by ~10% on average.

So how do we get there?  I have a patch to use the SVE compare for Adv. SIMD
for GCC 16.  I'll work on one limiting the unroll factor next.
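As a source-level illustration of the idea behind that sequence (this is not
the compiler's output and not the patch itself; the function name and
signature are mine, and it uses SVE predicated loads where the asm above
keeps the Adv. SIMD ldr q), the loop essentially becomes:

/* Sketch: compare two buffers two 64-bit lanes at a time and break on the
   first difference, using an SVE compare + branch instead of the Adv. SIMD
   cmeq/umaxp/fmov/cbz sequence.  Compile with SVE enabled.  */
#include <arm_sve.h>
#include <stdint.h>

long
first_diff_chunk (const uint64_t *a, const uint64_t *b, long n)
{
  /* Predicate covering exactly two 64-bit lanes, i.e. the 128 bits the
     Adv. SIMD loads produce, independent of the actual SVE vector length.  */
  svbool_t pg = svwhilelt_b64 ((uint64_t) 0, (uint64_t) 2);

  for (long i = 0; i + 2 <= n; i += 2)
    {
      svuint64_t va = svld1_u64 (pg, a + i);
      svuint64_t vb = svld1_u64 (pg, b + i);
      /* cmpne + b.none from the asm above: leave the loop as soon as any
         active lane differs.  */
      if (svptest_any (pg, svcmpne_u64 (pg, va, vb)))
        return i;
    }
  return -1;
}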
Then there is the fact that we don't enter the vector loop at all.  We can
fix that in two ways:

1. Implement peeling for alignment on mutually misaligned buffers, which
   would catch some cases like this one.
2. Implement first-faulting load support in the vectorizer; this loop is
   better done with SVE (see the sketch at the end of this comment).

So for now, I'll try to have the cost model reject such loops, where the
scalar code is hand-optimized and the vector code is unrolled.  The rest is
for GCC 16.
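For completeness, here is a minimal sketch of the first-faulting-load
mechanism from point 2, shown on a single-buffer strlen-style loop rather
than the two-buffer compare from this PR.  Everything here, including the
function name, is only an illustration of why first-faulting loads let a
vectorized early-break loop read ahead of the known-valid region safely; it
is not how the vectorizer would emit it.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

size_t
sve_strlen (const char *s)
{
  const uint8_t *p = (const uint8_t *) s;
  svbool_t all = svptrue_b8 ();
  size_t i = 0;

  for (;;)
    {
      svsetffr ();                                 /* allow all lanes to load */
      svuint8_t v = svldff1_u8 (all, p + i);       /* first-faulting load */
      svbool_t loaded = svrdffr ();                /* lanes that really loaded */
      svbool_t nul = svcmpeq_n_u8 (loaded, v, 0);  /* '\0' among valid lanes? */

      if (svptest_any (loaded, nul))
        /* Index of the first '\0' = number of valid lanes before it.  */
        return i + svcntp_b8 (loaded, svbrkb_z (loaded, nul));

      /* No terminator among the lanes that loaded; skip past them.  If the
         load stopped early at a page boundary, the next iteration retries
         from the faulting address, just like the scalar loop would.  */
      i += svcntp_b8 (all, loaded);
    }
}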