https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Bug ID: 80846
Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc's tune=generic strategy for horizontal sums at the end of an AVX2 auto-vectorized reduction is sub-optimal even for Intel CPUs, and horrible for CPUs that split 256b ops into 2 uops (i.e. AMD, including Ryzen).

The first step should always be vextracti/f128 to reduce down to xmm vectors: it has low latency and good throughput on all CPUs. Keep narrowing in half until you're down to one element. See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026

The one exception is a horizontal sum of 8-bit elements without overflow, where you should use VPSADBW ymm against a zeroed register to do a horizontal add without overflow, and then hsum the resulting four 64-bit values. (For signed 8-bit, you can range-shift to unsigned and then correct the scalar hsum result.)

----

gcc tends to keep working with ymm vectors until the last step, even using VPERM2I128, which is horrible on AMD CPUs (e.g. on Ryzen it's 8 uops with 3c throughput/latency, vs. VEXTRACTI128 being 1 uop with 1c latency and 0.33c throughput).

// https://godbolt.org/g/PwX6yT
int sumint(const int arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  int sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum+=arr[i];
  return sum;
}

Compiled with gcc 8.0.0 20170520 -mavx2 -funroll-loops -O3 -std=gnu11, we get:

        vpxor   %xmm7, %xmm7, %xmm7
        leaq    4096(%rdi), %rax
.L24:
        vpaddd  (%rdi), %ymm7, %ymm0
        addq    $256, %rdi                # doing this later would let more instructions use a disp8 instead of disp32
        vpaddd  -224(%rdi), %ymm0, %ymm1
        vpaddd  -192(%rdi), %ymm1, %ymm2
        vpaddd  -160(%rdi), %ymm2, %ymm3
        vpaddd  -128(%rdi), %ymm3, %ymm4
        vpaddd  -96(%rdi), %ymm4, %ymm5
        vpaddd  -64(%rdi), %ymm5, %ymm6
        vpaddd  -32(%rdi), %ymm6, %ymm7   # unrolling without multiple accumulators loses a lot of the benefit.
        cmpq    %rdi, %rax
        jne     .L24

        # our single accumulator is currently in ymm7
        vpxor      %xmm8, %xmm8, %xmm8           # Ryzen uops: 1   latency: x
        vperm2i128 $33, %ymm8, %ymm7, %ymm9      #             8            3
        vpaddd     %ymm7, %ymm9, %ymm10          #             2            1
        vperm2i128 $33, %ymm8, %ymm10, %ymm11    #             8            3
        vpalignr   $8, %ymm10, %ymm11, %ymm12    #             2            1
        vpaddd     %ymm12, %ymm10, %ymm13        #             2            1
        vperm2i128 $33, %ymm8, %ymm13, %ymm14    #             8            3
        vpalignr   $4, %ymm13, %ymm14, %ymm15    #             2            1
        vpaddd     %ymm15, %ymm13, %ymm0         #             2            1
        vmovd      %xmm0, %eax                   #             1            3
        vzeroupper
        ret

Using x/ymm8-15 as src1 needs a 3-byte VEX prefix instead of a 2-byte VEX, so the epilogue should reuse xmm0-6 to save code size. They're dead, and no x86 CPUs have write-after-write dependencies.

More importantly, the shuffle strategy is just bad: there should be only one shuffle between each VPADDD. I'd suggest

        vextracti128 $1, %ymm7, %xmm0
        vpaddd      %xmm7, %xmm0, %xmm0
        # Then a 128b hsum, which can use the same strategy as if we'd started with 128b
        vpunpckhqdq %xmm0, %xmm0, %xmm1   # Avoids an immediate, but without AVX use PSHUFD to copy+shuffle
        vpaddd      %xmm1, %xmm0, %xmm0
        vpshuflw    $0x4e, %xmm0, %xmm1   # or PSHUFD, or MOVSHDUP
        vpaddd      %xmm1, %xmm0, %xmm0
        vmovd       %xmm0, %eax

This is faster on Haswell by a significant margin, from avoiding the lane-crossing VPERM2I128 shuffles, and it's also smaller. All of these instructions are 1 uop / 1c latency on Ryzen (except movd), so this is 7 uops / 9c latency. GCC's current code is 36 uops, 17c on Ryzen. Things are similar on Bulldozer-family, where vector ops have at least 2c latency.
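For reference, here's the same narrow-first reduction written with intrinsics (just a sketch, not code from gcc; the helper name is made up). Recent compilers should turn it into something very close to the sequence above, with a PSHUFD for the last shuffle instead of the optional PSHUFLW/MOVSHDUP:

        #include <immintrin.h>

        // Sketch: AVX2 horizontal sum of 8x int32, narrowing to 128b first.
        static inline int hsum_epi32_avx2(__m256i v)
        {
            __m128i lo = _mm256_castsi256_si128(v);       // low half: no instruction needed
            __m128i hi = _mm256_extracti128_si256(v, 1);  // vextracti128: cheap on Intel and Ryzen
            __m128i s  = _mm_add_epi32(lo, hi);           // now a 128b problem
            s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));                      // fold high 64 into low 64
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // fold element 1 into element 0
            return _mm_cvtsi128_si32(s);                  // vmovd
        }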
An interesting alternative is possible for the last narrowing step with BMI2:

        vmovq   %xmm0, %rax
        rorx    $32, %rax, %rdx
        add     %edx, %eax

RORX can run on ports 0 and 6 on Intel CPUs that support it, and it's fast on Ryzen (1c latency, 0.25c throughput). If there are further vector instructions, this reduces pressure on the vector ALU ports. The only CPU where this is really good is Excavator (bdver4?), assuming vector shuffles and VPADDD are still 2c latency each while RORX and ADD are 1c. (Agner Fog's spreadsheet doesn't have an Excavator tab.)

I think it's a code-size win, since an integer add is only 2 bytes (or 3 bytes for r8d-r15d, which need a REX prefix), still smaller than a vector instruction. OTOH, RORX needs an immediate and a 3-byte VEX, so it's 1 byte bigger than VPSHUFD (unless the PSHUFD used xmm8-15 and needs a 3-byte VEX).

Without BMI2, on CPUs with efficient SHLD, you could use SHLD $32, %rax, %rdx. But that has a false dependency on %rdx, so we'd have to pick a register that's known to be ready. The pointer register from the loop would work, since the other inputs to SHLD depend on loads using that pointer. (Inserting an xor-zeroing to break the dep would lose the tiny advantage this has.) But SHLD is slow on Bulldozer/Zen, so it's probably a bad idea: -mtune=generic should avoid making code that really sucks on other CPUs when it's only a tiny gain on the tune target.

---

2x PHADDD would be even smaller code-size, but that's its only advantage. It's 3 or 4 uops on all CPUs, and has worse latency than separate shuffle and add instructions.

The same argument applies to FP horizontal sums: HADDPS isn't usually a good idea either, except with -Os. Unfortunately gcc uses it for FP hsums even without -Os. HADDPS ymm is 8 uops on Ryzen, with 7c latency. Agner Fog says "mixed domain", i.e. it wants its input in the int domain (for the shuffles) but produces a result in the FP domain, probably with a bypass delay internally.

---

PSHUFLW is faster than PSHUFD on some old CPUs (like K8 and Merom/Conroe). That's irrelevant for AVX, but it's not slower or lower throughput on any recent CPUs, so there's no harm in using the same strategy for SSE2 and AVX.

Also, all current CPUs that support AVX have no penalty for using VMOVSHDUP between VPADDD instructions, so it's a handy way to save the immediate byte vs. VPSHUFD or VPSRLDQ. It's a poor choice without AVX for -mtune=generic, because it has a penalty on Nehalem. It may not be worth adding a bunch of logic to decide when to save 1 byte this way, since that's all it does except on old SlowShuffle CPUs like Core 2 (Merom), where MOVSHDUP is faster than PSHUFD.

-----

The -msse4 strategy is also sub-optimal: it could avoid the movdqa copies by using the same copy-and-shuffle instructions as AVX.

        # from gcc8 with -msse4
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm2
        psrldq  $4, %xmm2
        paddd   %xmm2, %xmm0
        movd    %xmm0, %eax

When tuning for targets that don't have extra latency for FP shuffles between ivec instructions, instructions like MOVSHDUP and MOVHLPS save code size vs. PSHUFD (no immediate, and PS instructions have fewer prefixes). According to Agner Fog, Ryzen runs MOVHLPS and many other FP shuffles (including SHUFPS) in the ivec domain. (unpckh/lps/d is FP domain, though, so it should be preferred for FP reductions.)
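As a sketch of that copy-and-shuffle version (my code, not gcc's; the helper name is made up), the following should compile with plain -msse2 to two PSHUFDs, two PADDDs and a MOVD, with no MOVDQA copies:

        #include <immintrin.h>

        // Sketch: 128b hsum of 4x int32 using PSHUFD as a copy-and-shuffle,
        // avoiding the movdqa + psrldq pairs gcc currently emits.
        static inline int hsum_epi32_sse2(__m128i v)
        {
            __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)); // copy with high qword moved down
            v = _mm_add_epi32(v, hi64);
            __m128i hi32 = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)); // copy with dwords 0 and 1 swapped
            v = _mm_add_epi32(v, hi32);
            return _mm_cvtsi128_si32(v);                                  // movd
        }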
------

Related: -mtune=znver1 should probably vectorize with YMM vectors when it can be done as efficiently as with XMM vectors. As I understand it, Ryzen's maximum sustained throughput is 6 uops per clock for double-uop instructions, but only 5 uops per clock when running single-uop instructions. (And I think this is true even when running from the uop cache.)

Also, with gcc's tendency to not unroll and use only a single vector accumulator, using a wider vector is like unrolling with 2 accumulators. You may still bottleneck on ADDPS latency, for example, but you get twice as much work done per clock.
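To make the accumulator point concrete, here's the same reduction written by hand with two YMM accumulators (a sketch, not something gcc currently generates; the function name is made up). The two add dependency chains are independent, which is what buys the extra work per clock, especially for FP reductions where the add latency is higher:

        #include <immintrin.h>

        // Sketch: the sumint() reduction with two independent accumulators,
        // combined once at the end and then narrowed to 128b for the hsum.
        int sumint_2acc(const int *arr)
        {
            arr = __builtin_assume_aligned(arr, 64);
            __m256i acc0 = _mm256_setzero_si256();
            __m256i acc1 = _mm256_setzero_si256();
            for (int i = 0; i < 1024; i += 16) {
                acc0 = _mm256_add_epi32(acc0, _mm256_load_si256((const __m256i *)(arr + i)));
                acc1 = _mm256_add_epi32(acc1, _mm256_load_si256((const __m256i *)(arr + i + 8)));
            }
            __m256i acc = _mm256_add_epi32(acc0, acc1);   // combine the two dep chains once
            __m128i s   = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                        _mm256_extracti128_si256(acc, 1));
            s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
            return _mm_cvtsi128_si32(s);
        }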