https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Bug ID: 80846
Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc's tune=generic strategy for horizontal sums at the end of an AVX2 auto-vectorized reduction is sub-optimal even for Intel CPUs, and horrible for CPUs that split 256b ops into 2 uops (i.e. AMD, including Ryzen).

The first step should always be vextracti/f128 to reduce down to xmm vectors: it has low latency and good throughput on all CPUs. Keep narrowing in half until you're down to one element. See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026

The one exception is a horizontal sum of 8-bit elements without overflow, where you should use VPSADBW ymm against a zeroed register to do a horizontal add without overflow, and then hsum the resulting four 64-bit values. (For signed 8-bit, you can range-shift to unsigned and then correct the scalar hsum result.)

----

gcc tends to keep working with ymm vectors until the last step, even using VPERM2I128, which is horrible on AMD CPUs (e.g. on Ryzen it's 8 uops with 3c throughput/latency, vs. VEXTRACTI128 being 1 uop with 1c latency and 0.33c throughput).

// https://godbolt.org/g/PwX6yT
int sumint(const int arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  int sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum+=arr[i];
  return sum;
}

Compiled with gcc 8.0.0 20170520 -mavx2 -funroll-loops -O3 -std=gnu11, we get:

        vpxor   %xmm7, %xmm7, %xmm7
        leaq    4096(%rdi), %rax
.L24:
        vpaddd  (%rdi), %ymm7, %ymm0
        addq    $256, %rdi                # doing this later would let more instructions use a disp8 instead of disp32
        vpaddd  -224(%rdi), %ymm0, %ymm1
        vpaddd  -192(%rdi), %ymm1, %ymm2
        vpaddd  -160(%rdi), %ymm2, %ymm3
        vpaddd  -128(%rdi), %ymm3, %ymm4
        vpaddd  -96(%rdi), %ymm4, %ymm5
        vpaddd  -64(%rdi), %ymm5, %ymm6
        vpaddd  -32(%rdi), %ymm6, %ymm7   # unrolling without multiple accumulators loses a lot of the benefit.
        cmpq    %rdi, %rax
        jne     .L24

        # our single accumulator is currently in ymm7
        vpxor      %xmm8, %xmm8, %xmm8           # Ryzen uops: 1   latency: x
        vperm2i128 $33, %ymm8, %ymm7, %ymm9      #             8            3
        vpaddd     %ymm7, %ymm9, %ymm10          #             2            1
        vperm2i128 $33, %ymm8, %ymm10, %ymm11    #             8            3
        vpalignr   $8, %ymm10, %ymm11, %ymm12    #             2            1
        vpaddd     %ymm12, %ymm10, %ymm13        #             2            1
        vperm2i128 $33, %ymm8, %ymm13, %ymm14    #             8            3
        vpalignr   $4, %ymm13, %ymm14, %ymm15    #             2            1
        vpaddd     %ymm15, %ymm13, %ymm0         #             2            1
        vmovd      %xmm0, %eax                   #             1            3
        vzeroupper
        ret

Using x/ymm8-15 as src1 needs a 3-byte VEX prefix instead of a 2-byte VEX, so the epilogue should reuse xmm0-6 to save code size. They're dead, and no x86 CPUs have write-after-write dependencies.

More importantly, the shuffle strategy is just bad: there should be only one shuffle between each VPADDD. I'd suggest

        vextracti128 $1, %ymm7, %xmm0
        vpaddd      %xmm7, %xmm0, %xmm0
        # Then a 128b hsum, which can use the same strategy as if we'd started with 128b
        vpunpckhqdq %xmm0, %xmm0, %xmm1   # Avoids an immediate, but without AVX use PSHUFD to copy+shuffle
        vpaddd      %xmm1, %xmm0, %xmm0
        vpshuflw    $0x4e, %xmm0, %xmm1   # or PSHUFD, or MOVSHDUP
        vpaddd      %xmm1, %xmm0, %xmm0
        vmovd       %xmm0, %eax

This is faster on Haswell by a significant margin, from avoiding the lane-crossing VPERM2I128 shuffles, and it's also smaller. All of these instructions are 1 uop / 1c latency on Ryzen (except movd), so this is 7 uops / 9c latency. GCC's current code is 36 uops, 17c on Ryzen. Things are similar on Bulldozer-family, where vector ops have at least 2c latency.
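For reference, here's the same narrow-first reduction written with intrinsics (just a sketch, not code from gcc; the helper name is made up). Recent compilers should turn it into something very close to the sequence above, with a PSHUFD for the last shuffle instead of the optional PSHUFLW/MOVSHDUP:

        #include <immintrin.h>

        // Sketch: AVX2 horizontal sum of 8x int32, narrowing to 128b first.
        static inline int hsum_epi32_avx2(__m256i v)
        {
            __m128i lo = _mm256_castsi256_si128(v);       // low half: no instruction needed
            __m128i hi = _mm256_extracti128_si256(v, 1);  // vextracti128: cheap on Intel and Ryzen
            __m128i s  = _mm_add_epi32(lo, hi);           // now a 128b problem
            s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));                      // fold high 64 into low 64
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // fold element 1 into element 0
            return _mm_cvtsi128_si32(s);                  // vmovd
        }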
An interesting alternative is possible for the last narrowing step with BMI2:

        vmovq   %xmm0, %rax
        rorx    $32, %rax, %rdx
        add     %edx, %eax

RORX can run on ports 0 and 6 on Intel CPUs that support it, and it's fast on Ryzen (1c latency, 0.25c throughput). If there are further vector instructions, this reduces pressure on the vector ALU ports. The only CPU where this is really good is Excavator (bdver4?), assuming vector shuffles and VPADDD are still 2c latency each while RORX and ADD are 1c. (Agner Fog's spreadsheet doesn't have an Excavator tab.)

I think it's a code-size win, since an integer add is only 2 bytes (or 3 bytes for r8d-r15d, which need a REX prefix), still smaller than a vector instruction. OTOH, RORX needs an immediate and a 3-byte VEX, so it's 1 byte bigger than VPSHUFD (unless the PSHUFD used xmm8-15 and needs a 3-byte VEX).

Without BMI2, on CPUs with efficient SHLD, you could use SHLD $32, %rax, %rdx. But that has a false dependency on %rdx, so we'd have to pick a register that's known to be ready. The pointer register from the loop would work, since the other inputs to SHLD depend on loads using that pointer. (Inserting an xor-zeroing to break the dep would lose the tiny advantage this has.) But SHLD is slow on Bulldozer/Zen, so it's probably a bad idea: -mtune=generic should avoid making code that really sucks on other CPUs when it's only a tiny gain on the tune target.

---

2x PHADDD would be even smaller code-size, but that's its only advantage. It's 3 or 4 uops on all CPUs, and has worse latency than separate shuffle and add instructions.

The same argument applies to FP horizontal sums: HADDPS isn't usually a good idea either, except with -Os. Unfortunately gcc uses it for FP hsums even without -Os. HADDPS ymm is 8 uops on Ryzen, with 7c latency. Agner Fog says "mixed domain", i.e. it wants its input in the int domain (for the shuffles) but produces a result in the FP domain, probably with a bypass delay internally.

---

PSHUFLW is faster than PSHUFD on some old CPUs (like K8 and Merom/Conroe). That's irrelevant for AVX, but it's not slower or lower throughput on any recent CPUs, so there's no harm in using the same strategy for SSE2 and AVX.

Also, all current CPUs that support AVX have no penalty for using VMOVSHDUP between VPADDD instructions, so it's a handy way to save the immediate byte vs. VPSHUFD or VPSRLDQ. It's a poor choice without AVX for -mtune=generic, because it has a penalty on Nehalem. It may not be worth adding a bunch of logic to decide when to save 1 byte this way, since that's all it does except on old SlowShuffle CPUs like Core 2 (Merom), where MOVSHDUP is faster than PSHUFD.

-----

The -msse4 strategy is also sub-optimal: it could avoid the movdqa copies by using the same copy-and-shuffle instructions as AVX.

        # from gcc8 with -msse4
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm2
        psrldq  $4, %xmm2
        paddd   %xmm2, %xmm0
        movd    %xmm0, %eax

When tuning for targets that don't have extra latency for FP shuffles between ivec instructions, instructions like MOVSHDUP and MOVHLPS save code size vs. PSHUFD (no immediate, and PS instructions have fewer prefixes). According to Agner Fog, Ryzen runs MOVHLPS and many other FP shuffles (including SHUFPS) in the ivec domain. (unpckh/lps/d is FP domain, though, so it should be preferred for FP reductions.)
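As a sketch of that copy-and-shuffle version (my code, not gcc's; the helper name is made up), the following should compile with plain -msse2 to two PSHUFDs, two PADDDs and a MOVD, with no MOVDQA copies:

        #include <immintrin.h>

        // Sketch: 128b hsum of 4x int32 using PSHUFD as a copy-and-shuffle,
        // avoiding the movdqa + psrldq pairs gcc currently emits.
        static inline int hsum_epi32_sse2(__m128i v)
        {
            __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)); // copy with high qword moved down
            v = _mm_add_epi32(v, hi64);
            __m128i hi32 = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)); // copy with dwords 0 and 1 swapped
            v = _mm_add_epi32(v, hi32);
            return _mm_cvtsi128_si32(v);                                  // movd
        }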
------

Related: -mtune=znver1 should probably vectorize with YMM vectors when it can be done as efficiently as with XMM vectors. As I understand it, Ryzen's maximum sustained throughput is 6 uops per clock for double-uop instructions, but only 5 uops per clock when running single-uop instructions. (And I think this is true even when running from the uop cache.)

Also, with gcc's tendency to not unroll and use only a single vector accumulator, using a wider vector is like unrolling with 2 accumulators. You may still bottleneck on ADDPS latency, for example, but you get twice as much work done per clock.
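To make the accumulator point concrete, here's the same reduction written by hand with two YMM accumulators (a sketch, not something gcc currently generates; the function name is made up). The two add dependency chains are independent, which is what buys the extra work per clock, especially for FP reductions where the add latency is higher:

        #include <immintrin.h>

        // Sketch: the sumint() reduction with two independent accumulators,
        // combined once at the end and then narrowed to 128b for the hsum.
        int sumint_2acc(const int *arr)
        {
            arr = __builtin_assume_aligned(arr, 64);
            __m256i acc0 = _mm256_setzero_si256();
            __m256i acc1 = _mm256_setzero_si256();
            for (int i = 0; i < 1024; i += 16) {
                acc0 = _mm256_add_epi32(acc0, _mm256_load_si256((const __m256i *)(arr + i)));
                acc1 = _mm256_add_epi32(acc1, _mm256_load_si256((const __m256i *)(arr + i + 8)));
            }
            __m256i acc = _mm256_add_epi32(acc0, acc1);   // combine the two dep chains once
            __m128i s   = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                        _mm256_extracti128_si256(acc, 1));
            s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
            return _mm_cvtsi128_si32(s);
        }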