https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #1)
> Interesting idea.  It's probably a bit hard to make the vectorizer do this
> though given its current structure and the fact that it would have to
> cost the extra ops against the saved shuffling (the extra cost of the ops
> would depend on availability of spare execution resources for example).

Shuffles take execution resources just like anything else (except for
load+broadcast (vbroadcastss or movddup, or even movshdup), which is done as
part of a load uop on Ryzen and Intel CPUs).  On Haswell and later Intel,
there's only one shuffle port, but multiple ports for almost everything else
(especially on Skylake).  Shuffles are very often the bottleneck if you're not
careful, or shuffle + something else that also only runs on port 5 (Intel).

Shuffle instructions also of course take front-end throughput resources, and
that's another common bottleneck in code with a mix of ops for different ports
(e.g. loads, stores, ALU, integer loop overhead).  Without loop unrolling,
front-end or memory bandwidth are the most likely bottlenecks for
load+ALU+store loops.  (Latency is often a bottleneck for reduction loops,
because gcc is bad at unrolling with multiple accumulators, last I looked.)

A cost model that includes the front-end, and the limited throughput of
shuffles, would be a *very* good thing to have, especially when using shuffles
that take multiple instructions because of limited flexibility in the
instruction set.

> In general the interleaving strategy looks like a bit of a dead end with
> AVX.

And in bug 82136 you said:
> (I guess AVX256 doesn't have full width extract even/odd
> and interleave high/low ...)

That's correct.  AVX1 / AVX2 have very few 2-input lane-crossing shuffles.
(Most of the 2-input shuffles work in two separate 128b lanes, like 2 separate
128b instructions glued together.)  I think the only ones that exist are
vperm2i128 / vperm2f128, which shuffle with 128b granularity.
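To illustrate what that limitation costs, here's a minimal intrinsics sketch
(the function and array names are placeholders, not the testcase from this
bug) of a full-width extract even/odd of doubles with AVX1/AVX2 (compile with
-mavx).  The in-lane unpacks have to be combined with vperm2f128, so it takes
2 shuffles per de-interleaved vector:

#include <immintrin.h>

/* Hypothetical example: de-interleave 8 doubles into even/odd halves.
   There is no single 2-input lane-crossing shuffle, so each result vector
   costs a vperm2f128 plus an in-lane unpcklpd/unpckhpd. */
void deinterleave_avx(double *dst_even, double *dst_odd, const double *src)
{
    __m256d a = _mm256_loadu_pd(src);        /* a0 a1 a2 a3 */
    __m256d b = _mm256_loadu_pd(src + 4);    /* b0 b1 b2 b3 */

    __m256d lo = _mm256_permute2f128_pd(a, b, 0x20);  /* a0 a1 b0 b1 */
    __m256d hi = _mm256_permute2f128_pd(a, b, 0x31);  /* a2 a3 b2 b3 */

    __m256d even = _mm256_unpacklo_pd(lo, hi);  /* a0 a2 b0 b2 */
    __m256d odd  = _mm256_unpackhi_pd(lo, hi);  /* a1 a3 b1 b3 */

    _mm256_storeu_pd(dst_even, even);
    _mm256_storeu_pd(dst_odd,  odd);
}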
> (and possibly worse on AVX512).

Actually it's much better on AVX512.  It mostly drops the in-lane limitations
of AVX1/AVX2.  It has powerful 2-input lane-crossing shuffles like vpermt2d /
vpermt2ps or vpermt2q / vpermt2pd that can do the shuffles gcc wants with one
instruction each, using a shuffle control constant in another vector.  Clang
and gcc both use them with `-march=skylake-avx512`:
https://godbolt.org/g/SzwxNA.  (Actually, gcc uses vpermi2pd instead of
vpermt2pd, costing it an extra 2 vmovapd instead of clobbering the load result
or add results with the 2nd shuffle.  Clang gets it right; see the clang tab
on that godbolt link.)

It's back down to 4 shuffles per 2 loads / 2 stores / add+mul, same as in the
128b case with unpcklpd/unpckhpd.  I.e. 2 shuffles per vector of results, but
it saves a mul+add vs. the shuffle/blend AVX1 strategy.  It also "tricks" gcc
into unrolling and doing 2 vectors per loop iteration, reducing loop overhead
for the non-PGO default of not unrolling.

KNL has lower throughput for vpermt2pd (one per 2c vs. 1c for vpermilps), but
it's still only 1 uop.  Since Knight's Landing bottlenecks almost completely
on front-end decode (2 instructions per clock), it takes about 8 clocks to
issue the loop, which matches the time for 4 two-input shuffles.  If
out-of-order execution + hyperthreading can hide the latency, it should be
fine.

Skylake-avx512 (aka SKX) doesn't care whether a shuffle has 1, 2, or 3 inputs
(same throughput for vpermilpd vs. vpermt2pd:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt).
Lane-crossing shuffles have higher latency, but the shuffle isn't part of a
loop-carried dependency.

SKX shuts down the vector ALUs on port 1 while 512b uops are in flight, so the
shuffle+blend strategy with duplicated work would probably bottleneck on ALU
throughput and still only manage one vector per 2 clocks.  Clang's output
(what gcc should emit for its current strategy) is 15 front-end uops, or 3.75
cycles for the front-end, so gcc will bottleneck just barely on shuffle
throughput, assuming vmulpd and vaddpd never get scheduled onto port 5 and
steal a cycle from a shuffle.  (That should be rare, because there's much less
pressure on port 0.)
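For comparison with the AVX sketch above, here's the same de-interleave using
vpermt2pd via the _mm512_permutex2var_pd intrinsic (again, the names are
placeholders, not this bug's testcase; compile with -mavx512f).  Each result
vector takes just one 2-input lane-crossing shuffle:

#include <immintrin.h>

/* Hypothetical example: de-interleave 16 doubles into even/odd halves with
   one vpermt2pd per result vector, using an index constant in another
   vector.  Index values 0..7 select from 'a', 8..15 select from 'b'. */
void deinterleave_avx512(double *dst_even, double *dst_odd, const double *src)
{
    __m512d a = _mm512_loadu_pd(src);        /* src[0..7]  */
    __m512d b = _mm512_loadu_pd(src + 8);    /* src[8..15] */

    const __m512i even_idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);
    const __m512i odd_idx  = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);

    __m512d even = _mm512_permutex2var_pd(a, even_idx, b);  /* vpermt2pd */
    __m512d odd  = _mm512_permutex2var_pd(a, odd_idx,  b);  /* vpermt2pd */

    _mm512_storeu_pd(dst_even, even);
    _mm512_storeu_pd(dst_odd,  odd);
}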