https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #1)
> Interesting idea.  It's probably a bit hard to make the vectorizer do this
> though given its current structure and the fact that it would have to
> cost the extra ops against the saved shuffling (the extra cost of the ops
> would depend on availability of spare execution resources for example).

Shuffles take execution resources just like anything else (except for
load+broadcast (vbroadcastss or movddup, or even movshdup), which is done as
part of a load uop on Ryzen and Intel CPUs).  On Haswell and later Intel,
there's only one shuffle port, but multiple ports for almost everything else
(especially on Skylake).  Shuffles are very often the bottleneck if you're not
careful, or shuffle + something else that also only runs on port 5 (Intel).

Shuffle instructions also of course take front-end throughput resources, and
that's another common bottleneck in code with a mix of ops for different ports
(e.g. loads, stores, ALU, integer loop overhead).  Without loop unrolling,
front-end or memory bandwidth are the most likely bottlenecks for
load+ALU+store loops.  (Latency is often a bottleneck for reduction loops,
because gcc is bad at unrolling with multiple accumulators, last I looked.)

A cost model that includes the front-end, and the limited throughput of
shuffles, would be a *very* good thing to have, especially when using shuffles
that take multiple instructions because of limited flexibility in the
instruction set.

> In general the interleaving strategy looks like a bit of a dead end with
> AVX.

And in bug 82136 you said:
> (I guess AVX256 doesn't have full width extract even/odd
> and interleave high/low ...)

That's correct.  AVX1 / AVX2 have very few 2-input lane-crossing shuffles.
(Most of the 2-input shuffles work in two separate 128b lanes, like 2 separate
128b instructions glued together.)  I think the only ones that exist are
vperm2i128 / vperm2f128, which shuffle with 128b granularity.
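To illustrate what that limitation costs, here's a minimal intrinsics sketch
(the function and array names are placeholders, not the testcase from this
bug) of a full-width extract even/odd of doubles with AVX1/AVX2 (compile with
-mavx).  The in-lane unpacks have to be combined with vperm2f128, so it takes
2 shuffles per de-interleaved vector:

#include <immintrin.h>

/* Hypothetical example: de-interleave 8 doubles into even/odd halves.
   There is no single 2-input lane-crossing shuffle, so each result vector
   costs a vperm2f128 plus an in-lane unpcklpd/unpckhpd. */
void deinterleave_avx(double *dst_even, double *dst_odd, const double *src)
{
    __m256d a = _mm256_loadu_pd(src);        /* a0 a1 a2 a3 */
    __m256d b = _mm256_loadu_pd(src + 4);    /* b0 b1 b2 b3 */

    __m256d lo = _mm256_permute2f128_pd(a, b, 0x20);  /* a0 a1 b0 b1 */
    __m256d hi = _mm256_permute2f128_pd(a, b, 0x31);  /* a2 a3 b2 b3 */

    __m256d even = _mm256_unpacklo_pd(lo, hi);  /* a0 a2 b0 b2 */
    __m256d odd  = _mm256_unpackhi_pd(lo, hi);  /* a1 a3 b1 b3 */

    _mm256_storeu_pd(dst_even, even);
    _mm256_storeu_pd(dst_odd,  odd);
}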
> (and possibly worse on AVX512).

Actually it's much better on AVX512.  It mostly drops the in-lane limitations
of AVX1/AVX2.  It has powerful 2-input lane-crossing shuffles like vpermt2d /
vpermt2ps or vpermt2q / vpermt2pd that can do the shuffles gcc wants with one
instruction each, using a shuffle control constant in another vector.  Clang
and gcc both use them with `-march=skylake-avx512`:
https://godbolt.org/g/SzwxNA.  (Actually, gcc uses vpermi2pd instead of
vpermt2pd, costing it an extra 2 vmovapd instead of clobbering the load result
or add results with the 2nd shuffle.  Clang gets it right; see the clang tab
on that godbolt link.)

It's back down to 4 shuffles per 2 loads / 2 stores / add+mul, same as in the
128b case with unpcklpd/unpckhpd.  I.e. 2 shuffles per vector of results, but
it saves a mul+add vs. the shuffle/blend AVX1 strategy.  It also "tricks" gcc
into unrolling and doing 2 vectors per loop iteration, reducing loop overhead
for the non-PGO default of not unrolling.

KNL has lower throughput for vpermt2pd (one per 2c vs. 1c for vpermilps), but
it's still only 1 uop.  Since Knight's Landing bottlenecks almost completely
on front-end decode (2 instructions per clock), it takes about 8 clocks to
issue the loop, which matches the time for 4 two-input shuffles.  If
out-of-order execution + hyperthreading can hide the latency, it should be
fine.

Skylake-avx512 (aka SKX) doesn't care whether a shuffle has 1, 2, or 3 inputs
(same throughput for vpermilpd vs. vpermt2pd:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt).
Lane-crossing shuffles have higher latency, but the shuffle isn't part of a
loop-carried dependency.

SKX shuts down the vector ALUs on port 1 while 512b uops are in flight, so the
shuffle+blend strategy with duplicated work would probably bottleneck on ALU
throughput and still only manage one vector per 2 clocks.  Clang's output
(what gcc should emit for its current strategy) is 15 front-end uops, or 3.75
cycles for the front-end, so gcc will bottleneck just barely on shuffle
throughput, assuming vmulpd and vaddpd never get scheduled onto port 5 and
steal a cycle from a shuffle.  (That should be rare, because there's much less
pressure on port 0.)
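For comparison with the AVX sketch above, here's the same de-interleave using
vpermt2pd via the _mm512_permutex2var_pd intrinsic (again, the names are
placeholders, not this bug's testcase; compile with -mavx512f).  Each result
vector takes just one 2-input lane-crossing shuffle:

#include <immintrin.h>

/* Hypothetical example: de-interleave 16 doubles into even/odd halves with
   one vpermt2pd per result vector, using an index constant in another
   vector.  Index values 0..7 select from 'a', 8..15 select from 'b'. */
void deinterleave_avx512(double *dst_even, double *dst_odd, const double *src)
{
    __m512d a = _mm512_loadu_pd(src);        /* src[0..7]  */
    __m512d b = _mm512_loadu_pd(src + 8);    /* src[8..15] */

    const __m512i even_idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);
    const __m512i odd_idx  = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);

    __m512d even = _mm512_permutex2var_pd(a, even_idx, b);  /* vpermt2pd */
    __m512d odd  = _mm512_permutex2var_pd(a, odd_idx,  b);  /* vpermt2pd */

    _mm512_storeu_pd(dst_even, even);
    _mm512_storeu_pd(dst_odd,  odd);
}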