http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52607
--- Comment #29 from Marc Glisse <marc.glisse at normalesup dot org> 2012-04-11 20:35:00 UTC ---
Created attachment 27136
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27136
V4DF generic shuffle

A patch (independent from the others) implementing what is explained in the
last 2 comments. It is simple and works really well: all V4DF shuffles (even
with 2 vectors) take only 3 insn (and often just 2). It only requires AVX, but
it also improves a lot on the current AVX2 code, which casts to vectors of
integers and uses up to 9 insn (although my "default case" patch also goes
down to 3 insn on AVX2).

The drawback is that it is limited to V4DF. vshufps is a different enough
beast from vshufpd that it would require different code, which wouldn't even
apply that often. For V8SF, my "default case" patch seems more interesting.
Integer vectors have different instructions again...

By the way, I tested all V4DF permutations (there are only 2^12 of them) in
the simulator. I also have a file (400K) with the code for each permutation,
which looks like the following:

0,0,0,0
    vpermilpd   $0, %ymm0, %ymm0
    vperm2f128  $0, %ymm0, %ymm0, %ymm0
[...]
1,7,6,3
    vperm2f128  $48, %ymm1, %ymm0, %ymm2
    vperm2f128  $19, %ymm1, %ymm0, %ymm0
    vshufpd     $11, %ymm0, %ymm2, %ymm0
1,7,6,4
    vperm2f128  $48, %ymm1, %ymm0, %ymm0
    vperm2f128  $33, %ymm1, %ymm1, %ymm1
    vshufpd     $3, %ymm1, %ymm0, %ymm0
[...]

If anyone wants to take a look, tell me and I'll attach it.
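
For anyone who wants to check the quoted sequences by hand, here is a minimal Python sketch. It is not the simulator mentioned above and not part of the patch. It models the three instructions on lists of element indices, assuming the usual AVX semantics: ymm0 holds elements 0..3, ymm1 holds elements 4..7, and operands are in AT&T order as in the listing. The function names and the use of `None` for a zeroed lane are hypothetical choices made for illustration.

```python
from itertools import product

def vperm2f128(imm, src2, src1):
    """Pick one 128-bit half (2 doubles) for each half of the result."""
    def pick(sel):
        if sel & 8:                      # zeroing bit (imm bit 3 or 7)
            return [None, None]
        src = src2 if sel & 2 else src1  # bit 1 picks the source register
        half = (sel & 1) * 2             # bit 0 picks the low or high half
        return src[half:half + 2]
    return pick(imm & 0xF) + pick((imm >> 4) & 0xF)

def vshufpd(imm, src2, src1):
    """Even result slots come from src1, odd ones from src2, within each lane."""
    return [src1[(imm >> 0) & 1],
            src2[(imm >> 1) & 1],
            src1[2 + ((imm >> 2) & 1)],
            src2[2 + ((imm >> 3) & 1)]]

def vpermilpd(imm, src):
    """In-lane permute of a single source, one immediate bit per result slot."""
    return [src[(imm >> 0) & 1],
            src[(imm >> 1) & 1],
            src[2 + ((imm >> 2) & 1)],
            src[2 + ((imm >> 3) & 1)]]

def example_0000(ymm0, ymm1):
    ymm0 = vpermilpd(0, ymm0)
    return vperm2f128(0, ymm0, ymm0)

def example_1763(ymm0, ymm1):
    ymm2 = vperm2f128(48, ymm1, ymm0)
    ymm0 = vperm2f128(19, ymm1, ymm0)
    return vshufpd(11, ymm0, ymm2)

def example_1764(ymm0, ymm1):
    ymm0 = vperm2f128(48, ymm1, ymm0)
    ymm1 = vperm2f128(33, ymm1, ymm1)
    return vshufpd(3, ymm1, ymm0)

def count_v4df_selectors():
    # Each of the 4 result slots picks one of 8 input elements: 8^4 = 2^12.
    return sum(1 for _ in product(range(8), repeat=4))

if __name__ == "__main__":
    a, b = [0, 1, 2, 3], [4, 5, 6, 7]
    for name, fn in [("0,0,0,0", example_0000),
                     ("1,7,6,3", example_1763),
                     ("1,7,6,4", example_1764)]:
        print(name, "->", fn(a, b))
    print("selectors:", count_v4df_selectors())
```

Under these assumptions each of the three listed sequences reproduces the selector printed above it, and the selector count comes out to 4096, matching the "2^12" figure.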