https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709
Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-02-24
     Ever confirmed|0                           |1

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #2)
> In reload, subregs are extracted via the stack, whereas the low subreg
> should already be available (NOP) and the high one can be extracted by a
> single insn. That's probably the first thing to investigate. (-mtune doesn't
> change what happens)

To concentrate on this, with -O3 -mavx:

typedef long int v4i __attribute__((vector_size (32)));
v4i foo(v4i a, v4i b) { return a+b; }

	vmovdqa	%ymm0, -80(%rbp)
	vmovdqa	%ymm1, -112(%rbp)
	vmovdqa	-80(%rbp), %xmm4
	vmovdqa	-64(%rbp), %xmm6
	vpaddq	-112(%rbp), %xmm4, %xmm3
	vpaddq	-96(%rbp), %xmm6, %xmm5
	vmovaps	%xmm3, -48(%rbp)
	vmovaps	%xmm5, -32(%rbp)
	vmovdqa	-48(%rbp), %ymm0

(plus overhead to align the stack, etc) compared to clang's

	vextractf128	$1, %ymm0, %xmm2
	vextractf128	$1, %ymm1, %xmm3
	vpaddq	%xmm2, %xmm3, %xmm2
	vpaddq	%xmm0, %xmm1, %xmm0
	vinsertf128	$1, %xmm2, %ymm0, %ymm0
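
For reference, a source-level sketch of the split both compilers perform: the 256-bit add is done as two 128-bit adds on the low and high halves, and the result is reassembled. This is only an illustration, not what GCC does internally. The v2i typedef and the add_by_halves helper are hypothetical names that do not appear in the report. It uses long long instead of long int so the elements are 64 bits on any target, and takes pointers so it builds without -mavx. Whether a given compiler turns the memcpy calls into vextractf128/vinsertf128 rather than stack traffic is not guaranteed.

```c
#include <string.h>

/* 256-bit vector of four 64-bit integers, as in the test case. */
typedef long long v4i __attribute__((vector_size (32)));
/* Hypothetical 128-bit half: two 64-bit integers. */
typedef long long v2i __attribute__((vector_size (16)));

/* Hypothetical helper: compute *out = *a + *b one 128-bit half at a time. */
static void add_by_halves(const v4i *a, const v4i *b, v4i *out)
{
    v2i a_lo, a_hi, b_lo, b_hi, r_lo, r_hi;

    /* Low half: the first 16 bytes (ideally a no-op on %ymm -> %xmm). */
    memcpy(&a_lo, (const char *)a, sizeof a_lo);
    memcpy(&b_lo, (const char *)b, sizeof b_lo);

    /* High half: the next 16 bytes (ideally one vextractf128 each). */
    memcpy(&a_hi, (const char *)a + sizeof a_lo, sizeof a_hi);
    memcpy(&b_hi, (const char *)b + sizeof b_lo, sizeof b_hi);

    /* Two 128-bit adds (vpaddq on %xmm registers with -mavx). */
    r_lo = a_lo + b_lo;
    r_hi = a_hi + b_hi;

    /* Reassemble the 256-bit result (ideally one vinsertf128). */
    memcpy((char *)out, &r_lo, sizeof r_lo);
    memcpy((char *)out + sizeof r_lo, &r_hi, sizeof r_hi);
}
```

Comparing the -O3 -mavx output of this helper with that of foo may help tell whether the stack round trip comes from the subreg extraction itself or from how the vector add is lowered.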