https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496
--- Comment #3 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jakub Jelinek from comment #0)
> With -O2 -mavx{,2,512f}, we get the following on this testcase:
>
> typedef __int128 V __attribute__((vector_size (32)));
> typedef long long W __attribute__((vector_size (32)));
> typedef int X __attribute__((vector_size (16)));
> typedef __int128 Y __attribute__((vector_size (64)));
> typedef long long Z __attribute__((vector_size (64)));
>
> W f1 (__int128 x, __int128 y) { return (W) ((V) { x, y }); }
> W f2 (__int128 x, __int128 y) { return (W) ((V) { y, x }); }
>
>         movq    %rdi, -16(%rsp)
>         movq    %rsi, -8(%rsp)
>         movq    %rdx, -32(%rsp)
>         movq    %rcx, -24(%rsp)
>         vmovdqa -32(%rsp), %xmm0
>         vmovdqa -16(%rsp), %xmm1
>         vinserti128     $0x1, %xmm0, %ymm1, %ymm0
>
> for f1, which I'm afraid is hard to do anything about, because the RA
> didn't see the usefulness of spilling in a different order; but for f2:
>
>         movq    %rdx, -32(%rsp)
>         movq    %rcx, -24(%rsp)
>         vmovdqa -32(%rsp), %xmm0
>         movq    %rdi, -16(%rsp)
>         movq    %rsi, -8(%rsp)
>         vinserti128     $0x1, -16(%rsp), %ymm0, %ymm0
>
> Before scheduling, the vmovdqa is next to the vinserti128 from the
> adjacent memory; in that case it might be a win to use a single
> vmovdqa -32(%rsp), %ymm0 instead.  Though, the MEM has just A128 in the
> RTL dump, so maybe we need to use vmovdqu instead, unless we can prove
> it is 256-bit aligned (it is in this case, but not generally).

Maybe we can introduce a helper similar to movdi_to_sse on 32-bit
targets, but to handle TImode on 64-bit targets?
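
A minimal sketch of what such a helper might look like, modeled on
movdi_to_sse.  The name movti_to_sse and the exact expansion strategy
are assumptions, not code that exists in the tree:

/* Hypothetical analog of movdi_to_sse for 64-bit targets: move a
   TImode value OP1 (held in a pair of general registers) into the SSE
   register OP0 directly, instead of spilling both halves to the stack
   and reloading them with vmovdqa.  */

static void
movti_to_sse (rtx op0, rtx op1)
{
  /* Split the TImode source into its two 64-bit halves.  */
  rtx lo = gen_lowpart (DImode, op1);
  rtx hi = gen_highpart (DImode, op1);

  /* Concatenate the halves into a V2DI value; on SSE4.1/AVX this
     should be recognizable as vmovq + vpinsrq, with no stack
     traffic.  */
  rtx vec = gen_rtx_VEC_CONCAT (V2DImode, lo, hi);
  emit_insn (gen_rtx_SET (gen_lowpart (V2DImode, op0), vec));
}

Whether this belongs in the TImode move expander or in the existing
vector-init paths is a separate question; the point is just to keep
both halves in registers rather than bouncing them through -32(%rsp).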