https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80425
Bug ID: 80425
Summary: Extra inter-unit register move with zero-extension
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ubizjak at gmail dot com
Target Milestone: ---

The testcase is taken from PR80381:

--cut here--
#include <x86intrin.h>

__m512i
f1 (__m512i x, int a)
{
  return _mm512_srai_epi32 (x, a);
}
--cut here--

When compiled with -O2 -mavx512f -mtune=intel, the resulting assembly reads:

f1:
        movl    %edi, %edi              # 8     *zero_extendsidi2/4     [length = 2]
        vmovq   %rdi, %xmm1             # 21    *movdi_internal/20      [length = 6]
        vpsrad  %xmm1, %zmm0, %zmm0     # 13    ashrv16si3/1    [length = 6]
        ret                             # 24    simple_return_internal  [length = 1]

(insn 8) and (insn 21) could be merged to

        vmovd   %edx, %xmm0             # 13    *zero_extendsidi2/10    [length = 6]

The register allocator somehow avoids zero-extension to an SSE reg in (insn 8) and generates an input reload (insn 21) for (insn 13):

Inserting insn reload before:
   21: r100:DI=r196:DI
...
      Choosing alt 19 in insn 21:  (0) ?*Yi  (1) r {*movdi_internal}

RA could choose the same (?*Yi, r) alternative in (insn 12). The REE pass also doesn't merge (insn 8) and (insn 21).
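The claim that (insn 8) and (insn 21) can be merged rests on both sequences leaving the same bits in the SSE register. The following is only a portable scalar sketch of that reasoning, not GCC code: `xmm_model`, `two_insn_sequence`, `one_insn_sequence`, `xmm_equal` and `check_merge_is_safe` are hypothetical names introduced for illustration, and the register is modelled as two 64-bit lanes on the assumption that both the 64-bit and the 32-bit GPR-to-SSE moves zero the remaining upper bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit SSE register as two 64-bit lanes. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} xmm_model;

/* Models the current output: a 32->64 bit zero-extension in a general
   register (movl %edi, %edi), then a 64-bit move into the SSE register
   (vmovq), which is assumed to clear the upper lane. */
xmm_model two_insn_sequence(int a)
{
    uint64_t gpr = (uint64_t)(uint32_t)a;   /* zero-extend SI -> DI */
    xmm_model r = { gpr, 0 };
    return r;
}

/* Models the proposed single instruction: a 32-bit move into the SSE
   register (vmovd), which is assumed to zero every bit above bit 31. */
xmm_model one_insn_sequence(int a)
{
    xmm_model r = { 0, 0 };
    r.lo = (uint32_t)a;                     /* low 32 bits, rest zero */
    return r;
}

/* Returns nonzero when both modelled registers hold identical bits. */
int xmm_equal(xmm_model x, xmm_model y)
{
    return x.lo == y.lo && x.hi == y.hi;
}

/* Asserts that, in this model, the merge does not change the value
   that the shift instruction would read as its count. */
void check_merge_is_safe(int a)
{
    assert(xmm_equal(two_insn_sequence(a), one_insn_sequence(a)));
}
```

Calling `check_merge_is_safe` with a negative count such as `-1` exercises the interesting case, since a sign-extension would set the upper 32 bits while both modelled sequences leave them clear.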