https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91796
Bug ID: 91796
Summary: Sub-optimal YMM register allocation.
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: maxim.yegorushkin at gmail dot com
Target Milestone: ---

The following code, when compiled with `g++ -O3 -mavx2 -std=c++11`:

    #include <immintrin.h>

    __m256d copysign2_pd(__m256d from, __m256d to) {
        auto a = _mm256_castpd_si256(from);
        // All-ones compare result shifted left by 63 leaves only the sign bit set in each lane.
        auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
        // (avx_signbit & from) | (~avx_signbit & to)
        return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
    }

generates the following assembly:

    copysign2_pd(double __vector(4), double __vector(4)):
            vmovapd ymm2, ymm0
            vmovapd ymm0, YMMWORD PTR .LC3[rip]
            vandnpd ymm1, ymm0, ymm1
            vandpd  ymm0, ymm0, ymm2
            vorpd   ymm0, ymm0, ymm1
            ret
    .LC3:
            .long 0
            .long -2147483648
            .long 0
            .long -2147483648
            .long 0
            .long -2147483648
            .long 0
            .long -2147483648

In the generated assembly, the instruction `vmovapd ymm2, ymm0` is unnecessary. The compiler copies `from` out of ymm0 only so that it can load the constant .LC3 into ymm0; it could instead load .LC3 directly into ymm2 and leave `from` in ymm0.

The expected code is:

    copysign2_pd(double __vector(4), double __vector(4)):
            vmovapd ymm2, YMMWORD PTR .LC3[rip]
            vandnpd ymm1, ymm2, ymm1
            vandpd  ymm0, ymm2, ymm0
            vorpd   ymm0, ymm0, ymm1
            ret
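
For readers who want to check what the mask and the .LC3 constant represent, here is a minimal scalar sketch of a single 64-bit lane of the function above. It is not part of the original report: the helper names `sign_mask_lane`, `lc3_lane`, and `copysign2_lane` are hypothetical, and the code assumes `double` is IEEE-754 binary64 and that the constant is stored little-endian (low `.long` first), as on x86-64.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: per-lane value of avx_signbit.
// cmpeq(a, a) yields all ones; shifting left by 63 keeps only bit 63.
inline std::uint64_t sign_mask_lane() {
    const std::uint64_t all_ones = ~std::uint64_t{0};
    return all_ones << 63;
}

// Hypothetical helper: rebuild one 64-bit lane of .LC3 from its two
// 32-bit words (".long 0" then ".long -2147483648"), assuming
// little-endian layout so the first word is the low half.
inline std::uint64_t lc3_lane() {
    const std::uint32_t lo = 0u;
    const std::uint32_t hi = 0x80000000u;  // bit pattern of -2147483648
    return (std::uint64_t{hi} << 32) | lo;
}

// Hypothetical scalar model of one lane of copysign2_pd:
// (mask & from) | (~mask & to), i.e. sign of `from`, magnitude of `to`.
// Assumes IEEE-754 binary64 doubles.
inline double copysign2_lane(double from, double to) {
    std::uint64_t f = 0, t = 0;
    std::memcpy(&f, &from, sizeof f);
    std::memcpy(&t, &to, sizeof t);
    const std::uint64_t m = sign_mask_lane();
    const std::uint64_t r = (m & f) | (~m & t);
    double out = 0.0;
    std::memcpy(&out, &r, sizeof out);
    return out;
}
```

Under these assumptions, `copysign2_lane(from, to)` should agree with `std::copysign(to, from)`, and `lc3_lane()` should equal `sign_mask_lane()`, which is consistent with GCC folding the compare-and-shift sequence into the constant load shown in the assembly.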