https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91796

            Bug ID: 91796
           Summary: Sub-optimal YMM register allocation.
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: maxim.yegorushkin at gmail dot com
  Target Milestone: ---

The following code, when compiled with `g++ -O3 -mavx2 -std=c++11`,

    #include <immintrin.h>

    __m256d copysign2_pd(__m256d from, __m256d to) {
        auto a = _mm256_castpd_si256(from);
        // All-ones in every lane, shifted left by 63: only the sign bit is set.
        auto avx_signbit =
            _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
        // (avx_signbit & from) | (~avx_signbit & to)
        return _mm256_or_pd(_mm256_and_pd(avx_signbit, from),
                            _mm256_andnot_pd(avx_signbit, to));
    }
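
For readers without AVX2 hardware at hand, the per-lane bit selection can be sketched in scalar C++. This is only an illustration of the `(avx_signbit & from) | (~avx_signbit & to)` formula; the name `copysign2_scalar` is hypothetical and is not part of the report.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar counterpart of one 64-bit lane of copysign2_pd.
// Takes the sign bit from `from` and all remaining bits from `to`.
double copysign2_scalar(double from, double to) {
    std::uint64_t f, t;
    std::memcpy(&f, &from, sizeof f);  // reinterpret the doubles as raw bits
    std::memcpy(&t, &to, sizeof t);

    // All-ones shifted left by 63 leaves only the sign bit set.
    const std::uint64_t signbit = ~std::uint64_t{0} << 63;

    // (signbit & from) | (~signbit & to)
    const std::uint64_t r = (signbit & f) | (~signbit & t);

    double out;
    std::memcpy(&out, &r, sizeof out);
    return out;
}
```

For example, `copysign2_scalar(-1.0, 2.0)` yields `-2.0`: the sign comes from the first argument and the magnitude from the second.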

generates the following assembly:

    copysign2_pd(double __vector(4), double __vector(4)):
            vmovapd ymm2, ymm0
            vmovapd ymm0, YMMWORD PTR .LC3[rip]
            vandnpd ymm1, ymm0, ymm1
            vandpd  ymm0, ymm0, ymm2
            vorpd   ymm0, ymm0, ymm1
            ret
    .LC3:
            .long   0
            .long   -2147483648
            .long   0
            .long   -2147483648
            .long   0
            .long   -2147483648
            .long   0
            .long   -2147483648
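
The `.LC3` constant is the sign-bit mask, already folded at compile time: each pair of `.long` values forms one little-endian 64-bit lane. A minimal sketch of that decoding follows; the helper `lane_from_longs` is a hypothetical name used only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: combine two 32-bit `.long` values, low word first
// (little-endian order), into the 64-bit lane they occupy in memory.
constexpr std::uint64_t lane_from_longs(std::int32_t lo, std::int32_t hi) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32)
         | static_cast<std::uint32_t>(lo);
}

// `.long 0` followed by `.long -2147483648` (INT32_MIN) is 0x8000000000000000,
// the same value as all-ones shifted left by 63.
static_assert(lane_from_longs(0, INT32_MIN) == 0x8000000000000000ull,
              "each lane of .LC3 holds only the sign bit");
static_assert(lane_from_longs(0, INT32_MIN) == (~std::uint64_t{0} << 63),
              "matches the cmpeq + slli result from the source");
```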

In the generated assembly, the instruction `vmovapd ymm2, ymm0` is unnecessary.
The compiler could instead load the constant `.LC3` directly into ymm2 and leave
`from` in ymm0. The expected code is:

    copysign2_pd(double __vector(4), double __vector(4)):
            vmovapd ymm2, YMMWORD PTR .LC3[rip]
            vandnpd ymm1, ymm2, ymm1
            vandpd  ymm0, ymm2, ymm0
            vorpd   ymm0, ymm0, ymm1
            ret
