https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000

            Bug ID: 117000
           Summary: Inefficient code for 32-byte struct comparison (ptest
                    missing)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chfast at gmail dot com
  Target Milestone: ---

I was investigating why, in GCC 13.3, the functions test1 and test2 produce
different x86 assembly. They differ only in where the int -> U256
user-defined conversion takes place.

This led to the discovery that the x86-64-v2 code generated for all the
examples is not very efficient. E.g. for some reason a shift instruction
(psrldq) is used.

In GCC 14+, test2 also compiles to the same code as test1.

https://godbolt.org/z/r1vfcPone


using uint64_t = unsigned long;

struct U256
{
    uint64_t words_[4]{};

    U256(uint64_t v)
      : words_{v}
    {}
};

bool eq(const U256& x, const U256& y)
{
    uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}

bool eqi(const U256& x, uint64_t y)
{
    return eq(x, U256(y));
}

auto test1(const U256& x)
{
    return eqi(x, uint64_t(0));
}

bool test2(const U256& x)
{
    return eq(x, U256(0));
}


test1(U256 const&):
        movdqu  xmm1, XMMWORD PTR [rdi+16]
        movdqu  xmm0, XMMWORD PTR [rdi]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 8
        por     xmm0, xmm1
        movq    rax, xmm0
        test    rax, rax
        sete    al
        ret
test2(U256 const&):
        mov     rax, QWORD PTR [rdi]
        or      rax, QWORD PTR [rdi+8]
        or      rax, QWORD PTR [rdi+16]
        or      rax, QWORD PTR [rdi+24]
        sete    al
        ret
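
For comparison, here is a rough sketch of the kind of ptest-based check the
summary alludes to, written with SSE4.1 intrinsics (SSE4.1 is part of
x86-64-v2). The name is_zero_ptest is a hypothetical helper and not part of
the test case above; the U256 constructor is omitted for brevity, and the
scalar fallback is only there so the snippet builds on non-x86 targets. One
would hope for something close to two loads, por, ptest and sete, but that
is an expectation, not verified compiler output.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_SSE41 1
#endif

struct U256
{
    std::uint64_t words_[4]{};
};

#ifdef SKETCH_HAVE_SSE41
// Hypothetical helper: true iff all 256 bits of x are zero.
// The target attribute enables SSE4.1 for this function only (GCC/Clang).
__attribute__((target("sse4.1")))
bool is_zero_ptest(const U256& x)
{
    // Load the low and high 128-bit halves (unaligned loads).
    const __m128i lo =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x.words_[0]));
    const __m128i hi =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x.words_[2]));

    // Fold both halves together; any set bit survives the OR.
    const __m128i folded = _mm_or_si128(lo, hi);

    // ptest: returns 1 when (folded & folded) has no bits set.
    return _mm_testz_si128(folded, folded) != 0;
}
#else
// Scalar fallback with the same result, same shape as test2 above.
bool is_zero_ptest(const U256& x)
{
    return (x.words_[0] | x.words_[1] | x.words_[2] | x.words_[3]) == 0;
}
#endif
```

The point of ptest here is that it sets ZF directly from the 128-bit value,
so no psrldq/movq/test sequence is needed to bring the result into a
general-purpose register.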
