https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000
Bug ID: 117000
Summary: Inefficient code for 32-byte struct comparison (ptest missing)
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: chfast at gmail dot com
Target Milestone: ---

I was investigating why, in GCC 13.3, the functions test1 and test2 produce
different x86 assembly. They differ only in the placement of the int -> U256
user-defined conversion.

This led to the discovery that the code generated for x86-64-v2 is not very
efficient in any of the examples. E.g. for some reason a shift instruction
(psrldq) is used.

In GCC 14+, test2 compiles to the same code as test1.

https://godbolt.org/z/r1vfcPone

using uint64_t = unsigned long;

struct U256 {
    uint64_t words_[4]{};
    U256(uint64_t v) : words_{v} {}
};

bool eq(const U256& x, const U256& y) {
    uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}

bool eqi(const U256& x, uint64_t y) {
    return eq(x, U256(y));
}

auto test1(const U256& x) {
    return eqi(x, uint64_t(0));
}

bool test2(const U256& x) {
    return eq(x, U256(0));
}

test1(U256 const&):
        movdqu  xmm1, XMMWORD PTR [rdi+16]
        movdqu  xmm0, XMMWORD PTR [rdi]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 8
        por     xmm0, xmm1
        movq    rax, xmm0
        test    rax, rax
        sete    al
        ret
test2(U256 const&):
        mov     rax, QWORD PTR [rdi]
        or      rax, QWORD PTR [rdi+8]
        or      rax, QWORD PTR [rdi+16]
        or      rax, QWORD PTR [rdi+24]
        sete    al
        ret
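
For comparison, here is a hand-written sketch of what a ptest-based comparison could look like when written with SSE4.1 intrinsics. It is only an illustration of the instruction sequence the summary alludes to, not a claim about what GCC should emit. The name eq_ptest is hypothetical and not part of the original test case; the target attribute is GCC/Clang-specific, and a scalar fallback is included for non-x86-64 targets.

```cpp
#include <cstdint>

struct U256 {
    std::uint64_t words_[4]{};
    U256(std::uint64_t v) : words_{v} {}
};

#if defined(__x86_64__)
#include <immintrin.h>

// Hypothetical helper: XOR both 16-byte halves, OR the results together,
// then use ptest (_mm_testz_si128) to check whether every bit is zero.
// This avoids the psrldq/movq/test sequence seen in the report.
__attribute__((target("sse4.1")))
bool eq_ptest(const U256& x, const U256& y) {
    const __m128i xlo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x.words_[0]));
    const __m128i xhi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x.words_[2]));
    const __m128i ylo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y.words_[0]));
    const __m128i yhi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y.words_[2]));
    const __m128i folded = _mm_or_si128(_mm_xor_si128(xlo, ylo),
                                        _mm_xor_si128(xhi, yhi));
    // _mm_testz_si128(a, b) returns 1 when (a & b) has no bits set.
    return _mm_testz_si128(folded, folded) != 0;
}
#else
// Scalar fallback so the sketch still builds on other targets.
bool eq_ptest(const U256& x, const U256& y) {
    std::uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}
#endif
```

With a zero right-hand side, as in test1 and test2, the XORs become no-ops, so the whole check reduces to two loads, one por, and one ptest.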