https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #18 from Mason <slash.tmp at free dot fr> ---
Hello Michael_S,
As far as I can see, massaging the source helps GCC generate optimal code
(optimal in terms of instruction count; I'm not yet convinced about scheduling).
#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned char c = 0;
    c = _addcarry_u64(c, A[0], B[0], dst+0);
    c = _addcarry_u64(c, A[1], B[1], dst+1);
    c = _addcarry_u64(c, A[2], B[2], dst+2);
    c = _addcarry_u64(c, A[3], B[3], dst+3);
}
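For sanity-checking the results, the same 4-limb carry chain can be sketched without intrinsics. This is only an illustration: add4_ref is a hypothetical name, not something from this bug report, and unlike add4i it returns the final carry, which add4i discards.

```c
typedef unsigned long long u64;

/* Hypothetical portable reference (an assumption for illustration):
   adds two 256-bit numbers stored as four 64-bit limbs, least
   significant limb first, and returns the carry out of the top limb. */
static unsigned char add4_ref(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned char c = 0;
    for (int i = 0; i < 4; i++) {
        u64 s = A[i] + c;          /* wraps only if A[i] is all ones and c == 1 */
        unsigned char c1 = s < c;  /* carry out of A[i] + c */
        u64 t = s + B[i];
        unsigned char c2 = t < s;  /* carry out of s + B[i] */
        dst[i] = t;
        c = c1 | c2;               /* the two carries are never both set */
    }
    return c;
}
```

Comparing dst from add4i against dst from add4_ref on a few inputs (for example all-ones plus one) should be enough to catch a broken carry chain.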
On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:
add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret
I'll run a few benchmarks to test optimal scheduling.