Consider the following function, which adds 1 to its argument using Intel
intrinsics:
#include <emmintrin.h>

unsigned
add1(unsigned x)
{
    __m128i a = _mm_cvtsi32_si128(x);
    __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));
    return _mm_cvtsi128_si32(b);
}
GCC goes through memory no fewer than three times: once when converting x to a
vector (a store to the stack followed by a reload), once when loading the
constant 1 as a vector (from .LC0), and once when converting the result back
to an integer (another store and reload):
add1:
        pxor    %xmm0, %xmm0
        movq    %rdi, -16(%rsp)
        movq    -16(%rsp), %xmm1
        movss   %xmm1, %xmm0
        paddd   .LC0(%rip), %xmm0
        movd    %xmm0, -4(%rsp)
        movl    -4(%rsp), %eax
        ret
For comparison, here is the code generated by the Intel compiler, which keeps
everything in registers:
add1:
        movl    $1, %edx
        movd    %edi, %xmm1
        movd    %edx, %xmm0
        paddd   %xmm0, %xmm1
        movd    %xmm1, %eax
        ret
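
For anyone reproducing this, the following is a minimal sketch of a self-check
that the intrinsic version computes the same value as plain unsigned addition,
including wraparound at UINT_MAX. The names add1_scalar and check_add1 are
hypothetical and not part of the original test case; the sketch assumes an
x86 target with SSE2 and GCC's modular unsigned-to-int conversion.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The function from the report, unchanged apart from indentation. */
unsigned
add1(unsigned x)
{
    __m128i a = _mm_cvtsi32_si128(x);
    __m128i b = _mm_add_epi32(a, _mm_set_epi32(0, 0, 0, 1));
    return _mm_cvtsi128_si32(b);
}

/* Hypothetical scalar reference, not part of the report. */
unsigned
add1_scalar(unsigned x)
{
    return x + 1u;  /* unsigned arithmetic wraps modulo UINT_MAX + 1 */
}

/* Hypothetical helper: compare both versions on a few edge cases. */
void
check_add1(void)
{
    static const unsigned samples[] = { 0u, 1u, 41u, 0x7fffffffu, UINT_MAX };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(add1(samples[i]) == add1_scalar(samples[i]));
}
```

Compiling only add1 to assembly with optimization enabled (for example with
gcc -O2 -S; the report does not state which flags were used, so that choice is
an assumption) shows whether a given compiler version still stores to and
reloads from the stack.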
--
Summary: Converting between int and vector using intrinsics goes through memory
Product: gcc
Version: 4.3.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: jch at pps dot jussieu dot fr
GCC build triplet: x86_64-linux-gnu
GCC host triplet: x86_64-linux-gnu
GCC target triplet: x86_64-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38015