Hello,
I was considering changing the implementation of _mm_loadu_pd in x86's
emmintrin.h to avoid a builtin. Here are 3 versions:

typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d_u __attribute__ ((__vector_size__ (16), __may_alias__,
                                         aligned(1)));

__m128d f (double const *__P)
{
  return __builtin_ia32_loadupd (__P);
}

__m128d g (double const *__P)
{
  return *(__m128d_u*)(__P);
}

__m128d h (double const *__P)
{
  __m128d __r;
  __builtin_memcpy (&__r, __P, 16);
  return __r;
}

f is what we have currently. f and g generate the same code; h does too,
except at -O0, where its code is slightly longer.
(Note that I haven't regtested either g or h yet.)
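
For anyone who wants to compare g and h outside of emmintrin.h, here is a
minimal sketch using only generic GCC/Clang vector extensions. The names
v2df, v2df_u, load_g, load_h and loads_agree are hypothetical and exist only
for this illustration. The sketch checks that both variants return the same
values from a misaligned address; it says nothing about the generated code.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for __m128d and __m128d_u; hypothetical names, not the
   ones defined by emmintrin.h.  */
typedef double v2df __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double v2df_u __attribute__ ((__vector_size__ (16), __may_alias__,
                                      __aligned__ (1)));

/* Variant g: dereference through the under-aligned vector type.  */
v2df load_g (double const *p)
{
  return *(v2df_u const *) p;
}

/* Variant h: copy 16 bytes with memcpy.  */
v2df load_h (double const *p)
{
  v2df r;
  memcpy (&r, p, sizeof r);
  return r;
}

/* Returns 1 if both variants read the expected two doubles from a
   deliberately misaligned address, 0 otherwise.  */
int loads_agree (void)
{
  /* The buffer is 16-byte aligned, so buf + 1 is guaranteed to be
     misaligned for double.  */
  unsigned char buf[1 + 2 * sizeof (double)] __attribute__ ((aligned (16)));
  double in[2] = { 1.5, -2.25 };
  memcpy (buf + 1, in, sizeof in);

  /* Forming a misaligned double pointer mirrors how _mm_loadu_pd is
     typically called; ISO C does not strictly bless this conversion.  */
  double const *p = (double const *) (buf + 1);
  v2df a = load_g (p);
  v2df b = load_h (p);

  return a[0] == in[0] && a[1] == in[1]
         && b[0] == in[0] && b[1] == in[1];
}
```

Calling loads_agree () from a small main and asserting on the result is
enough to exercise both variants.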
1) I don't have a strong preference between g and h. Is there a reason
to pick one over the other? I slightly prefer g, which expands to
  __m128d _3;
  _3 = MEM[(__m128d_u * {ref-all})__P_2(D)];
while h yields
  __int128 unsigned _3;
  _3 = MEM[(char * {ref-all})__P_2(D)];
  _4 = VIEW_CONVERT_EXPR<vector(2) double>(_3);
2) Reading Intel's doc for movupd, it says: "If alignment checking is
enabled (CR0.AM = 1, RFLAGS.AC = 1, and CPL = 3), an alignment-check
exception (#AC) may or may not be generated (depending on processor
implementation) when the operand is not aligned on an 8-byte boundary."
Since we generate movupd for memcpy even when the known alignment is
presumably only 1 byte, I assume that gcc does not support running with
alignment checking enabled?
--
Marc Glisse