Hello,

I was considering changing the implementation of _mm_loadu_pd in x86's emmintrin.h to avoid a builtin. Here are 3 versions:

typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d_u __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1)));

__m128d f (double const *__P)
{
  return __builtin_ia32_loadupd (__P);
}

__m128d g (double const *__P)
{
  return *(__m128d_u*)(__P);
}

__m128d h (double const *__P)
{
  __m128d __r;
  __builtin_memcpy (&__r, __P, 16);
  return __r;
}


f is what we have currently. f and g generate the same code. h also generates the same code except at -O0 where it is slightly longer.

(Note that I haven't regtested g or h yet.)
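
For a quick sanity check of the g and h approaches outside the header, something like the following might do. It is only a sketch: the names v2df, v2df_u, load_deref, load_memcpy and loads_agree are made up for illustration, and it assumes a compiler with GNU vector extensions and C11 _Alignas. It loads from a pointer that is 8-byte aligned but not 16-byte aligned and checks that both variants return the same two doubles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, mirroring __m128d and __m128d_u above.  */
typedef double v2df __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double v2df_u __attribute__ ((__vector_size__ (16), __may_alias__,
				      __aligned__ (1)));

/* Same idea as g: dereference through the alignment-1 vector type.  */
static v2df load_deref (double const *p)
{
  return *(v2df_u const *) p;
}

/* Same idea as h: copy 16 bytes into a local vector.  */
static v2df load_memcpy (double const *p)
{
  v2df r;
  memcpy (&r, p, sizeof r);
  return r;
}

/* Returns 1 if both loads read the expected values from a source that
   is 8-byte aligned but not 16-byte aligned.  */
static int loads_agree (void)
{
  _Alignas (16) double buf[3] = { 0.0, 1.5, -2.25 };
  double const *p = buf + 1;

  /* Make sure the test really exercises the misaligned case.  */
  assert (((uintptr_t) p & 15) != 0);

  v2df a = load_deref (p);
  v2df b = load_memcpy (p);
  return a[0] == 1.5 && a[1] == -2.25
	 && memcmp (&a, &b, sizeof a) == 0;
}
```

Calling loads_agree () from a small main and checking that it returns 1 only confirms that the two forms behave the same at run time; it says nothing about the generated code, for which comparing the assembly output is still needed.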

1) I don't have a strong preference between g and h. Is there a reason to pick one over the other? I may have a slight preference for g, which expands to

  __m128d _3;
  _3 = MEM[(__m128d_u * {ref-all})__P_2(D)];

while h yields

  __int128 unsigned _3;
  _3 = MEM[(char * {ref-all})__P_2(D)];
  _4 = VIEW_CONVERT_EXPR<vector(2) double>(_3);


2) Intel's documentation for movupd says: "If alignment checking is enabled (CR0.AM = 1, RFLAGS.AC = 1, and CPL = 3), an alignment-check exception (#AC) may or may not be generated (depending on processor implementation) when the operand is not aligned on an 8-byte boundary." Since we already generate movupd for memcpy even when the alignment is presumably only 1 byte, I assume that gcc does not support running with alignment checking enabled?

--
Marc Glisse
