Vector unaligned load/store x86 intrinsics

Marc Glisse Thu, 25 Aug 2016 12:41:56 -0700

Hello,

I was considering changing the implementation of _mm_loadu_pd in x86's emmintrin.h to avoid a builtin. Here are 3 versions:


typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d_u __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1)));

__m128d f (double const *__P)
{
  return __builtin_ia32_loadupd (__P);
}

__m128d g (double const *__P)
{
  return *(__m128d_u*)(__P);
}

__m128d h (double const *__P)
{
  __m128d __r;
  __builtin_memcpy (&__r, __P, 16);
  return __r;
}

f is what we have currently. f and g generate the same code. h also generates the same code except at -O0 where it is slightly longer.
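
For anyone who wants to try g and h outside of emmintrin.h, here is a rough standalone sketch (not part of the proposed change). The names v2df, v2df_u, load_cast, load_memcpy and unaligned_loads_agree are made up for illustration, and the loaders take void const * instead of double const * only so that the test can pass a deliberately misaligned address without forming a misaligned double pointer. It only checks that both variants read the same values; it says nothing about the generated code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for __m128d and __m128d_u from the message. */
typedef double v2df __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double v2df_u
  __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));

/* Variant g: dereference through a vector type with 1-byte alignment. */
static v2df load_cast (void const *p)
{
  return *(v2df_u const *) p;
}

/* Variant h: copy 16 bytes into a properly aligned temporary. */
static v2df load_memcpy (void const *p)
{
  v2df r;
  memcpy (&r, p, sizeof r);
  return r;
}

/* Returns 1 if both variants read the same two doubles from an
   address that is only guaranteed to be 1-byte aligned.  */
int unaligned_loads_agree (void)
{
  _Alignas (16) unsigned char buf[48];
  double const src[2] = { 1.5, -2.25 };

  assert (sizeof (v2df) == sizeof src);
  memcpy (buf + 1, src, sizeof src);   /* buf + 1 is misaligned on purpose */

  v2df a = load_cast (buf + 1);
  v2df b = load_memcpy (buf + 1);

  return a[0] == 1.5 && a[1] == -2.25 && b[0] == a[0] && b[1] == a[1];
}
```

Calling unaligned_loads_agree () from a small main and compiling at -O0 and -O2 is enough to exercise both paths.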


(note that I haven't regtested either version yet)

1) I don't have any strong preference between g and h, is there a reason to pick one over the other? I may have a slight preference for g, which expands to


  __m128d _3;
  _3 = MEM[(__m128d_u * {ref-all})__P_2(D)];

while h yields

  __int128 unsigned _3;
  _3 = MEM[(char * {ref-all})__P_2(D)];
  _4 = VIEW_CONVERT_EXPR<vector(2) double>(_3);

2) Reading Intel's doc for movupd, it says: "If alignment checking is enabled (CR0.AM = 1, RFLAGS.AC = 1, and CPL = 3), an alignment-check exception (#AC) may or may not be generated (depending on processor implementation) when the operand is not aligned on an 8-byte boundary." Since we generate movupd for memcpy even when the alignment is presumably only 1 byte, I assume that this alignment-check stuff is not supported by gcc?


--
Marc Glisse
