https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120799

--- Comment #5 from Justus <justus2510 at proton dot me> ---
I understand that C considers this UB, but I don't understand how this is
different from, say, _mm_storel_epi64? It's also technically UB. The only
difference is one takes a double * and the other takes an __m128i *. Other than
that, the wording in the Intel manual is pretty much the same.

From a practical perspective, storing the upper 8 bytes of an
__m128/__m128i/__m128d value can be extremely useful in certain cases (I
specifically need this functionality, which is how I discovered this bug in the
first place). If _mm_storeh_pd can't be used, there are two options. You can add
an extra shift/unpack and then use _mm_storel_epi64. Unfortunately, the
optimizer doesn't fold this back into a single store, so it's a non-starter. Or
you can use _mm_storeh_pd with a temporary double variable and then memcpy the
value into the original unaligned buffer. GCC's optimizer does fold this (since
GCC 11), but Clang's doesn't, so I've been forced to #ifdef this to avoid an
extra instruction. It's very annoying.
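
For concreteness, here is a minimal sketch of the two workarounds described
above. The helper names store_high_shift and store_high_memcpy are made up for
illustration, and _mm_unpackhi_epi64 is just one possible choice for the
"shift/unpack" step; this shows the shape of the code, not the exact code from
my project.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Option 1 (hypothetical helper): move the upper 8 bytes into the lower
 * half, then store the lower half with _mm_storel_epi64. */
static void store_high_shift(void *dst, __m128i v)
{
    __m128i hi = _mm_unpackhi_epi64(v, v);  /* both halves = upper half of v */
    _mm_storel_epi64((__m128i *)dst, hi);
}

/* Option 2 (hypothetical helper): store the upper half into a properly
 * aligned temporary double, then memcpy it to the possibly unaligned
 * destination. */
static void store_high_memcpy(void *dst, __m128i v)
{
    double tmp;
    _mm_storeh_pd(&tmp, _mm_castsi128_pd(v));
    memcpy(dst, &tmp, sizeof tmp);
}
```

Both helpers write the same 8 bytes to dst; the only difference is whether the
compiler turns them into a single store.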

I guess my main question is this: Why should _mm_storel_pd/_mm_storeh_pd be
treated differently than _mm_storel_epi64?
