https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120799
--- Comment #5 from Justus <justus2510 at proton dot me> ---
I understand that C considers this UB, but I don't understand how this is different from, say, _mm_storel_epi64? It's also technically UB. The only difference is one takes a double * and the other takes an __m128i *. Other than that, the wording in the Intel manual is pretty much the same.

From a practical perspective, storing the upper 8 bytes of an __m128/__m128i/__m128d value can be extremely useful in certain cases (I specifically need this functionality, which is how I discovered this bug in the first place).

If _mm_storeh_pd won't work, there are two options:

1. Add an extra shift/unpack and then use _mm_storel_epi64. Unfortunately, this doesn't get picked up by the optimizer, so this is a non-starter.

2. Use _mm_storeh_pd with a temporary double variable, and then use memcpy to copy the value into the original unaligned buffer. This does get picked up by GCC's optimizer (since GCC 11), but not by Clang's, so I've been forced to #ifdef this so I don't have an extra instruction in there. It's very annoying.

I guess my main question is this: Why should _mm_storel_pd/_mm_storeh_pd be treated differently than _mm_storel_epi64?
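
For illustration, the two workarounds described above might look roughly like the sketch below. The helper names (store_high_unpack, store_high_memcpy, store_high_selfcheck) are hypothetical and not taken from the comment or from GCC. The sketch assumes an x86 target with SSE2 and only shows the shape of each approach; it makes no claim about the instructions any particular compiler emits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Workaround 1 (hypothetical helper): move the upper 64 bits into the
   low half, then store the low half with _mm_storel_epi64.
   Note: the pointer cast is the "technically UB" part the comment
   mentions when dst is not suitably aligned or typed. */
static void store_high_unpack(void *dst, __m128i v)
{
    _mm_storel_epi64((__m128i *)dst, _mm_unpackhi_epi64(v, v));
}

/* Workaround 2 (hypothetical helper): store the upper half into a
   properly typed temporary, then memcpy it to the unaligned buffer. */
static void store_high_memcpy(void *dst, __m128i v)
{
    double tmp;
    assert(sizeof tmp == 8);  /* the upper half is 8 bytes */
    _mm_storeh_pd(&tmp, _mm_castsi128_pd(v));
    memcpy(dst, &tmp, sizeof tmp);
}

/* Returns 1 if both helpers write the upper 8 bytes of v to a
   deliberately misaligned destination (offset 1 in a byte buffer). */
static int store_high_selfcheck(void)
{
    unsigned char buf1[16] = {0}, buf2[16] = {0};
    const uint64_t hi = 0x1122334455667788ULL;
    __m128i v = _mm_set_epi64x((long long)hi, 0x0102030405060708LL);
    uint64_t out1, out2;

    store_high_unpack(buf1 + 1, v);
    store_high_memcpy(buf2 + 1, v);
    memcpy(&out1, buf1 + 1, sizeof out1);
    memcpy(&out2, buf2 + 1, sizeof out2);
    return out1 == hi && out2 == hi;
}
```

Both helpers should produce the same bytes in memory. According to the comment, the difference lies in whether the optimizer folds each form into a single store.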