http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539
--- Comment #2 from Thiago Macieira <thiago at kde dot org> ---

I have to use _mm_loadu_si128 because non-VEX SSE requires explicit unaligned loads.

Here's more food for thought:

    __m128i result = _mm_cmpeq_epi16(*(__m128i*)p1, *(__m128i*)p2);

For non-VEX code, the compiler has so far emitted one MOVDQA plus one PCMPEQW when it could, which requires both sources to be aligned. With VEX, VPCMPEQW accepts an unaligned memory operand, so should the other load also be changed to a VMOVDQU instead of a VMOVDQA?

Similarly, if I use _mm_load_si128 (not the loadu variant), can the compiler fold one load into the next instruction? Performance-wise the execution is the same, with one fewer instruction retired (so, better); but the folded load will no longer cause an unaligned fault if the pointer isn't aligned.
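
A minimal sketch of the two variants being discussed; the function names and const-void* parameters are illustrative, not from the bug report:

    #include <emmintrin.h>

    /* Explicit unaligned loads: safe for any p1/p2; non-VEX code needs MOVDQU. */
    __m128i cmp_unaligned(const void *p1, const void *p2)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)p1);
        __m128i b = _mm_loadu_si128((const __m128i *)p2);
        return _mm_cmpeq_epi16(a, b);
    }

    /* Aligned loads: the compiler may fold one load into the compare.
       Non-VEX PCMPEQW xmm, m128 requires 16-byte alignment, while the VEX
       form VPCMPEQW does not, so folding drops the alignment check. */
    __m128i cmp_aligned(const void *p1, const void *p2)
    {
        __m128i a = _mm_load_si128((const __m128i *)p1);
        __m128i b = _mm_load_si128((const __m128i *)p2);
        return _mm_cmpeq_epi16(a, b);
    }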