http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539
Bug ID: 59539
Summary: Missed optimisation: VEX-prefixed operations don't need aligned data
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org

Consider the following code:

#include <immintrin.h>
int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}

If compiled with -O2 -mavx, it produces the following code with GCC 4.9
(current trunk):

f:
        vmovdqu   (%rdi), %xmm0
        vmovdqu   (%rsi), %xmm1
        vpcmpeqw  %xmm1, %xmm0, %xmm0
        vpmovmskb %xmm0, %eax
        ret

One of the two VMOVDQU instructions is unnecessary, since the VEX-prefixed
VPCMPEQW instruction can take an unaligned memory operand without faulting.
The Intel Software Developer's Manual Volume 1, Chapter 14 says in 14.9
"Memory alignment":

> With the exception of explicitly aligned 16 or 32 byte SIMD load/store
> instructions, most VEX-encoded, arithmetic and data processing instructions
> operate in a flexible environment regarding memory address alignment, i.e.
> VEX-encoded instruction with 32-byte or 16-byte load semantics will support
> unaligned load operation by default. Memory arguments for most instructions
> with VEX prefix operate normally without causing #GP(0) on any
> byte-granularity alignment (unlike Legacy SSE instructions). The
> instructions that require explicit memory alignment requirements are listed
> in Table 14-22.

Clang and ICC have already implemented this optimisation.

Clang 3.3 produces:

f:                                      # @f
        vmovdqu   (%rsi), %xmm0
        vpcmpeqw  (%rdi), %xmm0, %xmm0
        vpmovmskb %xmm0, %eax
        ret

Similarly, ICC 14 produces:

f:
        vmovdqu   (%rdi), %xmm0
        vpcmpeqw  (%rsi), %xmm0, %xmm1
        vpmovmskb %xmm1, %eax
        ret
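
For reference, a minimal self-contained test (not part of the original report;
the buffer size, the misaligned offsets, and the expected 0xFFFF mask are
illustrative assumptions) that exercises f() on deliberately misaligned
pointers. Built with -O2 -mavx, it must not raise #GP(0) and should print the
all-ones mask regardless of whether the compiler folds the load into VPCMPEQW:

/* Sketch only: checks that comparing identical, misaligned 16-byte blocks
   yields a full 0xFFFF byte mask and does not fault under AVX code gen. */
#include <immintrin.h>
#include <stdio.h>

int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}

int main(void)
{
    /* 33 zeroed bytes so buf+1 and buf+17 are valid, odd (misaligned)
       starting addresses for 16-byte loads. */
    unsigned char buf[33] = {0};

    /* Identical data, so every 16-bit lane compares equal. */
    int mask = f(buf + 1, buf + 17);
    printf("mask = 0x%x (expected 0xffff)\n", mask);
    return mask == 0xFFFF ? 0 : 1;
}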