http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59539

            Bug ID: 59539
           Summary: Missed optimisation: VEX-prefixed operations don't
                    need aligned data
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org

Consider the following code:

#include <immintrin.h>
int f(void *p1, void *p2)
{
    __m128i d1 = _mm_loadu_si128((__m128i*)p1);
    __m128i d2 = _mm_loadu_si128((__m128i*)p2);
    __m128i result = _mm_cmpeq_epi16(d1, d2);
    return _mm_movemask_epi8(result);
}

Compiled with -O2 -mavx, GCC 4.9 (current trunk) produces the following code:
f:
        vmovdqu (%rdi), %xmm0
        vmovdqu (%rsi), %xmm1
        vpcmpeqw        %xmm1, %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret

One of the two VMOVDQU instructions is unnecessary, since the VEX-prefixed
VPCMPEQW instruction can take an unaligned memory operand without faulting. The
Intel Software Developer's Manual, Volume 1, Chapter 14, says in Section 14.9
"Memory alignment":

> With the exception of explicitly aligned 16 or 32 byte SIMD load/store
> instructions, most VEX-encoded, arithmetic and data processing instructions
> operate in a flexible environment regarding memory address alignment, i.e.
> VEX-encoded instruction with 32-byte or 16-byte load semantics will support
> unaligned load operation by default. Memory arguments for most instructions
> with VEX prefix operate normally without causing #GP(0) on any
> byte-granularity alignment (unlike Legacy SSE instructions). The instructions
> that require explicit memory alignment requirements are listed in Table 14-22.

Clang and ICC have already implemented this optimisation:

Clang 3.3 produces:
f:                                      # @f
        vmovdqu (%rsi), %xmm0
        vpcmpeqw        (%rdi), %xmm0, %xmm0
        vpmovmskb       %xmm0, %eax
        ret

Similarly, ICC 14 produces:
f:
        vmovdqu   (%rdi), %xmm0
        vpcmpeqw  (%rsi), %xmm0, %xmm1
        vpmovmskb %xmm1, %eax
        ret
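
The same rule covers VEX-encoded instructions with 32-byte load semantics, so
the analogous 256-bit fold should be possible as well. A minimal sketch of what
I mean (the function name g and the use of the float comparison intrinsics are
only for illustration, chosen because 256-bit integer compares would need AVX2
while this needs only -mavx):

#include <immintrin.h>
int g(void *p1, void *p2)
{
    /* Unaligned 32-byte loads; under the rule quoted above, either one
       could be folded into the memory operand of VCMPPS, e.g.
       "vcmpeqps (%rdi), %ymm0, %ymm0", instead of a separate VMOVUPS. */
    __m256 d1 = _mm256_loadu_ps((const float *)p1);
    __m256 d2 = _mm256_loadu_ps((const float *)p2);
    __m256 eq = _mm256_cmp_ps(d1, d2, _CMP_EQ_OQ);
    return _mm256_movemask_ps(eq);
}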
