On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> But the above is 16 byte unaligned load.  Furthermore, GCC supports
> -mavx256-split-unaligned-load and can emit 32 byte loads either as an
> unaligned 32 byte load, or merge of 16 byte unaligned loads.  The patch
> affects only the cases where we were already emitting 16 byte or 32 byte
> unaligned loads rather than split loads.

With my patch, the differences (in all cases only on f1) for
-O2 -mavx -ftree-vectorize are (16 byte unaligned load, not split):
-       vmovdqu (%rsi,%rax), %xmm0
-       vpmulld %xmm1, %xmm0, %xmm0
+       vpmulld (%rsi,%rax), %xmm1, %xmm0
        vmovups %xmm0, (%rdi,%rax)
with -O2 -mavx2 -ftree-vectorize (again, load wasn't split):
-       vmovdqu (%rsi,%rax), %ymm0
-       vpmulld %ymm1, %ymm0, %ymm0
+       vpmulld (%rsi,%rax), %ymm1, %ymm0
        vmovups %ymm0, (%rdi,%rax)
and with -O2 -mavx2 -mavx256-split-unaligned-load:
        vmovdqu (%rsi,%rax), %xmm0
        vinserti128     $0x1, 16(%rsi,%rax), %ymm0, %ymm0
-       vpmulld %ymm1, %ymm0, %ymm0
+       vpmulld %ymm0, %ymm1, %ymm0
        vmovups %ymm0, (%rdi,%rax)
(the last change is just giving RTL optimizers more freedom by not
doing the SUBREG on the lhs).
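For reference, the kind of loop that produces the code above is along
these lines (a minimal sketch only; the actual f1 from the testcase
isn't quoted here, so the signature and the multiplier are assumptions):

/* Sketch of a testcase along the lines of f1: element-wise multiply
   of an int array by a loop-invariant value, which the vectorizer
   turns into the vpmulld loops shown above.  The multiplier ends up
   broadcast into %xmm1/%ymm1 outside the loop.  */
void
f1 (int *__restrict dst, const int *__restrict src, int n)
{
  int i;
  for (i = 0; i < n; i++)
    dst[i] = src[i] * 23;	/* multiplier value is made up */
}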

        Jakub
