On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> But the above is 16 byte unaligned load.  Furthermore, GCC supports
> -mavx256-split-unaligned-load and can emit 32 byte loads either as an
> unaligned 32 byte load, or merge of 16 byte unaligned loads.  The patch
> affects only the cases where we were already emitting 16 byte or 32 byte
> unaligned loads rather than split loads.
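For reference, a minimal sketch of the kind of loop being discussed; the
actual f1 testcase is not quoted in this message, so the element type and
the multiplication by a scalar argument are assumptions, chosen only so
that the vectorizer emits an unaligned integer load feeding vpmulld:

/* Hypothetical stand-in for the f1 testcase (the real one is not quoted
   here); an int multiply over possibly unaligned arrays, which at
   -O2 -mavx -ftree-vectorize is vectorized into roughly the shape of
   the vmovdqu/vpmulld sequences shown below.  */
void
f1 (int *__restrict dst, const int *__restrict src, int m, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * m;
}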
With my patch, the differences (in all cases only on f1) for
-O2 -mavx -ftree-vectorize are (16 byte unaligned load, not split):
-       vmovdqu (%rsi,%rax), %xmm0
-       vpmulld %xmm1, %xmm0, %xmm0
+       vpmulld (%rsi,%rax), %xmm1, %xmm0
        vmovups %xmm0, (%rdi,%rax)
with -O2 -mavx2 -ftree-vectorize (again, the load wasn't split):
-       vmovdqu (%rsi,%rax), %ymm0
-       vpmulld %ymm1, %ymm0, %ymm0
+       vpmulld (%rsi,%rax), %ymm1, %ymm0
        vmovups %ymm0, (%rdi,%rax)
and with -O2 -mavx2 -mavx256-split-unaligned-load:
        vmovdqu (%rsi,%rax), %xmm0
        vinserti128     $0x1, 16(%rsi,%rax), %ymm0, %ymm0
-       vpmulld %ymm1, %ymm0, %ymm0
+       vpmulld %ymm0, %ymm1, %ymm0
        vmovups %ymm0, (%rdi,%rax)
(the last change just gives the RTL optimizers more freedom by not doing the
SUBREG on the lhs).

        Jakub