On Wed, Oct 30, 2013 at 10:53:58AM +0100, Ondřej Bílka wrote:
> > Yesterday I've noticed that for AVX which allows unaligned operands in
> > AVX arithmetics instructions we still don't combine unaligned loads with the
> > AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
> > void
> > f1 (int *__restrict e, int *__restrict f)
> > {
> >   int i;
> >   for (i = 0; i < 1024; i++)
> >     e[i] = f[i] * 7;
> > }
> >
> > void
> > f2 (int *__restrict e, int *__restrict f)
> > {
> >   int i;
> >   for (i = 0; i < 1024; i++)
> >     e[i] = f[i];
> > }
> > we have:
> > 	vmovdqu	(%rsi,%rax), %xmm0
> > 	vpmulld	%xmm1, %xmm0, %xmm0
> > 	vmovups	%xmm0, (%rdi,%rax)
> > in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> > *mov<mode>_internal patterns (and various others) use misaligned_operand
> > to see if they should emit vmovaps or vmovups (etc.), so as suggested by
>
> That is intentional. In pre-haswell architectures splitting load is
> faster than 32 byte load.
But the above is a 16 byte unaligned load.  Furthermore, GCC supports
-mavx256-split-unaligned-load and can emit a 32 byte load either as a
single unaligned 32 byte load, or as a merge of two 16 byte unaligned
loads.  The patch affects only the cases where we were already emitting
16 byte or 32 byte unaligned loads rather than split loads.

	Jakub
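
To make the difference concrete, the folded form for the quoted f1 loop
would look roughly like the following; the register allocation here is
only illustrative, not exact compiler output:

	# today: separate 16 byte unaligned load, then the multiply
	vmovdqu	(%rsi,%rax), %xmm0
	vpmulld	%xmm1, %xmm0, %xmm0
	vmovups	%xmm0, (%rdi,%rax)

	# with the load folded into the VEX-encoded arithmetic insn,
	# which does not require alignment of its memory operand
	vpmulld	(%rsi,%rax), %xmm1, %xmm0
	vmovups	%xmm0, (%rdi,%rax)

A split 32 byte load, by contrast, stays as two 16 byte pieces (e.g. an
unaligned 16 byte load of the low half plus a vinsertf128 of the high
half) and is not affected by this.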