On Wed, Oct 30, 2013 at 10:47:13AM +0100, Jakub Jelinek wrote:
> Hi!
>
> Yesterday I noticed that for AVX, which allows unaligned operands in
> arithmetic instructions, we still don't combine unaligned loads with the
> AVX arithmetic instructions. So, say, for -O2 -mavx -ftree-vectorize
> void
> f1 (int *__restrict e, int *__restrict f)
> {
> int i;
> for (i = 0; i < 1024; i++)
> e[i] = f[i] * 7;
> }
>
> void
> f2 (int *__restrict e, int *__restrict f)
> {
> int i;
> for (i = 0; i < 1024; i++)
> e[i] = f[i];
> }
> we have:
> vmovdqu (%rsi,%rax), %xmm0
> vpmulld %xmm1, %xmm0, %xmm0
> vmovups %xmm0, (%rdi,%rax)
> in the first loop. Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by [...]
That is intentional. On pre-Haswell architectures, splitting an unaligned
32-byte load into two 16-byte halves is faster than a single unaligned
32-byte load.
See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for
details.
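
For concreteness, here is a minimal sketch of the two load strategies
written with AVX intrinsics (the function names are made up for
illustration; compile with -mavx):

#include <immintrin.h>

/* One unaligned 32-byte load: what folding the load into the
   arithmetic instruction relies on; fast on Haswell and later.  */
__m256
load32_unaligned (const float *p)
{
  return _mm256_loadu_ps (p);
}

/* Split into two unaligned 16-byte halves: preferred on pre-Haswell
   parts such as Sandy Bridge, per the optimization manual.  */
__m256
load32_split (const float *p)
{
  __m128 lo = _mm_loadu_ps (p);
  __m128 hi = _mm_loadu_ps (p + 4);
  return _mm256_insertf128_ps (_mm256_castps128_ps256 (lo), hi, 1);
}

GCC already exposes this trade-off via -mavx256-split-unaligned-load
(and the matching -mavx256-split-unaligned-store), so presumably any
combining of unaligned loads into arithmetic operands would have to
respect that tuning.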