http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59084
Bug ID: 59084
Summary: Sub-optimal vector moves in AVX2 vectorized loop for unaligned loads.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: hendrik.greving.intel at gmail dot com

The simple test case below produces sub-optimal split load/stores (AVX1/16 bytes), apparently due to the fact that g_a, g_b, g_c are put in a common section which doesn't guarantee alignment. Compiling with -fno-common actually produces good code. Only affects C, due to the described alignment issue above. This bug might be related to or be a duplicate of #41464.

Sub-optimal code: compiled with
gcc -S -O3 -march=core-avx2 foo.c -ftree-vectorizer-verbose=1 -dp -v -da

    vmovdqu (%rsi,%rax), %xmm0 # 160 sse2_loaddquv16qi [length = 5]
    vinserti128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0 # 161 avx_vec_concatv32qi/1 [length = 8]
    addl $1, %edx # 165 *addsi_1/1 [length = 3]
    vpaddd (%r8,%rax), %ymm0, %ymm0 # 162 *addv8si3/2 [length = 6]
    vmovups %xmm0, (%rcx,%rax) # 410 *movv16qi_internal/3 [length = 5]
    vextracti128 $0x1, %ymm0, 16(%rcx,%rax) # 164 vec_extract_hi_v32qi/2 [leng

Good code: compiled with
gcc -S -O3 -march=core-avx2 foo.c -ftree-vectorizer-verbose=1 -dp -v -da -fno-common

    vmovdqa g_a(%rax), %ymm0 # 26 *movv8si_internal/2 [length = 8]
    vpaddd g_b(%rax), %ymm0, %ymm0 # 27 *addv8si3/2 [length = 8]
    addq $32, %rax # 29 *adddi_1/1 [length = 4]
    vmovaps %ymm0, g_c-32(%rax) # 28 *movv8si_internal/3 [length = 8]

Test case:

    #include <stdio.h>
    #include <stdint.h>

    #define LENGTH 10000

    int g_a[LENGTH];
    int g_b[LENGTH];
    int g_c[LENGTH];

    void foo()
    {
      int i;
      for (i = 0; i < LENGTH; i++) {
        g_c[i] = g_a[i] + g_b[i];
      }
    }
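
For comparison, here is a sketch of a source-level variant of the test case that requests 32-byte alignment explicitly instead of relying on -fno-common. This is not part of the original report. The use of GCC's aligned attribute and the helper is_aligned32 are assumptions made for illustration, and whether GCC 4.9 then emits the aligned vmovdqa/vmovaps forms for these arrays has not been verified here.

```c
#include <stddef.h>
#include <stdint.h>

#define LENGTH 10000

/* Assumption: an explicit 32-byte alignment request (one full AVX2
   ymm register) on each array, in place of the plain tentative
   definitions used in the original test case. */
int g_a[LENGTH] __attribute__((aligned(32)));
int g_b[LENGTH] __attribute__((aligned(32)));
int g_c[LENGTH] __attribute__((aligned(32)));

/* Same loop as in the report's test case. */
void foo(void)
{
  int i;
  for (i = 0; i < LENGTH; i++) {
    g_c[i] = g_a[i] + g_b[i];
  }
}

/* Hypothetical helper: returns 1 if p lies on a 32-byte boundary. */
int is_aligned32(const void *p)
{
  return ((uintptr_t)p % 32) == 0;
}
```

Compiling this with the same command line as above and comparing the -dp output against the two listings would show whether the split 16-byte moves disappear. The helper only checks the addresses at run time; it says nothing about which instructions the vectorizer chose.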