The code for a simple loop like

    for (i = 0; i < LENGTH-1; i++) {
        g_c[i] = g_a[i] + g_b[i];
    }

looks good for g++ (4.9.0 20131028 (experimental)) (-O3 core-avx2):

.L2:
        vmovdqa g_a(%rax), %ymm0        # 26    *movv8si_internal/2     [length = 8]
        vpaddd  g_b(%rax), %ymm0, %ymm0 # 27    *addv8si3/2     [length = 8]
        addq    $32, %rax               # 29    *adddi_1/1      [length = 4]
        vmovaps %ymm0, g_c-32(%rax)     # 28    *movv8si_internal/3     [length = 8]
        cmpq    $39968, %rax            # 31    *cmpdi_1/1      [length = 6]
        jne     .L2                     # 32    *jcc_1  [length = 2]

but for gcc, I'm getting

.L4:
        vmovdqu (%rsi,%rax), %xmm0                      # 156   sse2_loaddquv16qi       [length = 5]
        vinserti128     $0x1, 16(%rsi,%rax), %ymm0, %ymm0       # 157   avx_vec_concatv32qi/1   [length = 8]
        addl    $1, %edx                                # 161   *addsi_1/1      [length = 3]
        vpaddd  (%rdi,%rax), %ymm0, %ymm0               # 158   *addv8si3/2     [length = 5]
        vmovups %xmm0, (%rcx,%rax)                      # 412   *movv16qi_internal/3    [length = 5]
        vextracti128    $0x1, %ymm0, 16(%rcx,%rax)      # 160   vec_extract_hi_v32qi/2  [length = 8]
        addq    $32, %rax                               # 162   *adddi_1/1      [length = 4]
        cmpl    $1248, %edx                             # 164   *cmpsi_1/1      [length = 6]
        jbe     .L4                                     # 165   *jcc_1  [length = 2]

unless I add "__attribute__ ((aligned (64)))" to the declarations of
g_a, g_b and g_c.

Two questions:

1. Does C have different alignment requirements/specs than C++? (I
   don't think so.) But if so, why does gcc not just align the arrays
   itself? They are defined in the same module in my example.

2. Leaving the alignment question aside, why does gcc not just use
   full-width AVX2 (ymm) moves as g++ does, instead of splitting each
   load and store into two 128-bit halves?

I guess my question is: is this a bug or a feature?

Thanks,
Regards,
Hendrik
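
For reference, a minimal sketch of a test case along the lines described above. The post does not show the declarations, so several things here are assumptions: LENGTH = 10000 (the loop bounds in the listings are consistent with it but do not pin it down), the int element type (suggested by vpaddd and the v8si patterns), and the helper names add_arrays and check_alignment, which are made up for illustration. The attribute is the one quoted in the post; removing it from the three declarations should give the variant without forced alignment.

```c
#include <assert.h>
#include <stdint.h>

#define LENGTH 10000    /* assumption: the post does not state LENGTH */

/* Global arrays, all defined in the same translation unit.
 * The aligned attribute is the one quoted in the post; int is assumed. */
int g_a[LENGTH] __attribute__ ((aligned (64)));
int g_b[LENGTH] __attribute__ ((aligned (64)));
int g_c[LENGTH] __attribute__ ((aligned (64)));

/* The loop from the post, wrapped in a (hypothetical) function.
 * Note that it stops at LENGTH-2, so the last element is not written. */
void add_arrays(void)
{
    int i;

    for (i = 0; i < LENGTH-1; i++) {
        g_c[i] = g_a[i] + g_b[i];
    }
}

/* Hypothetical helper: confirm that the arrays really start on a
 * 64-byte boundary, which is what the attribute requests. */
void check_alignment(void)
{
    assert((uintptr_t)g_a % 64 == 0);
    assert((uintptr_t)g_b % 64 == 0);
    assert((uintptr_t)g_c % 64 == 0);
}
```

The flags appear in the post as "-O3 core-avx2"; assuming that means -O3 -march=core-avx2, compiling this file (say test.c, a placeholder name) once with gcc and once with g++ using -O3 -march=core-avx2 -S -dp should produce annotated assembly comparable to the two listings.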