Hi, I've been working on vectorization-related optimizations lately. GCC seems to miss some optimizations, and I would like to ask whether this can be fixed.
For example, consider the following program using AVX2:

    #include <immintrin.h>

    // reg->node2[i].state is an unsigned long long variable
    // reg->size is an integer variable that represents the number of iterations
    for (int i = 0; i < reg->size; i += 4) {
        /* original code:
           unsigned long long state = reg->node2[i].state;
           if (state & (1LLU << (j + 1) | 1LLU << (width + j)))
               state ^= (1LLU << j);
           state ^= (1LLU << (width + j));
        */
        __m256i state = _mm256_loadu_si256((__m256i *)((char*)(reg->node2) + i * sizeof(unsigned long long)));
        __m256i mask1 = _mm256_set1_epi64x(1LLU << (j + 1) | 1LLU << (width + j));
        // cmp
        __m256i tmp1 = _mm256_and_si256(state, mask1);
        __m256i cmp1 = _mm256_cmpeq_epi64(tmp1, mask1);
        // xor
        __m256i xor_param = _mm256_set1_epi64x(1LLU << j);
        __m256i tmp2 = _mm256_and_si256(xor_param, cmp1);
        __m256i xor_result = _mm256_xor_si256(state, tmp2);
        // xor
        __m256i xor_param2 = _mm256_set1_epi64x(1LLU << (width + j));
        __m256i xor_res2 = _mm256_xor_si256(xor_result, xor_param2);
        _mm256_storeu_si256((__m256i *)((char*)(reg->node2) + i * sizeof(unsigned long long)), xor_res2);
    }

I expected GCC to generate assembly code like this:

    vpxor    ymm6, ymm2, ymmword ptr [r9+r15*8]
    vpand    ymm4, ymm1, ymm6
    vpcmpeqq ymm5, ymm4, ymm1
    vpand    ymm7, ymm3, ymm5
    vpxor    ymm8, ymm6, ymm7
    vmovdqu  ymmword ptr [r9+r15*8], ymm8

But the assembly code it actually generates looks like this:

    vpand    ymm0, ymm2, ymmword ptr [rsi+rax*8]
    vpxor    ymm1, ymm4, ymmword ptr [rsi+rax*8]
    vpcmpeqq ymm0, ymm0, ymm2
    vpand    ymm0, ymm0, ymm5
    vpxor    ymm0, ymm0, ymm1
    vmovdqu  ymmword ptr [rsi+rax*8], ymm0

That is, GCC has hoisted the second XOR ahead of the compare, and in doing so it reads the same memory operand (ymmword ptr [rsi+rax*8]) a second time. I think this may reduce efficiency, and when I profile with perf, this instruction accounts for a large proportion of the samples. From the dump files, I also found that these transformations are performed in the RTL passes, and they don't seem easy to change.
Is there a good way to get GCC to generate the assembly code I want? Can I achieve that by modifying my own source files, or would it require changing the GCC source code?
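
For anyone comparing the two instruction sequences, here is a minimal scalar sketch of what each 64-bit lane of the intrinsics loop above computes. It is only an illustration under stated assumptions: the names node_t, lane_update, and update_all are hypothetical, and the single-field struct layout is inferred from the 8-byte stride in the load address rather than taken from the real definition of reg->node2. The sketch follows the _mm256_cmpeq_epi64 test, which requires both mask bits to be set; that is stricter than the commented-out scalar `if`, which is true when either bit is set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the element type of reg->node2.
   Assumption: each element is exactly one 64-bit state word. */
typedef struct {
    uint64_t state;
} node_t;

/* Scalar equivalent of one 64-bit lane of the AVX2 loop body. */
uint64_t lane_update(uint64_t state, int j, int width)
{
    const uint64_t mask1   = (1ULL << (j + 1)) | (1ULL << (width + j));
    const uint64_t flip_j  = 1ULL << j;
    const uint64_t flip_wj = 1ULL << (width + j);

    /* vpand + vpcmpeqq: all-ones if every bit of mask1 is set, else zero */
    uint64_t cmp = ((state & mask1) == mask1) ? ~0ULL : 0ULL;

    /* vpand + vpxor: flip bit j only where the compare succeeded */
    state ^= flip_j & cmp;

    /* vpxor: unconditional flip of bit (width + j) */
    state ^= flip_wj;

    return state;
}

/* Apply the update to n elements, one lane at a time. */
void update_all(node_t *nodes, size_t n, int j, int width)
{
    for (size_t i = 0; i < n; i++)
        nodes[i].state = lane_update(nodes[i].state, j, width);
}
```

Because the final XOR does not depend on the compare result, either ordering of the two XORs gives the same value per lane; the difference between the two listings is only in scheduling and in how many times the memory operand is read.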