Question about vectorization optimization during RTL-PASS

Hanke Zhang via Gcc Sun, 12 Nov 2023 23:56:51 -0800

Hi, I've been working on vectorization-related optimizations lately.
GCC seems to miss an optimization here. I would like to ask whether
this can be fixed.


For example, consider the following program using AVX2:

#include <immintrin.h>
// reg->node2[i].state is an unsigned long long variable
// reg->size is an integer variable giving the number of elements to process

for (int i = 0; i < reg->size; i+=4) {
  /* original code:
  unsigned long long state = reg->node2[i].state;
  if (state & (1LLU << j + 1 | 1LLU << width + j))
      state ^= (1LLU << j);
  state ^= (1LLU  << width + j);
  */
  __m256i state = _mm256_loadu_si256(
      (__m256i *)((char *)(reg->node2) + i * sizeof(unsigned long long)));

  __m256i mask1 = _mm256_set1_epi64x(1LLU << j + 1 | 1LLU << width + j);
  // cmp
  __m256i tmp1 = _mm256_and_si256(state, mask1);
  __m256i cmp1 = _mm256_cmpeq_epi64(tmp1, mask1);
  // xor
  __m256i xor_param = _mm256_set1_epi64x(1LLU << j);
  __m256i tmp2 = _mm256_and_si256(xor_param, cmp1);
  __m256i xor_result = _mm256_xor_si256(state, tmp2);
  // xor
  __m256i xor_param2 = _mm256_set1_epi64x(1LLU << width + j);
  __m256i xor_res2 = _mm256_xor_si256(xor_result, xor_param2);

  _mm256_storeu_si256(
      (__m256i *)((char *)(reg->node2) + i * sizeof(unsigned long long)),
      xor_res2);
}
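
To make the per-lane logic of the loop above easier to check, here is a minimal scalar sketch of what the intrinsics compute for one 64-bit element. The function name update_state_scalar is hypothetical and not part of my program. Note that it mirrors the intrinsics, which flip bit j only when both mask bits are set (compare-equal against mask1), whereas the commented-out scalar code tests whether either bit is set.

```c
#include <assert.h>

/* Scalar model of one 64-bit lane of the AVX2 loop (hypothetical helper). */
unsigned long long update_state_scalar(unsigned long long state, int j, int width)
{
    /* Shift counts must stay below 64 to avoid undefined behaviour. */
    assert(j >= 0 && width >= 1 && width + j < 64);

    const unsigned long long mask1 = 1ULL << (j + 1) | 1ULL << (width + j);

    /* Models vpand + vpcmpeqq: all-ones if both mask bits are set, else zero. */
    const unsigned long long both = ((state & mask1) == mask1) ? ~0ULL : 0ULL;

    /* Models vpand + vpxor: conditionally flip bit j. */
    state ^= (1ULL << j) & both;

    /* Models the final vpxor: always flip bit width + j. */
    state ^= 1ULL << (width + j);
    return state;
}
```

Running each element through this function and comparing against the vectorized result is one way to confirm that a reordering of the two XORs does not change the output.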

I expect GCC to generate assembly code like this:

vpxor   ymm6, ymm2, ymmword ptr [r9+r15*8]
vpand   ymm4, ymm1, ymm6
vpcmpeqq ymm5, ymm4, ymm1
vpand   ymm7, ymm3, ymm5
vpxor   ymm8, ymm6, ymm7
vmovdqu ymmword ptr [r9+r15*8], ymm8

But the actual generated assembly code looks like this:

vpand   ymm0, ymm2, ymmword ptr [rsi+rax*8]
vpxor   ymm1, ymm4, ymmword ptr [rsi+rax*8]
vpcmpeqq ymm0, ymm0, ymm2
vpand   ymm0, ymm0, ymm5
vpxor   ymm0, ymm0, ymm1
vmovdqu ymmword ptr [rsi+rax*8], ymm0

That is, GCC has moved the second XOR earlier and, as a result, uses
the memory operand (ymmword ptr [rsi+rax*8]) twice, so the same address
is loaded one extra time. I think this may reduce efficiency, and when
I profile with perf this instruction accounts for a large share of the
samples.

From the dump files I also found that these transformations are done
in the RTL passes, and they do not seem easy to change. Is there a good
way to get GCC to generate the assembly code I want? Can I get it by
modifying my own source files or the GCC source code?
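
One source-level change that might be worth trying is an empty inline asm statement right after the load. This is only a sketch under assumptions: update_states_avx2 and the flat states array are hypothetical stand-ins for my loop over reg->node2, and the target("avx2") attribute is used only so the fragment compiles without -mavx2. The "+x" constraint asks for the value to live in a vector register, which should stop the load from being folded into both vpand and vpxor. I have not verified that this produces exactly the instruction sequence I listed as expected.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: updates n 64-bit states in place, 4 per iteration. */
__attribute__((target("avx2")))
void update_states_avx2(unsigned long long *states, size_t n, int j, int width)
{
    assert(n % 4 == 0 && j >= 0 && width >= 1 && width + j < 64);

    const __m256i mask1 = _mm256_set1_epi64x(
        (long long)(1ULL << (j + 1) | 1ULL << (width + j)));
    const __m256i flip_j = _mm256_set1_epi64x((long long)(1ULL << j));
    const __m256i flip_w = _mm256_set1_epi64x((long long)(1ULL << (width + j)));

    for (size_t i = 0; i < n; i += 4) {
        __m256i state = _mm256_loadu_si256((const __m256i *)(states + i));

        /* Empty asm with a read-write vector-register operand: the compiler
           must assume `state` may have been modified here, so it can no
           longer substitute the memory operand for it later on. */
        __asm__("" : "+x"(state));

        /* All-ones in each lane where both mask bits are set. */
        __m256i both = _mm256_cmpeq_epi64(_mm256_and_si256(state, mask1), mask1);
        /* Conditionally flip bit j, then always flip bit width + j. */
        __m256i res = _mm256_xor_si256(state, _mm256_and_si256(flip_j, both));
        res = _mm256_xor_si256(res, flip_w);

        _mm256_storeu_si256((__m256i *)(states + i), res);
    }
}
```

Whether the extra register move is cheaper than the second memory operand would still have to be measured with perf.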
