Hi, I've recently been working on vectorization in GCC. I'm stuck on a
small problem and would like to ask for advice.

For example, for the following code:

#include <stdio.h>
#include <stdlib.h>

int main() {
  int size = 1000;
  int *foo = malloc(sizeof(int) * size);
  int c1 = rand(), t1 = rand();

  for (int i = 0; i < size; i++) {
    if (foo[i] & c1) {
      foo[i] = t1;
    }
  }

  // prevents the loop above from being optimized away
  for (int i = 0; i < size; i++) {
    printf("%d", foo[i]);
  }
}

First, the if-conversion pass converts the if block in the loop into a
MASK_STORE. But after the tree vectorizer runs, the generated code
still ends up in a branched form. Part of the final disassembly looks
roughly like the following (decompiled with IDA). You can see that
there is still a branch, 'if ( !_ZF )', in it, which I believe makes
the loop less efficient.

do
  {
    while ( 1 )
    {
      __asm
      {
        vpand   ymm0, ymm2, ymmword ptr [rax]
        vpcmpeqd ymm0, ymm0, ymm1
        vpcmpeqd ymm0, ymm0, ymm1
        vptest  ymm0, ymm0
      }
      if ( !_ZF )
        break;
      _RAX += 8;
      if ( _RAX == v9 )
        goto LABEL_5;
    }
    __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
    _RAX += 8;
  }
  while ( _RAX != v9 );

Why can't we just replace the vptest and the branch with some other
instruction such as vpblendvb, so that the loop is faster? Or is there
a better way to do this?
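
To make the question concrete, here is a source-level sketch of the two
forms I am comparing. The function names update_branch and update_select
are made up for this mail and are not from GCC. I have not checked what
GCC emits for the second one; my assumption is that a select like this
could map to a blend followed by a plain store. Note that it writes every
element back, which the original loop does not do.

```c
#include <stddef.h>

/* Original form: the store happens only when the test passes.
   This is the shape that if-conversion turns into a masked store. */
void update_branch(int *foo, size_t n, int c1, int t1)
{
    for (size_t i = 0; i < n; i++) {
        if (foo[i] & c1)
            foo[i] = t1;
    }
}

/* Select form (hypothetical rewrite): every element is written back,
   either with t1 or with its old value, so the loop body has no
   conditional store. */
void update_select(int *foo, size_t n, int c1, int t1)
{
    for (size_t i = 0; i < n; i++) {
        int old = foo[i];
        foo[i] = (old & c1) ? t1 : old;
    }
}
```

Calling both functions on copies of the same array should leave the two
copies with identical contents. The difference is only in which elements
get written.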

Thanks
Hanke Zhang
