Hi, I've recently been working on vectorization in GCC. I'm stuck on a small problem and would like to ask for advice.

For example, consider the following code:

    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        int size = 1000;
        int *foo = malloc(sizeof(int) * size);
        int c1 = rand(), t1 = rand();

        for (int i = 0; i < size; i++) {
            if (foo[i] & c1) {
                foo[i] = t1;
            }
        }

        // prevents the loop above from being optimized away
        for (int i = 0; i < size; i++) {
            printf("%d", foo[i]);
        }
    }

First, the if block in the loop is converted to a MASK_STORE by the if-conversion pass. But after the tree vectorizer (tree-vect) has run, the store still ends up behind a branch.

Part of the final code, as decompiled with IDA, looks roughly like the listing below. You can see that there is still a branch 'if ( !_ZF )' in it, which I think will lead to low efficiency.

    do
    {
        while ( 1 )
        {
            __asm
            {
                vpand    ymm0, ymm2, ymmword ptr [rax]
                vpcmpeqd ymm0, ymm0, ymm1
                vpcmpeqd ymm0, ymm0, ymm1
                vptest   ymm0, ymm0
            }
            if ( !_ZF )
                break;
            _RAX += 8;
            if ( _RAX == v9 )
                goto LABEL_5;
        }
        __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
        _RAX += 8;
    }
    while ( _RAX != v9 );

Why can't we just replace the vptest and the if statement with some other instruction, such as vpblendvb, so that it can be faster? Or is there a good way to do that?

Thanks,
Hanke Zhang
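
P.S. To make the question concrete, here is a rough sketch of the branch-free form I have in mind. It is only an illustration and not GCC output: the function names (cond_store_branchy, cond_store_select, cond_store_avx2, check_equivalence) are made up for this example, and the AVX2 variant is a hand-written guess at a vpblendvb-based loop that is only compiled when AVX2 is enabled (e.g. with -mavx2).

```c
#include <string.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Original form: the store happens only when the condition holds. */
static void cond_store_branchy(int *foo, int size, int c1, int t1)
{
    for (int i = 0; i < size; i++) {
        if (foo[i] & c1)
            foo[i] = t1;
    }
}

/* Select form: every element is stored back, either t1 or its old value. */
static void cond_store_select(int *foo, int size, int c1, int t1)
{
    for (int i = 0; i < size; i++)
        foo[i] = (foo[i] & c1) ? t1 : foo[i];
}

#ifdef __AVX2__
/* Hand-written AVX2 version of the select form, 8 ints per iteration. */
static void cond_store_avx2(int *foo, int size, int c1, int t1)
{
    const __m256i vc1  = _mm256_set1_epi32(c1);
    const __m256i vt1  = _mm256_set1_epi32(t1);
    const __m256i zero = _mm256_setzero_si256();
    int i = 0;

    for (; i + 8 <= size; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(foo + i));
        /* Lanes where (foo[i] & c1) == 0 become all-ones. */
        __m256i is_zero = _mm256_cmpeq_epi32(_mm256_and_si256(v, vc1), zero);
        /* vpblendvb: keep the old value where is_zero, take t1 elsewhere. */
        __m256i r = _mm256_blendv_epi8(vt1, v, is_zero);
        _mm256_storeu_si256((__m256i *)(foo + i), r);
    }
    /* Scalar tail for the remaining elements. */
    for (; i < size; i++) {
        if (foo[i] & c1)
            foo[i] = t1;
    }
}
#endif

/* Returns 1 if all variants produce the same array, 0 otherwise. */
static int check_equivalence(void)
{
    enum { N = 1003 };
    static int a[N], b[N];
    const int c1 = 0x10, t1 = -1;

    for (int i = 0; i < N; i++)
        a[i] = b[i] = (i * 37) % 101;

    cond_store_branchy(a, N, c1, t1);
    cond_store_select(b, N, c1, t1);
    if (memcmp(a, b, sizeof a) != 0)
        return 0;

#ifdef __AVX2__
    for (int i = 0; i < N; i++)
        b[i] = (i * 37) % 101;
    cond_store_avx2(b, N, c1, t1);
    if (memcmp(a, b, sizeof a) != 0)
        return 0;
#endif
    return 1;
}
```

One difference I am aware of: the select and blend versions store to every element, including the ones the original loop leaves untouched, so this may not be a transformation the compiler is allowed to make on its own. An unconditional vpmaskmovd without the vptest check would not have that problem.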