On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> problem and would like to ask for advice.
>
> For example, for the following code:
>
> int main() {
>   int size = 1000;
>   int *foo = malloc(sizeof(int) * size);
>   int c1 = rand(), t1 = rand();
>
>   for (int i = 0; i < size; i++) {
>     if (foo[i] & c1) {
>       foo[i] = t1;
>     }
>   }
>
>   // prevents the loop above from being optimized
>   for (int i = 0; i < size; i++) {
>     printf("%d", foo[i]);
>   }
> }
>
> First of all, the if statement block in the loop will be converted to
> a MASK_STORE through if-conversion optimization. But after
> tree-vector, it will still become a branched form. The part of the
> final disassembly structure probably looks like below(Using IDA to do
> this), and you can see that there is still such a branch 'if ( !_ZF )'
> in it, which will lead to low efficiency.
>
> do
> {
>   while ( 1 )
>   {
>     __asm
>     {
>       vpand ymm0, ymm2, ymmword ptr [rax]
>       vpcmpeqd ymm0, ymm0, ymm1
>       vpcmpeqd ymm0, ymm0, ymm1
>       vptest ymm0, ymm0
>     }
>     if ( !_ZF )
>       break;
>     _RAX += 8;
>     if ( _RAX == v9 )
>       goto LABEL_5;
>   }
>   __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
>   _RAX += 8;
> }
> while ( _RAX != v9 );
>
> Why can't we just replace the vptest and if statement with some other
> instructions like vpblendvb so that it can be faster? Or is there a
> good way to do that?

The branch is added by optimize_mask_stores after vectorization because
fully masked (disabled) masked stores can incur quite a heavy penalty on
some architectures when they run into fault assists (read-only pages, but
also COW pages). All the microcode handling may then have to be carried
out multiple times, once for each such access to the same page. That can
cause a 1000x slowdown when you hit this case. Thus every masked store is
replaced by

  if (mask != 0)
    masked_store ();

and this is an optimization (which itself has a small cost).

Richard.

>
> Thanks
> Hanke Zhang
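
For illustration only, the effect of that guard can be modelled in plain C. This is a scalar sketch under stated assumptions, not what GCC emits: optimize_mask_stores works on GIMPLE, and the names LANES, masked_store and cond_store_guarded are made up for this example. Each group of 8 ints stands in for one 256-bit vector, and the masked store is skipped entirely when no lane is selected.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* assumption: 8 x 32-bit lanes, like one 256-bit vector */

/* Scalar model of a masked store: write src[l] only where mask[l] is set. */
static void masked_store(int *dst, const int *mask, const int *src)
{
    for (int l = 0; l < LANES; l++)
        if (mask[l])
            dst[l] = src[l];
}

/* Hypothetical model of the guarded loop from the thread.
   Returns how many masked stores were actually issued. */
size_t cond_store_guarded(int *foo, size_t n, int c1, int t1)
{
    assert(n % LANES == 0);  /* the sketch ignores the scalar epilogue */
    size_t stores = 0;

    for (size_t i = 0; i < n; i += LANES) {
        int mask[LANES], val[LANES];
        int any = 0;

        /* Build the per-lane mask for the condition foo[i] & c1. */
        for (int l = 0; l < LANES; l++) {
            mask[l] = (foo[i + l] & c1) != 0;
            val[l] = t1;
            any |= mask[l];
        }

        /* The guard: if (mask != 0) masked_store (); */
        if (any) {
            masked_store(&foo[i], mask, val);
            stores++;
        }
    }
    return stores;
}
```

Calling cond_store_guarded on a 16-element array in which only one element has a bit in common with c1 issues a single masked store, for the group containing that element. The other group is never written to, which is the case the guard is meant to keep cheap.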