https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088
--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> >
> > For code as follows:
> >
> > result += mask ? a[i] + x : 0;
> >
> > GCC:
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> >
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> >
> > Whereas LLVM:
> >
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
>
> For floating point these are not equivalent (adding zero isn't a no-op).

Yes, I agree these are not equivalent for floating point. But they are
equivalent if we specify -ffast-math.

I have double-checked LLVM: by default it also fails to vectorize a
conditional floating-point reduction. However, with -ffast-math LLVM
generates the same if-conversion IR sequence as for integers, and then
vectorization succeeds.

> > LLVM only has 1 use.
> >
> > Is it reasonable to swap the order in match.pd ?
>
> if-conversion could be taught to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid.  IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.

I know that patch; Robin is trying to fix the issue (in-order reduction)
that I posted. I have confirmed that the patch does not help here: it does
not modify the code for this case, so we still end up with multiple uses
of the PHI result in the conditional reduction.

The reduction fails because of this check:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used outside of the inner loop we cannot handle uses of the
     reduction value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

When nphi_def_loop_uses > 1, we fail to vectorize.

I have checked the LLVM code, and I think we can extend this function:

  strip_nop_cond_scalar_reduction

We should be able to strip all the statements until we reach the use of
the PHI result. LLVM is able to handle this case:

  for ()
    if (cond)
      result += a[i] + b[i] + c[i] + ....

No matter how many variables are added in the conditional reduction, LLVM
handles it, because it keeps iterating over the statements until it
reaches the PHI result:

  result_ssa_1 = PHI <>
  tmp1 = result_ssa_1 + a[i];
  tmp2 = tmp1 + b[i];
  tmp3 = tmp2 + c[i];
  ....

We keep iterating until we find result_ssa_1, which holds the reduction
variable.

Is LLVM's approach reasonable for GCC? If yes, I can translate the LLVM
code into GCC.

Thanks.
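
For illustration only, the two statement orders discussed above can be written as plain C loops. This is a sketch, not GCC or LLVM code: the function names and the `init` parameter (standing in for the initial PHI argument) are hypothetical. It shows why the swap is always valid for integers but, without -ffast-math, not for floating point: when the mask is false the "select then add" form computes `0.0 + acc`, which turns an accumulator of -0.0 into +0.0.

```c
#include <math.h>

/* GCC's current order: add first, then select.
   `acc` (the PHI result) is used twice per iteration. */
static float sum_add_then_select(const float *a, const int *mask, int n,
                                 float x, float init)
{
    float acc = init;
    for (int i = 0; i < n; i++) {
        float tmp = a[i] + x;          /* STMT 1 */
        float tmp2 = tmp + acc;        /* STMT 2: first use of acc */
        acc = mask[i] ? tmp2 : acc;    /* STMT 3: second use of acc */
    }
    return acc;
}

/* LLVM's order: select first, then add.
   `acc` is used exactly once per iteration. */
static float sum_select_then_add(const float *a, const int *mask, int n,
                                 float x, float init)
{
    float acc = init;
    for (int i = 0; i < n; i++) {
        float tmp = a[i] + x;              /* IR 1 */
        float tmp2 = mask[i] ? tmp : 0.0f; /* IR 2 */
        acc = tmp2 + acc;                  /* IR 3: only use of acc */
    }
    return acc;
}

/* Integer versions: the two orders always agree, because adding 0
   is an exact no-op (unsigned is used to avoid overflow concerns). */
static unsigned isum_add_then_select(const unsigned *a, const int *mask,
                                     int n, unsigned x)
{
    unsigned acc = 0;
    for (int i = 0; i < n; i++) {
        unsigned tmp2 = (a[i] + x) + acc;
        acc = mask[i] ? tmp2 : acc;
    }
    return acc;
}

static unsigned isum_select_then_add(const unsigned *a, const int *mask,
                                     int n, unsigned x)
{
    unsigned acc = 0;
    for (int i = 0; i < n; i++) {
        unsigned tmp2 = mask[i] ? (a[i] + x) : 0u;
        acc = tmp2 + acc;
    }
    return acc;
}
```

Calling both float variants with `init = -0.0f` and an all-false mask, then checking `signbit()` on the results, exposes the difference under default IEEE semantics; with ordinary inputs the two variants return the same value.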