[Bug tree-optimization/109088] GCC does not always vectorize conditional reduction

juzhe.zhong at rivai dot ai via Gcc-bugs Wed, 27 Sep 2023 00:34:46 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109088


--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #9)
> (In reply to JuzheZhong from comment #8)
> > It's because the order of the operations we are doing:
> > 
> > For code as follows:
> > 
> > result += mask ? a[i] + x : 0;
> > 
> > GCC:
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > STMT 1. tmp = a[i] + x;
> > STMT 2. tmp2 = tmp + result_ssa_1;
> > STMT 3. result_ssa_2 = mask ? tmp2 : result_ssa_1;
> > 
> > Here we can see both STMT 2 and STMT 3 are using 'result_ssa_1',
> > we end up with 2 uses of the PHI result. Then, we failed to vectorize.
> > 
> > Wheras LLVM:
> > 
> > result_ssa_1 = PHI <result_ssa_2, 0>
> > ...
> > IR 1. tmp = a[i] + x;
> > IR 2. tmp2 = mask ? tmp : 0;
> > IR 3. result_ssa_2 = tmp2 + result_ssa_1.
> 
> For floating point these are not equivalent (adding zero isn't a no-op).


Yes, I agree these are not equivalent for floating-point.
But I they are equivalent if we specify -ffast-math.

I have double checked LLVM, they failed to vectorize conditionl
floating-point reduction too by default.

However, if we specify LLVM -ffast-math, it will generate the same 
if-conversion IR sequence as integer, then vectorization succeed.


> 
> > LLVM only has 1 use.
> > 
> > Is it reasonable to swap the order in match.pd ?
> 
> if-conversion could be teached to swap this (it's if-conversion creating
> the IL for conditional reductions) when valid.  IIRC Robin Dapp also has
> a patch to make if-conversion emit .COND_ADD instead which should make
> it even better to vectorize.

I knew that patch, Robin is trying fixing the issue (in-order reduction)that I
posted.

I have confirm that patch can't help since it didn't modify the code for this
case, we will end up with multiple use in conditional reduction.

The reduction failed since:

  /* If this isn't a nested cycle or if the nested cycle reduction value
     is used ouside of the inner loop we cannot handle uses of the reduction
     value.  */
  if (nlatch_def_loop_uses > 1 || nphi_def_loop_uses > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "reduction used in loop.\n");
      return NULL;
    }

when  nphi_def_loop_uses  > 1, we failed to vectorize.

I have checked LLVM codes, and I think we can extend this function:

strip_nop_cond_scalar_reduction

We should be able to strip all the statement until we can reach the
use of PHI result, like this:

LLVM is able to handle this case:

for ()
  if (cond)
    result += a[i] + b[i] + c[i] + .... 

No matter how many variables are added in the condition reduction.
They well handle that since they keep iterating all the statement until
reach the result:

result_ssa_1 = PHI <>
tmp1 = result_ssa_1 + a[i];
tmp2 = tmp1 + b[i];
tmp3 = tmp2 + c[i];
....

We keep iterating until find the result_ssa_1 to hold the reduction variable.

Is this LLVM's approach reasonable to GCC?

If yes, I can translate LLVM code into GCC.

Thanks.

[Bug tree-optimization/109088] GCC does not always vectorize conditional reduction

Reply via email to