https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116974

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
One issue is that with SLP scheduling we rely on data dependences to emit
vector stmts in the correct order.  With omp scan we have scalar code like

  _12 = .GOMP_SIMD_LANE (simduid.2_6(D), 0);
  _13 = .GOMP_SIMD_LANE (simduid.2_6(D), 1);
  D.2789[_13] = 0;
  _15 = (long unsigned int) i_42;
  _16 = _15 * 4;
  _18 = a_17(D) + _16;
  _19 = *_18;
  r.0_20 = D.2789[_12];
  _21 = _19 + r.0_20;
  D.2789[_12] = _21;
  _23 = .GOMP_SIMD_LANE (simduid.2_6(D), 2);
  _24 = D.2790[_23];
  _25 = D.2789[_23];
  _26 = _24 + _25;
  D.2790[_23] = _26;
  D.2789[_23] = _26;
  _30 = b_29(D) + _16;
  r.0_31 = D.2789[_12];
  *_30 = r.0_31;

where vectorization of the in-scan reduction is currently performed by
vectorizable_scan_store on the D.2790[_23] = _26 store.  But the vector
stmt order with respect to the other "SLP instance" defining D.2789
simply relies, for non-SLP, on us emitting vector stmts where the scalar
stmts are; with SLP this only works because in the end we're using
the first scalar stmt as the point to emit.
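
For orientation, a minimal sketch of source that would likely lower to
scalar code of the above shape.  The actual testcase is not quoted in this
comment, so the function name and signature here are assumptions; only the
general pattern (r += a[i]; scan inclusive; b[i] = r) is read off the dump.

```c
/* Hypothetical reproducer shape: an inclusive prefix sum written as an
   OpenMP in-scan reduction.  Names are illustrative, not from the PR.  */
void
inclusive_scan (const int *a, int *b, int n)
{
  int r = 0;
#pragma omp simd reduction (inscan, +:r)
  for (int i = 0; i < n; i++)
    {
      r += a[i];                     /* input phase: update the reduction */
#pragma omp scan inclusive (r)
      b[i] = r;                      /* scan phase: use the running value */
    }
}
```

Built with -fopenmp or -fopenmp-simd the pragmas take effect; without
either they are ignored and the loop is still a valid serial inclusive scan.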

I think it would be preferable if the temporaries were elided into
SSA names and thus did not appear as loads/stores.

I'm not sure whether this whole inscan / scan stuff would be necessary
if we supported vectorizing reductions that are used inside of the loop,
for example by forcing them to be in-order and constructing the vector of
reduction values in each iteration.
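
A rough scalar model of that idea, written as plain C rather than as
anything the vectorizer emits today.  VF, rvec and the function name are
illustrative assumptions; the point is only that the reduction is
accumulated strictly in order and each vector iteration materializes one
vector holding the per-lane reduction values.

```c
#define VF 4   /* hypothetical vectorization factor */

/* Model of an in-order reduction whose intermediate values are used
   inside the loop.  Not GCC code-gen, just the intended semantics.  */
void
scan_in_order (const int *a, int *b, int n)
{
  int r = 0;   /* reduction value carried across vector iterations */
  int i = 0;
  for (; i + VF <= n; i += VF)
    {
      int rvec[VF];   /* vector of reduction values for this iteration */
      for (int lane = 0; lane < VF; lane++)
        {
          r += a[i + lane];   /* accumulate in original iteration order */
          rvec[lane] = r;
        }
      for (int lane = 0; lane < VF; lane++)
        b[i + lane] = rvec[lane];   /* in-loop use, modelling a vector store */
    }
  for (; i < n; i++)   /* scalar epilogue */
    {
      r += a[i];
      b[i] = r;
    }
}
```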

That we key off the reduction code-gen from the store and not the
add isn't helpful.  Very likely a cleaner solution would at least
make the scan-reduction visible during SLP discovery so we can key
code generation off that stmt.

We'd want the reduction operands (_24 and _25 above) as well as the
initialization value (0 from the D.2789[_13] = 0 store) as children.

I do have a hackish patch to make most cases work with the current scheme
though.
