https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116974
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---

One issue is that with SLP scheduling we're relying on data dependence to order vector stmts in the correct order. With omp scan we have scalar code like

  _12 = .GOMP_SIMD_LANE (simduid.2_6(D), 0);
  _13 = .GOMP_SIMD_LANE (simduid.2_6(D), 1);
  D.2789[_13] = 0;
  _15 = (long unsigned int) i_42;
  _16 = _15 * 4;
  _18 = a_17(D) + _16;
  _19 = *_18;
  r.0_20 = D.2789[_12];
  _21 = _19 + r.0_20;
  D.2789[_12] = _21;
  _23 = .GOMP_SIMD_LANE (simduid.2_6(D), 2);
  _24 = D.2790[_23];
  _25 = D.2789[_23];
  _26 = _24 + _25;
  D.2790[_23] = _26;
  D.2789[_23] = _26;
  _30 = b_29(D) + _16;
  r.0_31 = D.2789[_12];
  *_30 = r.0_31;

where vectorization of the in-scan reduction is currently performed by vectorizable_scan_store on the D.2790[_23] = _26 store. But vector stmt order with respect to the other "SLP instance" defining D.2789 for non-SLP simply relies on us emitting vector stmts where the scalar stmts are, but with SLP this only works because in the end we're using the first scalar stmt as the point to emit.

I think it would be preferable if the temporaries were elided as SSA names and thus did not appear as loads/stores.

I'm not sure whether this whole inscan / scan stuff would be necessary if we supported vectorizing reductions that are used inside of the loop, for example by forcing them to be in-order and constructing the vector of reduction values in each iteration.

That we key the reduction code-gen off the store and not the add isn't helpful. Very likely a cleaner solution would at least make the scan-reduction visible during SLP discovery so we can key code generation off that stmt. We'd want the reduction operands (_24 and _25 above) as well as the initialization value (0 from the D.2789[_13] = 0 store) as children.

I do have a hackish patch to make most cases work with the current scheme though.
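
For readers without the testcase at hand, here is a minimal C sketch of the kind of source loop that plausibly lowers to scalar code of the shape quoted above: an inclusive in-scan reduction over a[] whose running value is stored to b[]. The function names, the int element type and the VF constant are assumptions for illustration, not taken from the PR. The second function is only a hand-written scalar model of the "in-order, build the vector of reduction values in each iteration" idea; it is not what the vectorizer emits. Without -fopenmp or -fopenmp-simd the pragmas are ignored and the first loop is a plain serial prefix sum.

```c
#include <stddef.h>

/* Hypothetical vectorization factor, used only by the blocked model below. */
#define VF 4

/* Inclusive scan written with OpenMP's inscan reduction.
   b[i] receives a[0] + ... + a[i].  */
void scan_inclusive_omp(const int *a, int *b, int n)
{
    int r = 0;
#pragma omp simd reduction(inscan, +:r)
    for (int i = 0; i < n; i++) {
        r += a[i];              /* input phase: update the reduction */
#pragma omp scan inclusive(r)
        b[i] = r;               /* scan phase: use the running value */
    }
}

/* Scalar model of an in-order reduction whose intermediate values are
   used inside the loop: each "vector iteration" builds VF running
   values from the carried-in value, stores them, and carries the last
   lane into the next iteration.  */
void scan_inclusive_blocked(const int *a, int *b, int n)
{
    int carry = 0;
    int i = 0;

    for (; i + VF <= n; i += VF) {
        int lanes[VF];
        int acc = carry;
        for (int l = 0; l < VF; l++) {
            acc += a[i + l];
            lanes[l] = acc;     /* lane l holds the prefix sum up to i + l */
        }
        for (int l = 0; l < VF; l++)
            b[i + l] = lanes[l];
        carry = lanes[VF - 1];
    }

    /* Scalar epilogue for the remaining elements. */
    for (; i < n; i++) {
        carry += a[i];
        b[i] = carry;
    }
}
```

Both functions should compute the same prefix sums; the blocked one just makes explicit the per-iteration vector of reduction values plus the carried-in scalar, which correspond loosely to the operands and initialization value one would want visible as SLP children.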