On 07/07/2020 12:03, Richard Sandiford wrote:
Andrew Stubbs <a...@codesourcery.com> writes:
This patch implements a floating-point fold_left_plus vector pattern,
which gives a significant speed-up in the BabelStream "dot" benchmark.

The GCN architecture can't actually do an in-order vector reduction any
more efficiently than the equivalent scalar algorithm, so this is a bit
of a cheat.  However, dividing the problem into threads using OpenACC or
OpenMP has already broken the in-order semantics, so we may as well
optimize the operation at the vector level too.

If the user has deliberately sorted the input data in order to get a
more accurate FP result, then using multiple threads is already the
wrong thing to do.  But if the input data is in no particular numerical
order, this optimization gives a correct answer much faster, albeit
possibly a slightly different one on each run.
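For illustration, a minimal sketch of the situation described above
(not code from the patch; the dot product is just an assumed example):

  /* With the OpenMP reduction clause, each thread accumulates a
     private partial sum, and the partial sums are then combined in an
     unspecified order, so the strict left-to-right FP evaluation
     order is already lost before vectorization enters the picture.  */
  float
  dot (const float *a, const float *b, int n)
  {
    float sum = 0.0f;
  #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }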

There doesn't seem to be anything GCN-specific here though.
If pragmas say that we can ignore associativity rules, we should apply
that in target-independent code rather than in each individual target.

Yes, I'm lazy. That, and I'm not sure what a target-independent solution would look like.

Presumably we'd need something for both OpenMP and OpenACC, and it would need to be specific to certain operations (not just a blanket -fassociative-math), which means the vectorizer (anywhere else?) would need to be taught about the new thing?

The nearest example I can think of is the force_vectorize flag that OpenMP "simd" and OpenACC "vector" already use (the latter being amdgcn-only, since nvptx does its own OpenACC vectorization).
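For reference, the kind of loop those pragmas mark (a hypothetical
example, not from the benchmark):

  /* OpenMP "simd" (and, on amdgcn, OpenACC "vector") set the loop's
     force_vectorize flag, asking the vectorizer to vectorize the loop
     even where it would not otherwise choose to.  */
  void
  scale (float *restrict a, int n)
  {
  #pragma omp simd
    for (int i = 0; i < n; i++)
      a[i] *= 2.0f;
  }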

I'm also not completely convinced that this -- or other cases like it -- isn't simply a target-specific issue. Could it be harmful on other architectures?

Anyway, ultimately I don't have time to do much more here.

Andrew
