On 07/07/2020 12:03, Richard Sandiford wrote:
Andrew Stubbs <a...@codesourcery.com> writes:
This patch implements a floating-point fold_left_plus vector pattern,
which gives a significant speed-up in the BabelStream "dot" benchmark.

The GCN architecture can't actually do an in-order vector reduction any
more efficiently than the equivalent scalar algorithm, so this is a bit
of a cheat.  However, dividing the problem into threads using OpenACC or
OpenMP has already broken the in-order semantics, so we may as well
optimize the operation at the vector level too.

If the user has deliberately sorted the input data in order to get a
more accurate FP result, then using multiple threads is already the
wrong thing to do.  But if the input data is in no particular numerical
order, this optimization gives a correct answer much faster, albeit
possibly a slightly different one on each run.
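For illustration, a minimal sketch of the situation described above
(not code from the patch; the dot product is just an assumed example):

  /* With the OpenMP reduction clause, each thread accumulates a
     private partial sum, and the partial sums are then combined in an
     unspecified order, so the strict left-to-right FP evaluation
     order is already lost before vectorization enters the picture.  */
  float
  dot (const float *a, const float *b, int n)
  {
    float sum = 0.0f;
  #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }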

There doesn't seem to be anything GCN-specific here though.
If pragmas say that we can ignore associativity rules, we should apply
that in target-independent code rather than in each individual target.

Yes, I'm lazy. That, and I'm not sure what a target-independent solution would look like.

Presumably we'd need something for both OpenMP and OpenACC, and it would need to be specific to certain operations (not just a blanket -fassociative-math), which means the vectorizer (anywhere else?) would need to be taught about the new thing?

The nearest example I can think of is the force_vectorize flag that OpenMP "simd" and OpenACC "vector" already use (the latter being amdgcn-only, since nvptx does its own OpenACC vectorization).
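For reference, the kind of loop those pragmas mark (a hypothetical
example, not from the benchmark):

  /* OpenMP "simd" (and, on amdgcn, OpenACC "vector") set the loop's
     force_vectorize flag, asking the vectorizer to vectorize the loop
     even where it would not otherwise choose to.  */
  void
  scale (float *restrict a, int n)
  {
  #pragma omp simd
    for (int i = 0; i < n; i++)
      a[i] *= 2.0f;
  }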

I'm also not completely convinced that this -- or other cases like it -- isn't simply a target-specific issue. Could it be harmful on other architectures?

Anyway, ultimately I don't have time to do much more here.

Andrew
