On 29 Jun 2020 22:03, "Brown, Julian" <julian_br...@mentor.com> wrote:
> On Mon, 29 Jun 2020 21:32:41 +0100
> Andrew Stubbs <a...@codesourcery.com> wrote:
> > In particular, it seems logical that any barrier should be a memory
> > barrier, so inserting it in the barrier pattern is not a big deal.
> > IIRC, only OpenACC is using that anyway (OpenMP has explicit asm
> > inserts in libgomp).
>
> I'd be happier with that idea if ds_{read,write} operations were *only*
> used for broadcasting -- but they're not, they may also be used for
> (some) gang-private variables and for reduction temporaries. I don't
> have a test case for either of those at present demonstrating bad
> behaviour with no waitcnt, but I guess it's theoretically possible for
> there to be one, at least.

If there's no barrier then a few cycles this way or that shouldn't make any 
difference, surely?

The only exception I can think of might be atomic release operators, but those 
do a cache flush already, so there shouldn't be any issue with a slightly 
delayed DS operation. Maybe there should be a wait instruction before those too.

> The "proper" solution is a general (& "optimal") waitcnt insertion
> pass, I think, that works with other memory operations as well as the
> DS ones.

Well, yes, that would be nice. The read waits are surely the worst performance 
loss. It's not a trivial task though, and AMD refused to fund it as a directed 
services task last winter.
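
For illustration only, the simpler "wait in the barrier pattern" idea can be
sketched as a toy model in Python. This is not GCC code: instructions are
plain strings rather than RTL, the function name insert_waits_before_barriers
is hypothetical, and it assumes that "s_waitcnt lgkmcnt(0)" is the wait that
covers outstanding DS operations. It only tracks DS operations and barriers,
so it is far from the general pass discussed above.

```python
# Toy model only: instructions are plain strings, not GCC RTL.
DS_PREFIXES = ("ds_read", "ds_write")   # LDS operations tracked by this sketch
WAIT = "s_waitcnt lgkmcnt(0)"           # assumed wait for outstanding DS ops
BARRIER = "s_barrier"


def insert_waits_before_barriers(insns):
    """Return a copy of insns with a wait inserted before any barrier
    that may still have DS operations in flight."""
    out = []
    pending_ds = False  # True if a DS op was issued since the last full wait
    for insn in insns:
        op = insn.split()[0]
        if op.startswith(DS_PREFIXES):
            pending_ds = True
        elif op == "s_waitcnt" and "lgkmcnt(0)" in insn:
            pending_ds = False          # an existing wait already covers it
        elif op == BARRIER and pending_ds:
            out.append(WAIT)            # make the barrier a memory barrier
            pending_ds = False
        out.append(insn)
    return out


if __name__ == "__main__":
    example = [
        "ds_write_b32 v0, v1",
        "s_barrier",
        "ds_read_b32 v2, v0",
        "s_waitcnt lgkmcnt(0)",
        "s_barrier",
    ]
    for line in insert_waits_before_barriers(example):
        print(line)
```

In this model only the first barrier gets an extra wait; the second is
already preceded by one. DS operations that never reach a barrier are left
alone, which is the "few cycles this way or that" case above.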

Andrew
