On 29 Jun 2020 22:03, "Brown, Julian" <julian_br...@mentor.com> wrote:
> On Mon, 29 Jun 2020 21:32:41 +0100
> Andrew Stubbs <a...@codesourcery.com> wrote:
>
> > In particular, it seems logical that any barrier should be a memory
> > barrier, so inserting it in the barrier pattern is not a big deal.
> > IIRC, only OpenACC is using that anyway (OpenMP has explicit asm
> > inserts in libgomp).
>
> I'd be happier with that idea if ds_{read,write} operations were *only*
> used for broadcasting -- but they're not, they may also be used for
> (some) gang-private variables and for reduction temporaries. I don't
> have a test case for either of those at present demonstrating bad
> behaviour with no waitcnt, but I guess it's theoretically possible for
> there to be one, at least.

If there's no barrier then a few cycles this way or that shouldn't make
any difference, surely? The only exception I can think of might be
atomic release operators, but those do a cache flush already, so there
shouldn't be any issue with a slightly delayed DS operation. Maybe there
should be a wait instruction before those too.

> The "proper" solution is a general (& "optimal") waitcnt insertion
> pass, I think, that works with other memory operations as well as the
> DS ones.

Well, yes, that would be nice. The read waits are surely the worst
performance loss. It's not a trivial task though, and AMD refused to
fund it as a directed services task last winter.

Andrew