On 02/12/2019 14:23, Thomas Schwinge wrote:
Hi!
On 2019-11-15T13:43:04+0100, Jakub Jelinek <ja...@redhat.com> wrote:
On Fri, Nov 15, 2019 at 12:38:06PM +0000, Andrew Stubbs wrote:
On 15/11/2019 12:21, Jakub Jelinek wrote:
I'm surprised by the set acc_mem_shared 0, I thought gcn is a shared memory
offloading target.
APUs, such as Carrizo, are shared memory. DGPUs, such as Fiji and Vega, have
their own memory. A DGPU can access host memory, provided that it has been
set up just so, but that is very slow, and I don't know of a way to do that
without still having to copy the program data into that special region.
For a few years already, Nvidia GPUs/drivers have been supporting what
they call Unified Memory, where the driver/kernel automatically handles
the movement of memory pages between host/device memories. Given some
reasonable pre-fetching logic (either automatic in the driver/kernel, or
"guided" by the compiler/runtime), this reportedly achieves good
performance -- or even better performance than manually-managed memory
copying, as really only the data pages accessed (plus pre-fetched) will
be copied.
Yeah, this is not that. When the AMD GPU accesses host memory it
appears to bypass both L1 and L2 caching. There's no copying, just
direct, on-demand accesses. This makes the performance really bad. We
use it only for message passing, which is probably the original intent.
For example, see <https://dl.acm.org/citation.cfm?id=3356141> "Compiler
assisted hybrid implicit and explicit GPU memory management under unified
address space", which I recently saw presented (SuperComputing 2019), or
other publications.
This is not currently implemented in GCC, but could/should be at some
point.
This (or even a mixture of manual-discrete/automatic-shared?) would then
be an execution mode of libgomp/plugin, selected at run-time?
All we really need from libgomp, to support AMD APUs, is to be able to
toggle the shared memory mode dynamically, rather than having it baked
into the capabilities at start-up. Probably we could figure out the
capabilities at run-time already, but that would break when a system has
both kinds of device. Anyway, this is theoretical, as I have no intention
of implementing support for such devices.
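
To illustrate the per-device idea, here is a minimal sketch in C. It
assumes a capability bit that is looked up per device rather than once
for the whole plugin. Every name in it (CAP_SHARED_MEM, struct device,
needs_copy, count_copy_devices) is hypothetical and made up for this
example; it is not libgomp's actual plugin interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability bit: device shares memory with the host. */
#define CAP_SHARED_MEM (1u << 0)

/* Hypothetical per-device record (not a libgomp data structure). */
struct device {
    const char *name;
    unsigned int caps;  /* stored per device, not once per plugin */
};

/* True if data must be copied explicitly to this device. */
static bool needs_copy(const struct device *dev)
{
    assert(dev != NULL);
    return (dev->caps & CAP_SHARED_MEM) == 0;
}

/* Count the devices in a mixed system that need explicit copies. */
static size_t count_copy_devices(const struct device *devs, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (needs_copy(&devs[i]))
            count++;
    return count;
}
```

With one APU-like entry (CAP_SHARED_MEM set) and one DGPU-like entry
(no bit set) in the same array, needs_copy gives a different answer for
each, which is the case a single start-up capability mask cannot express.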
Andrew