While most of the nvptx systems I have access to don't support
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
one does: the Tesla V100-SXM2-16GB (as installed, e.g., in ORNL's
Summit) supports this feature. With that feature, unified-shared-memory
support works, presumably by performing automatic page migration when a
page fault occurs.
Hence: enable USM support for such devices. With this change, all
'requires unified_shared_memory' tests of sollve_vv pass :-)
I am not quite sure whether there are unintended side effects; hence, I
have not enabled this support in general. In particular, 'declare
target enter(global_var)' seems to be mishandled (I think it should be
treated as 'link', with the pointer updated to point to the host
variable; cf. the description of 'self_maps'). Thus, USM is not enabled
by default but only when it has been requested.
OK for mainline?
Comments? Remarks? Suggestions?
Tobias
PS: I guess some more USM tests should be added…
libgomp: Enable USM for some nvptx devices
A few high-end nvptx devices support the attribute
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES;
for those, unified shared memory is supported in hardware. This
patch enables USM support for such devices, provided that all
installed nvptx devices have this feature (as the capabilities are
tracked per device type).
This exposes a bug in gomp_copy_back_icvs, which previously used
omp_get_mapped_ptr to find mapped variables; that function returns
the pointer unchanged in the case of shared memory. However, even
with shared memory, a few pointers, such as the ICV variables, are
actually mapped. Additionally, there was a mismatch regarding '-1'
as device number, as gomp_copy_back_icvs and omp_get_mapped_ptr
count devices differently. Hence, do the lookup manually.
include/ChangeLog:
* cuda/cuda.h
(CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES):
Add.
libgomp/ChangeLog:
* libgomp.texi (nvptx): Update USM description.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices):
Claim support when requesting USM and all devices support
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES.
* target.c (gomp_copy_back_icvs): Fix device ptr lookup.
(gomp_target_init): Set GOMP_OFFLOAD_CAP_SHARED_MEM if the
device supports USM.
include/cuda/cuda.h | 3 ++-
libgomp/libgomp.texi | 5 ++++-
libgomp/plugin/plugin-nvptx.c | 15 +++++++++++++++
libgomp/target.c | 24 +++++++++++++++++++++++-
4 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 0dca4b3a5c0..db640d20366 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -83,7 +83,8 @@ typedef enum {
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
- CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+ CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+ CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES = 100
} CUdevice_attribute;
enum {
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 71d62105a20..e0d37f67983 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -6435,7 +6435,10 @@ The implementation remark:
the next reverse offload region is only executed after the previous
one returned.
@item OpenMP code that has a @code{requires} directive with
- @code{unified_shared_memory} will remove any nvptx device from the
+ @code{unified_shared_memory} will run on nvptx devices if and only if
+ all of those support the
+ @code{CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES}
+ attribute; otherwise, all nvptx devices are removed from the
list of available devices (``host fallback'').
@item The default per-warp stack size is 128 kiB; see also @code{-msoft-stack}
in the GCC manual.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5aad3448a8d..c4b0f5dd4bf 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1201,8 +1201,23 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
if (num_devices > 0
&& ((omp_requires_mask
& ~(GOMP_REQUIRES_UNIFIED_ADDRESS
+ | GOMP_REQUIRES_UNIFIED_SHARED_MEMORY
| GOMP_REQUIRES_REVERSE_OFFLOAD)) != 0))
return -1;
+ /* Check whether automatic page migration is supported; if so, enable USM.
+ Currently, capabilities are per device type; hence, check all devices. */
+ if (num_devices > 0
+ && (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY))
+ for (int dev = 0; dev < num_devices; dev++)
+ {
+ int pi;
+ CUresult r;
+ r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+ CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES,
+ dev);
+ if (r != CUDA_SUCCESS || pi == 0)
+ return -1;
+ }
return num_devices;
}
diff --git a/libgomp/target.c b/libgomp/target.c
index 5ec19ae489e..48689920d4a 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2969,8 +2969,25 @@ gomp_copy_back_icvs (struct gomp_device_descr *devicep, int device)
if (item == NULL)
return;
+ gomp_mutex_lock (&devicep->lock);
+
+ struct splay_tree_s *mem_map = &devicep->mem_map;
+ struct splay_tree_key_s cur_node;
+ void *dev_ptr = NULL;
+
void *host_ptr = &item->icvs;
- void *dev_ptr = omp_get_mapped_ptr (host_ptr, device);
+ cur_node.host_start = (uintptr_t) host_ptr;
+ cur_node.host_end = cur_node.host_start;
+ splay_tree_key n = gomp_map_0len_lookup (mem_map, &cur_node);
+
+ if (n)
+ {
+ uintptr_t offset = cur_node.host_start - n->host_start;
+ dev_ptr = (void *) (n->tgt->tgt_start + n->tgt_offset + offset);
+ }
+
+ gomp_mutex_unlock (&devicep->lock);
+
if (dev_ptr != NULL)
gomp_copy_dev2host (devicep, NULL, host_ptr, dev_ptr,
sizeof (struct gomp_offload_icvs));
@@ -5303,6 +5320,11 @@ gomp_target_init (void)
{
/* Augment DEVICES and NUM_DEVICES. */
+ /* If USM has been requested and is supported by all devices
+ of this type, set the capability accordingly. */
+ if (omp_requires_mask & GOMP_REQUIRES_UNIFIED_SHARED_MEMORY)
+ current_device.capabilities |= GOMP_OFFLOAD_CAP_SHARED_MEM;
+
devs = realloc (devs, (num_devs + new_num_devs)
* sizeof (struct gomp_device_descr));
if (!devs)