Hi Andrew,

→ https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655947.html

On June 28, 2024, Andrew Stubbs wrote:
> Some GPU compute systems allow the GPU to access host memory without much
> prior setup, but that's not necessarily the fast way to do it.  For shared
> memory APUs this is almost certainly the correct choice, but for AMD there
> is the difference between "fine-grained" and "coarse-grained" memory, and
> for NVidia Cuda generally runs better if it knows the status of the memory
> you access.

In my understanding, the migrate-on-page-fault USM implementation works
rather well in general, both with AMD and Nvidia GPUs. It obviously has
issues, e.g. when the same page is accessed frequently (semi-)concurrently
by the host and the device, as it then keeps migrating back and forth. Or when
you access a large array, where mapping it in one go is faster than
repeatedly hitting the page boundary. And if the data is handled through an
interconnect, like with NVLink on PowerPC + Volta, there is a long latency.

The issue with pages migrating back and forth can be solved by placing the
data in pinned memory (best in memory provided by the GPU runtime). And the
page-boundary issue can be fixed by using large pages for large data.
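
On the user side, the pinning could be done via an OpenMP allocator with
the 'pinned' trait - a minimal sketch, assuming libgomp honors that trait
(whether runtime-pinned memory is used underneath is implementation
dependent):

  #include <omp.h>

  #pragma omp requires unified_shared_memory

  int
  main (void)
  {
    /* Request pinned memory so USM pages stop migrating back and forth
       when host and device access them concurrently.  */
    omp_alloctrait_t traits[] = {{ omp_atk_pinned, omp_atv_true }};
    omp_allocator_handle_t pinned
      = omp_init_allocator (omp_default_mem_space, 1, traits);

    double *buf = omp_alloc (1024 * sizeof (double), pinned);

    #pragma omp target parallel for
    for (int i = 0; i < 1024; i++)
      buf[i] = 2.0 * i;

    omp_free (buf, pinned);
    omp_destroy_allocator (pinned);
    return 0;
  }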

Therefore, I think for USM, switching to no mapping by default is not a
bad idea. (However, see env var idea below.)

* * *

Not a review, but some first comments (glancing also at my local WIP patch + my
personal to-do list):

* I need to finish my patch that still does mapping for 'declare target
  enter(…)' variables - otherwise, automatically turning on
  GOMP_OFFLOAD_CAP_SHARED_MEM will give tons of fails on those systems.
  (Generic issue: it should also be fixed for 'requires
  unified_shared_memory', but typical smaller USM code is less likely to
  hit this issue.) For 'declare target link', I have already posted a
  patch, but it still has to be committed.

* I think having a per-device property would be useful. In principle, it
  would be nice that, when two GPUs exist on a system but only one has
  shared-memory support, the USM-capable GPU would be selected with
  'requires unified_shared_memory'. Currently, all GPUs are then excluded.
  Example: Richi's gfx1030 and gfx1036, where only gfx1036 supports USM.
  Currently, HSA_AMD_SYSTEM_INFO_SVM_SUPPORTED is false when both of his
  GPUs are enabled, as that is a system property and not a per-device
  property. (For Nvidia, we have CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS;
  for this feature, we had to exclude such devices in the count in
  GOMP_OFFLOAD_get_num_devices, but they also need to be skipped in the
  later GOMP_OFFLOAD_init_device.)
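
  A per-device check in the nvptx plugin could look like the following
  sketch (just the CUDA driver API call, error handling reduced):

  #include <stdbool.h>
  #include <cuda.h>

  /* Sketch: true if DEV can coherently access pageable host memory,
     i.e. supports USM without prior registration.  */
  static bool
  device_supports_usm (CUdevice dev)
  {
    int pageable = 0;
    if (cuDeviceGetAttribute (&pageable,
                              CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS,
                              dev) != CUDA_SUCCESS)
      return false;
    return pageable != 0;
  }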
* For AMD APUs, I wonder whether we could instead use the following:
  hsa_agent_get_info (agent, HSA_AMD_AGENT_INFO_MEMORY_PROPERTIES,
  &memory_properties) plus hsa_flag_isset64 (memory_properties,
  HSA_AMD_MEMORY_PROPERTY_AGENT_IS_APU). (This needs a newer
  'hsa_ext_amd.h' than the one included in GCC, or additional definitions
  in our copy of the .h file.)
  [I don't know in which ROCm version this feature was added.]
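
  A sketch of how I imagine that check (assuming the uint8_t[8] value
  type from newer ROCm headers; the enum values would have to be added
  to our copy of hsa_ext_amd.h if it is too old):

  #include <stdbool.h>
  #include <stdint.h>
  #include <hsa.h>
  #include <hsa_ext_amd.h>

  /* Sketch: detect an APU via the AMD vendor extension; the attribute
     fills a small bit array that is tested with hsa_flag_isset64.  */
  static bool
  agent_is_apu (hsa_agent_t agent)
  {
    uint8_t memory_properties[8] = { 0 };
    if (hsa_agent_get_info (agent,
                            (hsa_agent_info_t)
                            HSA_AMD_AGENT_INFO_MEMORY_PROPERTIES,
                            memory_properties) != HSA_STATUS_SUCCESS)
      return false;
    return hsa_flag_isset64 (memory_properties,
                             HSA_AMD_MEMORY_PROPERTY_AGENT_IS_APU);
  }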

<offtopic>

Talking about this API, I wonder whether we also want to use
HSA_AMD_AGENT_INFO_MAX_WAVES_PER_CU (returning a uint32_t). I do see it in
our hsa_ext_amd.h, but it is not used elsewhere. In our documentation,
https://gcc.gnu.org/onlinedocs/libgomp/AMD-Radeon.html, we claim
"The hardware permits maximally 40 workgroups/CU", but if I run that check
on gfx90a, I get '32' and not '40' as the result. (On gfx908 it is 40.)
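
The check I ran is essentially (sketch):

  /* Sketch: query the per-CU wave limit instead of hard-coding 40/32.  */
  static uint32_t
  max_waves_per_cu (hsa_agent_t agent)
  {
    uint32_t max_waves = 0;
    hsa_agent_get_info (agent,
                        (hsa_agent_info_t)
                        HSA_AMD_AGENT_INFO_MAX_WAVES_PER_CU,
                        &max_waves);
    return max_waves;  /* 32 on gfx90a, 40 on gfx908.  */
  }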

On the other hand, 40 only shows up in the comment to
parse_target_attributes, while 32 is used in gcn_exec.

</offtopic>

* Regarding USM vs. MAPPING:
  OpenMP leaves it open whether, with 'requires unified_shared_memory',
  explicit mapping is honored or not. — With (OpenMP 6.0's) 'requires
  self_maps', no mapping is permitted.
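
  For reference, the difference in source code (as I read the spec:
  whether the explicit map below creates a device copy is implementation
  defined under 'unified_shared_memory', but no copy may be created
  under 'self_maps'):

  #pragma omp requires unified_shared_memory
  /* or, OpenMP 6.0:  #pragma omp requires self_maps  */

  int x[1024];

  void
  f (void)
  {
    /* With unified_shared_memory, this map may be honored or ignored;
       with self_maps it behaves as a self map, i.e. no copy.  */
    #pragma omp target map(tofrom: x)
    x[0] = 42;
  }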

Your patch changes the current handling: with non-APUs, it will not set
GOMP_OFFLOAD_CAP_SHARED_MEM for 'requires unified_shared_memory',
which means that 'map' clauses are no longer ignored.

I think that's okay(ish), at least for explicit map clauses, but I am not
sure whether it is for implicit maps. In any case, if we do so, we probably
must update omp_target_is_accessible - otherwise it returns false, even when
USM has been required, if the capabilities do not include
GOMP_OFFLOAD_CAP_SHARED_MEM.
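
I.e. something like the following should print 1 once USM is in force,
independent of the capability bit (sketch):

  #include <stdio.h>
  #include <omp.h>

  #pragma omp requires unified_shared_memory

  int
  main (void)
  {
    int x;
    /* With USM required, host memory is device accessible; this must
       not depend on GOMP_OFFLOAD_CAP_SHARED_MEM being set.  */
    printf ("%d\n",
            omp_target_is_accessible (&x, sizeof (x),
                                      omp_get_default_device ()));
    return 0;
  }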

In any case, similar to Cray, I think it makes sense to have an env var to
toggle between mapping vs. not mapping for USM-supporting devices that
aren't APUs.
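
Something like the following in libgomp - with a made-up variable name
(GOMP_USM_MAP) and default, both just placeholders:

  #include <stdbool.h>
  #include <stdlib.h>
  #include <string.h>

  /* Sketch: user toggle for honoring 'map' clauses on USM-capable
     devices; the env var name and the default are placeholders.  */
  static bool
  usm_honor_maps (void)
  {
    const char *e = getenv ("GOMP_USM_MAP");  /* hypothetical name */
    if (e == NULL)
      return true;   /* placeholder default: keep mapping */
    return strcmp (e, "0") != 0;
  }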

* Compiler side:
- I think we need to turn all automatic 'declare target enter(...)' variables
  to 'link' when 'requires unified_shared_memory' is used.
  (At least that's how I read the TR13/6.0 spec; see the sketch after this
  list.)
- For 'self_maps', we have to do likewise for all declare-target variables.
- I was thinking of adding a command-line flag to force-change 'enter' to
  'link', applicable to both 'usm' and no requires line. (For 'self_maps'
  it would always be on, and for 'usm' it would still be on for automatic
  'declare target' handling.)
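
For reference, the two declare-target variants the conversion is about
(sketch; only one of the two directives would apply to a given variable):

  int tab[1024];

  /* 'enter': a device copy of tab is created at load time - under USM
     this separate copy is exactly what we want to avoid.  */
  #pragma omp declare target enter(tab)

  /* 'link': only a reference is reserved on the device, which can be
     resolved to the host instance - what USM/self_maps need.  */
  /* #pragma omp declare target link(tab)  */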

Tobias
