On 30/05/2025 23:36, Tobias Burnus wrote:
The attached patch adds omp_target_memset and omp_target_memset_async,
which permit setting (potentially large) data on the device to a
certain value, in particular to '\0'.

It uses 'memset' on the host (and for shared memory, e.g. via
requires unified_shared_memory/self_maps). For nvptx, cuMemsetD8
is used, and for AMD GPUs, hsa_amd_memory_fill. However, the latter
only supports 4-byte-aligned data and works in multiples of 4 bytes.
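
A minimal usage sketch, assuming the OpenMP 6.0 signature
void *omp_target_memset (void *ptr, int val, size_t count, int dev)
that the patch implements:

  #include <stddef.h>
  #include <omp.h>

  int
  main (void)
  {
    const int dev = omp_get_default_device ();
    const size_t n = 1 << 20;
    char *p = (char *) omp_target_alloc (n, dev);
    if (p == NULL)
      return 1;
    omp_target_memset (p, '\0', n, dev);  /* Zero N bytes on the device.  */
    omp_target_free (p, dev);
    return 0;
  }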

@Sandra: Any .texi comments? (Or generic comments.)
@Thomas, Jakub, anyone: Any comment?

@Andrew, anyone: Any suggestion regarding the GCN implementation?
At the moment, the code is fine for 4-byte-aligned data whose size
is a multiple of 4 bytes, for large counts, or for count < 4.
The worst case is size 1+4+1, where 1 byte is needed to reach
aligned data. The question is where the turnover point lies between
calloc + host2dev + free and the alternative: host2dev for the
misaligned head + fill + host2dev for the trailing bytes.
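
For concreteness, a minimal sketch of that head/fill/tail split,
assuming a hypothetical helper name; hsa_memory_copy and
hsa_amd_memory_fill are the actual ROCr entry points:

  #include <stddef.h>
  #include <stdint.h>
  #include <hsa.h>
  #include <hsa_ext_amd.h>

  static hsa_status_t
  gcn_memset_sketch (void *dev_ptr, int val, size_t count)
  {
    uintptr_t start = (uintptr_t) dev_ptr;
    /* Bytes until the next 4-byte boundary (0..3), clamped to COUNT.  */
    size_t head = (4 - (start & 3)) & 3;
    if (head > count)
      head = count;
    /* Aligned middle, a multiple of 4 bytes; the remainder is the tail.  */
    size_t aligned = (count - head) & ~(size_t) 3;
    size_t tail = count - head - aligned;

    unsigned char c = (unsigned char) val;
    uint32_t word = (uint32_t) c * 0x01010101u;  /* Replicate the byte.  */
    unsigned char buf[4] = { c, c, c, c };

    hsa_status_t status = HSA_STATUS_SUCCESS;
    if (head)  /* Misaligned head: small host2dev copy.  */
      status = hsa_memory_copy ((void *) start, buf, head);
    if (status == HSA_STATUS_SUCCESS && aligned)
      status = hsa_amd_memory_fill ((void *) (start + head), word,
                                    aligned / 4);
    if (status == HSA_STATUS_SUCCESS && tail)  /* Trailing bytes.  */
      status = hsa_memory_copy ((void *) (start + head + aligned), buf,
                                tail);
    return status;
  }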
Thoughts?
Tobias

The hsa_memory_copy API is known to be slow, so for smaller data sizes it's probably better to have a single hsa_memory_copy replace the whole memset than to use three API calls, even with the cost of setting up some host-side memory to copy from. This should be pretty easy to measure in any case.
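
A minimal sketch of that single-copy alternative (the helper name is
hypothetical; hsa_memory_copy is the actual ROCr call):

  #include <stdlib.h>
  #include <string.h>
  #include <hsa.h>

  static hsa_status_t
  gcn_memset_via_copy (void *dev_ptr, int val, size_t count)
  {
    void *host = malloc (count);
    if (host == NULL)
      return HSA_STATUS_ERROR_OUT_OF_RESOURCES;
    memset (host, val, count);  /* Prepare the pattern host-side.  */
    hsa_status_t status = hsa_memory_copy (dev_ptr, host, count);
    free (host);
    return status;
  }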

My biggest problem is that your code contains no comments explaining why this function doesn't look as simple as you'd expect. The reader is just expected to be familiar with the limitations of HSA, which I would suggest is an unreasonable expectation.

Andrew


PS: As some implementation is better than none, and as this one works,
I intend to commit the patch early next week, but it feels like
something that should eventually be revisited for the AMD issue.
