Re: [PATCH v2 1/3] libgomp, nvptx: low-latency memory allocator

Tobias Burnus Fri, 08 Sep 2023 02:04:41 -0700

Hi Andrew,

some early comments. I think in general, the direction/patches are fine,
but I have some comments:


On 02.08.23 19:00, Andrew Stubbs wrote:

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.


This probably fits better to 2/3 in the series, but you really should
document the nvptx part, namely:

- that omp_low_lat_mem_space is supported on nvptx

- its limitations (access is restricted to the contention group, i.e.
all threads of a team) → implication on the supported allocators.

- the default size of this memory (8 kiB) and the GOMP_NVPTX_LOWLAT_POOL
environment variable, possibly with mentioning that there is some
internal overhead* which is worsen when using high alignment values. (*
– due to basic_allocator book keeping and for storing pointer to the
OpenMP allocator struct.)

- if I understand it correctly, our default build supports sm_30 and
uses PTX ISA version 3.1 for it. If so, I think we should mention that
nvptx GCC has to be configured with with-arch=sm_... >= sm_53 (=
supported version >=4.1) and, during compilation, no -march= < that
configure-time value may be specified. (Cf. also
https://gcc.gnu.org/install/specific.html#nvptx-x-none )

I think this best fits into
https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html – but one could also
argue that it should be put elsewhere.

It probably makes sense to add a 'See also:' to
https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html pointing
to https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html

* * *

BTW: I think the following should be "...MINOR__ >= 1":

+#if __PTX_ISA_VERSION_MAJOR__ > 4 \
+    || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MAJOR__ >= 1)


* * *

Regarding patch 2/3 and MEMSPACE_VALIDATE.

In general, I wonder how to handle memory spaces (and traits) that
aren't supported. Namely, when to return 0L and when to silently use
ignore the trait / use another memory space.

The current omp_init_allocator code only returns omp_null_allocator for
invalid value – or for pinned memory (as it is unsupported). [RFC: Shall
we keep doing so – or return omp_null_mem_alloc more often? →
https://gcc.gnu.org/PR111044 for this question, improving libmemkind
usage, and extending the allocator-related documentation.]

As we do it on the host, I think auto-fallback to omp_default_mem_space
is is also find for nvptx (and gcn), but not as done in 2/3 but slightly
different:

(a) In omp_init_allocator, there should be a check whether it is
supported, if not, we can fallback to using default memory space. (In
line with the current code host + 1/2+2/3 nvptx behaviour.)

Note: That's not the same as the current 2/3 patch. Currently, if
MEMSPACE_VALIDATE fails, a retry is attempted – but the outcome depends
on the value for 'fallback'. When changing the memory space during
omp_init_allocator, only failed 'malloc' will give abort with abort_fb.

(b) For nvptx_memspace_validate, I think an additional check should be
done based on the __PTX_ISA_VERSION* as it feels off if plugin first
claims support for it but later unconditionally uses malloc at runtime.

(c) We also need to handle omp_low_lat_mem_alloc. I think the spec
implies access:all but nvptx/gcn only support cgroup (+ pteams +
thread), potentially leading to wrong code. Example (hopefully, I got
the syntax right:

#pragma omp target uses_allocator(omp_low_lat_mem_alloc)

#pragma omp teams firstprivate(var) allocate(omp_low_lat_mem_alloc: var)

#pragma omp distribute parallel for

...

#omp atomic ...

... var ...


The current 2/3 checks in alloc/calloc/realloc only cover user-defined
allocators; if we move the check for user-defined allocators to
omp_init_allocator, we actually only need to handle predefined
allocators in alloc/calloc/realloc.

And finally: As mentioned off list, I believe that for the patch 2/3,
the pteam should be cgroup (contention group), i.e. about all threads of
a team / implicit parallel and not only the innermost parallel (pteam).
That actually matches the "access != all" check, but I think "access =
cgroup" should also be tested for in the testsuite.

* * *

3/3 patch for GCN: I think the situation is similar, except that there
is no ISA version issue and most is handled by 1/3 and 2/3 such that
only updating documentation remains.

* * *

libgomp/ChangeLog:

      * allocator.c (MEMSPACE_ALLOC): New macro.
      (MEMSPACE_CALLOC): New macro.
      (MEMSPACE_REALLOC): New macro.
      (MEMSPACE_FREE): New macro.


BTW: You could (but are not required to) combine multiple macro/function
names to a single '(name1, name2, ...):' if all have the same
description, which saves (only) a few lines.

--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c

...

+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  calloc (1, (((void)(MEMSPACE), (SIZE))))


I am not sure whether I like that there is no 'size_t nmemb' argument
(as it is always 1) or not (given that stdlib.c's calloc has size and nmemb).
(Hence, I am fine with either.)

+/* Map the predefined allocators to the correct memory space.
+   The index to this table is the omp_allocator_handle_t enum value.
+   When the user calls omp_alloc with a predefined allocator this
+   table determines what memory they get.  */
+static const omp_memspace_handle_t predefined_alloc_mapping[] = {
+  omp_default_mem_space,   /* omp_null_allocator. */
+  omp_default_mem_space,   /* omp_default_mem_alloc. */


The first line is misleading: omp_null_allocator uses the allocator
associated with def-allocator-var ICV, i.e. any of the predefined
allocators might be used.

As omp_null_allocator is mapped to the def-allocator-var ICV or
(if unset) to the omp_default_mem_space, there should not be any
access to predefined_alloc_mapping[omp_null_allocator].

Still, the code is confusing. I think at a comment is required.
And: Either we still keep that superfluous line or we access
the array as
  predefined_alloc_mapping[predef_alloc - 1]
which IMO requires a macro or inline function to avoid having a
puzzling "-1" in the code.

* * *

I wonder whether we should have a static assert checking for
  ARRAY_SIZE (predefined_alloc_mapping) == omp_max_predefined_alloc
(or '+ 1', depending how we deal with omp_null_allocator) to ensure
better consistency. (The value is #defined in allocate.c not in omp.h,
but a static assert at least can catch one mismatch.)

[While static_assert is only in C2X alias C23, _Static_assert exists before.
(I think since C11 but GCC also accepts it with -std=c98; GCC >= 9
permits omitting the second/string argument of _Static_assert.
And with 2nd arg, it already works with GCC 7 (= oldest GCC at hand).

[I wrote ARRAY_SIZE in the sense of '#define ARRAY_SIZE(a) (sizeof (a) / sizeof 
((a)[0]))',
as e.g. defined include/libiberty.h (that file is not included in libgomp/.]

* * *

...
+  omp_low_lat_mem_space,   /* omp_cgroup_mem_alloc. */
+  omp_low_lat_mem_space,   /* omp_pteam_mem_alloc. */
+  omp_low_lat_mem_space,   /* omp_thread_mem_alloc. */


I think there should be a comment like: /* Implementation choice:  */
Thus, when later revisiting it, it is clear that it can be changed.

I think it would make sense to document the used memory space in libgomp.texi
alias https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html

Namely replacing the dash in the table by, e.g.,
'omp_low_lat_mem_space' (implementation choice)'
or something like that.

(I personally like that the documentation makes clear
- if sensibly possible - whether a piece of information in a
compiler documentation is generic (matches the spec) or is
an implementation choice. In any case, the documentation should
match what's implemented.)

Note: That omp_low_lat_mem_space == omp_default_mem_space
is already documented at
https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

Maybe the wording needs to be tweaked now as nvptx + gcn actually
handle the low-lat memory space differently. (While on the host,
a failed 'malloc' is now repeated once, which is not really observable.)

Or it is fine and "See also" to the target-specific section is enough.

(BTW: Wording/documentation suggestions and/or patches are welcome!)

* * *

+      /* Otherwise, we've already performed default mem allocation
+      and if that failed, it won't succeed again (unless it was
+      intermittent).  Return NULL then, as that is the fallback.  */


Thanks for adding the missing ')'.  (Twice.)

And thanks for the patch set in general.

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955

Re: [PATCH v2 1/3] libgomp, nvptx: low-latency memory allocator

Reply via email to