Hi Andrew, some early comments. I think in general, the direction/patches are fine, but I have some comments:
On 02.08.23 19:00, Andrew Stubbs wrote:
This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib.
This probably fits better to 2/3 in the series, but you really should document the nvptx part, namely: - that omp_low_lat_mem_space is supported on nvptx - its limitations (access is restricted to the contention group, i.e. all threads of a team) → implication on the supported allocators. - the default size of this memory (8 kiB) and the GOMP_NVPTX_LOWLAT_POOL environment variable, possibly with mentioning that there is some internal overhead* which is worsen when using high alignment values. (* – due to basic_allocator book keeping and for storing pointer to the OpenMP allocator struct.) - if I understand it correctly, our default build supports sm_30 and uses PTX ISA version 3.1 for it. If so, I think we should mention that nvptx GCC has to be configured with with-arch=sm_... >= sm_53 (= supported version >=4.1) and, during compilation, no -march= < that configure-time value may be specified. (Cf. also https://gcc.gnu.org/install/specific.html#nvptx-x-none ) I think this best fits into https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html – but one could also argue that it should be put elsewhere. It probably makes sense to add a 'See also:' to https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html pointing to https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html * * * BTW: I think the following should be "...MINOR__ >= 1":
+#if __PTX_ISA_VERSION_MAJOR__ > 4 \ + || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MAJOR__ >= 1)
* * * Regarding patch 2/3 and MEMSPACE_VALIDATE. In general, I wonder how to handle memory spaces (and traits) that aren't supported. Namely, when to return 0L and when to silently use ignore the trait / use another memory space. The current omp_init_allocator code only returns omp_null_allocator for invalid value – or for pinned memory (as it is unsupported). [RFC: Shall we keep doing so – or return omp_null_mem_alloc more often? → https://gcc.gnu.org/PR111044 for this question, improving libmemkind usage, and extending the allocator-related documentation.] As we do it on the host, I think auto-fallback to omp_default_mem_space is is also find for nvptx (and gcn), but not as done in 2/3 but slightly different: (a) In omp_init_allocator, there should be a check whether it is supported, if not, we can fallback to using default memory space. (In line with the current code host + 1/2+2/3 nvptx behaviour.) Note: That's not the same as the current 2/3 patch. Currently, if MEMSPACE_VALIDATE fails, a retry is attempted – but the outcome depends on the value for 'fallback'. When changing the memory space during omp_init_allocator, only failed 'malloc' will give abort with abort_fb. (b) For nvptx_memspace_validate, I think an additional check should be done based on the __PTX_ISA_VERSION* as it feels off if plugin first claims support for it but later unconditionally uses malloc at runtime. (c) We also need to handle omp_low_lat_mem_alloc. I think the spec implies access:all but nvptx/gcn only support cgroup (+ pteams + thread), potentially leading to wrong code. Example (hopefully, I got the syntax right: #pragma omp target uses_allocator(omp_low_lat_mem_alloc) #pragma omp teams firstprivate(var) allocate(omp_low_lat_mem_alloc: var) #pragma omp distribute parallel for ... #omp atomic ... ... var ... The current 2/3 checks in alloc/calloc/realloc only cover user-defined allocators; if we move the check for user-defined allocators to omp_init_allocator, we actually only need to handle predefined allocators in alloc/calloc/realloc. And finally: As mentioned off list, I believe that for the patch 2/3, the pteam should be cgroup (contention group), i.e. about all threads of a team / implicit parallel and not only the innermost parallel (pteam). That actually matches the "access != all" check, but I think "access = cgroup" should also be tested for in the testsuite. * * * 3/3 patch for GCN: I think the situation is similar, except that there is no ISA version issue and most is handled by 1/3 and 2/3 such that only updating documentation remains. * * *
libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): New macro. (MEMSPACE_CALLOC): New macro. (MEMSPACE_REALLOC): New macro. (MEMSPACE_FREE): New macro.
BTW: You could (but are not required to) combine multiple macro/function names to a single '(name1, name2, ...):' if all have the same description, which saves (only) a few lines.
--- a/libgomp/allocator.c +++ b/libgomp/allocator.c
...
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \ + calloc (1, (((void)(MEMSPACE), (SIZE))))
I am not sure whether I like that there is no 'size_t nmemb' argument (as it is always 1) or not (given that stdlib.c's calloc has size and nmemb). (Hence, I am fine with either.)
+/* Map the predefined allocators to the correct memory space. + The index to this table is the omp_allocator_handle_t enum value. + When the user calls omp_alloc with a predefined allocator this + table determines what memory they get. */ +static const omp_memspace_handle_t predefined_alloc_mapping[] = { + omp_default_mem_space, /* omp_null_allocator. */ + omp_default_mem_space, /* omp_default_mem_alloc. */
The first line is misleading: omp_null_allocator uses the allocator associated with def-allocator-var ICV, i.e. any of the predefined allocators might be used. As omp_null_allocator is mapped to the def-allocator-var ICV or (if unset) to the omp_default_mem_space, there should not be any access to predefined_alloc_mapping[omp_null_allocator]. Still, the code is confusing. I think at a comment is required. And: Either we still keep that superfluous line or we access the array as predefined_alloc_mapping[predef_alloc - 1] which IMO requires a macro or inline function to avoid having a puzzling "-1" in the code. * * * I wonder whether we should have a static assert checking for ARRAY_SIZE (predefined_alloc_mapping) == omp_max_predefined_alloc (or '+ 1', depending how we deal with omp_null_allocator) to ensure better consistency. (The value is #defined in allocate.c not in omp.h, but a static assert at least can catch one mismatch.) [While static_assert is only in C2X alias C23, _Static_assert exists before. (I think since C11 but GCC also accepts it with -std=c98; GCC >= 9 permits omitting the second/string argument of _Static_assert. And with 2nd arg, it already works with GCC 7 (= oldest GCC at hand). [I wrote ARRAY_SIZE in the sense of '#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))', as e.g. defined include/libiberty.h (that file is not included in libgomp/.] * * *
... + omp_low_lat_mem_space, /* omp_cgroup_mem_alloc. */ + omp_low_lat_mem_space, /* omp_pteam_mem_alloc. */ + omp_low_lat_mem_space, /* omp_thread_mem_alloc. */
I think there should be a comment like: /* Implementation choice: */ Thus, when later revisiting it, it is clear that it can be changed. I think it would make sense to document the used memory space in libgomp.texi alias https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html Namely replacing the dash in the table by, e.g., 'omp_low_lat_mem_space' (implementation choice)' or something like that. (I personally like that the documentation makes clear - if sensibly possible - whether a piece of information in a compiler documentation is generic (matches the spec) or is an implementation choice. In any case, the documentation should match what's implemented.) Note: That omp_low_lat_mem_space == omp_default_mem_space is already documented at https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html Maybe the wording needs to be tweaked now as nvptx + gcn actually handle the low-lat memory space differently. (While on the host, a failed 'malloc' is now repeated once, which is not really observable.) Or it is fine and "See also" to the target-specific section is enough. (BTW: Wording/documentation suggestions and/or patches are welcome!) * * *
+ /* Otherwise, we've already performed default mem allocation + and if that failed, it won't succeed again (unless it was + intermittent). Return NULL then, as that is the fallback. */
Thanks for adding the missing ')'. (Twice.) And thanks for the patch set in general. Tobias ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955