On 12/20/21 16:58, Andrew Stubbs wrote:
This patch is submitted now for review, and so that I can commit a backport
of it to the OG11 branch, but it isn't suitable for mainline until stage 1.
The patch implements support for omp_low_lat_mem_space and
omp_low_lat_mem_alloc on NVPTX offload devices. The omp_pteam_mem_alloc,
omp_cgroup_mem_alloc and omp_thread_mem_alloc allocators are also
configured to use this space (this is to match the current or intended
behaviour in other toolchains).
The memory is drawn from the ".shared" space that is accessible only
from within the team in which it is allocated, and which effectively
ceases to exist when the kernel exits. By default, 8 KiB of space is
reserved for each team at launch time. This can be adjusted, at runtime,
via a new environment variable "GOMP_NVPTX_LOWLAT_POOL". Reserving a
larger amount may limit the number of teams that can be run in parallel
(due to hardware limitations). Conversely, reducing the allocation may
increase the number of teams that can be run in parallel. (I have not
yet attempted to tune the default too precisely.) The actual maximum
size will vary according to the available hardware and the number of
variables that the compiler has placed in .shared space.
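To make the trade-off above concrete, here is a rough sketch of the arithmetic, assuming shared memory is the only limiting resource. The helper name and the 48 KiB figure used in the note below are illustrative assumptions, not values taken from the patch or from any particular device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libgomp: estimate how many teams fit on
   one multiprocessor if .shared memory were the only constraint.
   All sizes are in bytes.  */
size_t
teams_per_sm (size_t shared_per_sm, size_t static_shared, size_t lowlat_pool)
{
  /* Each team needs its compiler-placed .shared variables plus the pool.  */
  size_t per_team = static_shared + lowlat_pool;
  if (per_team == 0)
    return SIZE_MAX;  /* no shared-memory constraint to model */
  return shared_per_sm / per_team;
}
```

With an assumed 48 KiB of shared memory and no other .shared variables, the default 8 KiB pool would allow 6 teams by this estimate, and a 16 KiB pool would allow 3. Real limits also depend on registers, thread counts and the driver.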
The allocator implementation is designed to add no more space overhead
than omp_alloc already does (aside from rounding allocations up to a
multiple of 8 bytes), thus the internal free and realloc must be told
how big the original allocation was. The free algorithm maintains an
in-order linked-list of free memory chunks. Memory is allocated on a
first-fit basis.
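For readers who want a picture of the scheme described above, here is a minimal host-side sketch. It is not the code from the patch: all names are hypothetical, the pool is an ordinary malloc'd buffer standing in for .shared memory, and merging adjacent free chunks is my assumption about what the in-order list is used for. The 8-byte header lives only inside free chunks, which is why free has to be told the size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE 1024u  /* hypothetical pool size, a multiple of 8 */
#define NIL UINT32_MAX   /* end-of-list marker */

/* Header stored inside each *free* chunk; allocated chunks carry none.  */
struct chunk { uint32_t next; uint32_t size; };

static unsigned char *pool;       /* stands in for the per-team pool */
static uint32_t free_head = NIL;  /* offset of the first free chunk */

static struct chunk *at (uint32_t off) { return (struct chunk *) (pool + off); }
static uint32_t round8 (size_t n) { return (uint32_t) ((n + 7) & ~(size_t) 7); }

int
pool_init (void)
{
  pool = malloc (POOL_SIZE);
  if (!pool)
    return -1;
  free_head = 0;  /* one free chunk covering the whole pool */
  at (0)->next = NIL;
  at (0)->size = POOL_SIZE;
  return 0;
}

void *
pool_alloc (size_t size)
{
  if (size == 0 || size > POOL_SIZE)
    return NULL;
  uint32_t need = round8 (size);
  /* First fit: take the first chunk in address order that is big enough.  */
  for (uint32_t *link = &free_head; *link != NIL; link = &at (*link)->next)
    {
      uint32_t off = *link;
      struct chunk *c = at (off);
      if (c->size < need)
        continue;
      if (c->size == need)
        *link = c->next;  /* exact fit: unlink the chunk */
      else
        {
          /* Split: the remainder takes the chunk's place in the list.  */
          uint32_t rest = off + need;
          at (rest)->next = c->next;
          at (rest)->size = c->size - need;
          *link = rest;
        }
      return pool + off;
    }
  return NULL;  /* caller handles the fall-back */
}

void
pool_free (void *ptr, size_t size)
{
  if (!ptr)
    return;
  uint32_t off = (uint32_t) ((unsigned char *) ptr - pool);
  uint32_t len = round8 (size);  /* size must match the original request */
  uint32_t *link = &free_head, prev = NIL;
  while (*link != NIL && *link < off)  /* keep the list in address order */
    {
      prev = *link;
      link = &at (*link)->next;
    }
  struct chunk *c = at (off);
  c->size = len;
  c->next = *link;
  if (c->next != NIL && off + len == c->next)  /* merge with next chunk */
    {
      c->size += at (c->next)->size;
      c->next = at (c->next)->next;
    }
  if (prev != NIL && prev + at (prev)->size == off)  /* merge with previous */
    {
      at (prev)->size += c->size;
      at (prev)->next = c->next;
    }
  else
    *link = off;
}
```

A realloc in the same style would likewise need the old size passed in, since nothing is recorded next to an allocated block.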
If the allocation fails, the NVPTX allocator returns NULL and omp_alloc
handles the fall-back. Because such failures are now likely to happen
(low-latency memory is small), this patch also implements appropriate
fall-back modes for the predefined allocators (fall-back for custom
allocators already worked).
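The division of labour described above (the low-latency allocator just returns NULL, the caller decides what to do next) can be sketched as follows. This is only an illustration assuming a fall-back to default memory; the function names and the 64-byte budget are hypothetical, and malloc stands in for both memory spaces.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the low-latency allocator: it hands out memory
   until a small fixed budget is used up, then reports failure with NULL.  */
static size_t lowlat_left = 64;

void *
lowlat_try_alloc (size_t size)
{
  if (size > lowlat_left)
    return NULL;  /* pool exhausted: no error handling at this level */
  lowlat_left -= size;
  return malloc (size);  /* malloc stands in for .shared memory here */
}

/* Caller-side policy: try low-latency memory first, then fall back to
   default memory.  *used_fallback is set to 1 if the fall-back was taken.  */
void *
alloc_with_fallback (size_t size, int *used_fallback)
{
  void *p = lowlat_try_alloc (size);
  *used_fallback = 0;
  if (p)
    return p;
  *used_fallback = 1;
  return malloc (size);
}
```

Keeping the policy in the caller means the small allocator never needs to know which fall-back mode is configured.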
In order to support the %dynamic_smem_size PTX feature it is necessary
to bump the minimum supported PTX version from 3.1 (~2013) to 4.1 (~2014).
I applied the patch (but used the libgomp/configure.tgt patch to force
-mptx=4.1, rather than changing the default).
I then ran the tests (using export GOMP_NVPTX_JIT=-O0 to work around
known driver problems) and observed these extra FAILs:
...
FAIL: libgomp.c/../libgomp.c-c++-common/alloc-7.c execution test
FAIL: libgomp.c/../libgomp.c-c++-common/alloc-8.c execution test
FAIL: libgomp.c/allocators-1.c (test for excess errors)
FAIL: libgomp.c/allocators-2.c (test for excess errors)
FAIL: libgomp.c/allocators-3.c (test for excess errors)
FAIL: libgomp.c/allocators-4.c (test for excess errors)
FAIL: libgomp.c/allocators-5.c (test for excess errors)
FAIL: libgomp.c/allocators-6.c (test for excess errors)
FAIL: libgomp.c++/../libgomp.c-c++-common/alloc-7.c execution test
FAIL: libgomp.c++/../libgomp.c-c++-common/alloc-8.c execution test
FAIL: libgomp.fortran/alloc-10.f90 -O execution test
FAIL: libgomp.fortran/alloc-9.f90 -O execution test
...
The allocators-1.c test-case doesn't compile because:
...
FAIL: libgomp.c/allocators-1.c (test for excess errors)
Excess errors:
/home/vries/oacc/trunk/source-gcc/libgomp/testsuite/libgomp.c/allocators-1.c:7:22:
sorry, unimplemented: 'dynamic_allocators' clause on 'requires' directive not supported yet
UNRESOLVED: libgomp.c/allocators-1.c compilation failed to produce
executable
...
So, I suppose I need "[PATCH] OpenMP front-end: allow requires
dynamic_allocators" as well, I'll try again with that applied.
The alloc-7.c execution test failure is a regression, AFAICT. It fails
here:
...
38   if ((((uintptr_t) p) % __alignof (int)) != 0 || p[0] || p[1] || p[2])
39     abort ();
...
because:
...
(gdb) p p[0]
$2 = 772014104
(gdb) p p[1]
$3 = 0
(gdb) p p[2]
$4 = 9
...
In other words, the pointer returned by omp_calloc does not point to
zeroed-out memory.
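For reference, the property the test relies on can be sketched like this: a calloc-style wrapper has to clear the memory itself whenever the underlying allocator may hand back recycled, non-zero memory. This is not a diagnosis of the failure above. The names are hypothetical, and raw_alloc deliberately returns dirty memory to model a reused pool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical underlying allocator that returns non-zero ("dirty") memory,
   as a reused pool might.  */
void *
raw_alloc (size_t size)
{
  void *p = malloc (size);
  if (p)
    memset (p, 0xA5, size);
  return p;
}

/* calloc-style wrapper: check the multiplication for overflow, then clear
   the block explicitly rather than trusting the allocator.  */
void *
zeroing_calloc (size_t nmemb, size_t size)
{
  if (size != 0 && nmemb > SIZE_MAX / size)
    return NULL;  /* nmemb * size would overflow */
  size_t total = nmemb * size;
  void *p = raw_alloc (total ? total : 1);
  if (p)
    memset (p, 0, total);
  return p;
}
```

With a wrapper of this shape, a check like the one at line 38 of alloc-7.c passes regardless of what the memory held before.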
Thanks,
- Tom