On 08.12.22 13:51, Andrew Stubbs wrote:
On 08/12/2022 12:11, Jakub Jelinek wrote:
On Thu, Jul 07, 2022 at 11:34:33AM +0100, Andrew Stubbs wrote:
Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall. Pinned allocations are performed using mmap, not malloc, to
ensure that they can be unpinned safely when freed.
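
(For reference, a minimal sketch of what that mmap/mlock scheme amounts
to is given below; this is not the actual patch code, the helper names
are made up, and error handling and alignment are simplified.)

    #include <stddef.h>
    #include <sys/mman.h>

    /* Allocate SIZE bytes of pinned memory.  The mapping size is kept in
       a small header so the free routine knows how much to munmap.  */
    static void *
    pinned_alloc (size_t size)
    {
      size_t total = size + sizeof (max_align_t);
      void *p = mmap (NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
        return NULL;
      /* Pin the pages; this fails if RLIMIT_MEMLOCK would be exceeded.  */
      if (mlock (p, total) != 0)
        {
          munmap (p, total);
          return NULL;
        }
      *(size_t *) p = total;
      return (char *) p + sizeof (max_align_t);
    }

    static void
    pinned_free (void *ptr)
    {
      void *base = (char *) ptr - sizeof (max_align_t);
      munmap (base, *(size_t *) base);  /* munmap also unpins the pages.  */
    }
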
As I said before, I think pinned memory is too precious to waste this
way; we should handle the -> pinned case through memkind_create_fixed on
an mmap + mlock'ed area, so that we can create even quite small pinned
allocations.
This has been delayed due to other priorities, but our current plan is
to switch to using cudaHostAlloc, when available, and we can certainly
use memkind_create_fixed for the fallback case (including amdgcn).
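
(For concreteness, that suggestion is presumably along the lines of the
sketch below: mmap + mlock one larger arena once and let memkind carve
small pinned allocations out of it. The arena size and helper names are
made up for the example, and error handling is minimal.)

    #include <stddef.h>
    #include <sys/mman.h>
    #include <memkind.h>

    #define PINNED_ARENA_SIZE (64 * 1024 * 1024)  /* arbitrary example size */

    static memkind_t pinned_kind;

    /* Create one pinned arena and register it as a memkind "fixed" kind.  */
    static int
    pinned_arena_init (void)
    {
      void *arena = mmap (NULL, PINNED_ARENA_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (arena == MAP_FAILED
          || mlock (arena, PINNED_ARENA_SIZE) != 0)
        return -1;
      return memkind_create_fixed (arena, PINNED_ARENA_SIZE, &pinned_kind);
    }

    /* Small pinned allocations then go through the usual memkind calls:
         void *p = memkind_malloc (pinned_kind, 32);
         memkind_free (pinned_kind, p);  */
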
Regarding "when available": I assume this means that nvptx is an
'available device' (per the OpenMP definition, finally added in TR11),
i.e. there is an image for nvptx and, after omp_requires filtering, at
least one nvptx device remains.
* * *
For completeness, I want to note that OpenMP TR11 adds support for
creating memory spaces that are accessible from multiple devices, e.g.
host + one/all devices, and adds some convenience functions for the
latter (all devices, host plus a specific device, etc.); see TR11 at
https://openmp.org/specifications/ (Appendix B.2 has the release notes,
esp. for Section 6.2).
I think it makes sense to keep those additions in mind when doing the
actual implementation, to avoid incompatibilities.
Side note regarding ompx_ additions proposed in
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597979.html (adds
ompx_pinned_mem_alloc),
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/597983.html
(ompx_unified_shared_mem_alloc and ompx_host_mem_alloc;
ompx_unified_shared_mem_space and ompx_host_mem_space).
While TR11 does not add any predefined allocators or new memory spaces,
calling e.g. omp_get_devices_all_allocator(memspace) returns a
unified-shared-memory allocator.
I note that LLVM does not seem to have any ompx_ in this regard (yet?).
(It has some ompx_ – but related to assumptions.)
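
(For illustration, user code would presumably use the proposed allocator
as sketched below; ompx_pinned_mem_alloc is only what the patch linked
above proposes and is not part of any OpenMP release, while the second
variant is the portable OpenMP 5.x spelling via the pinned trait.)

    #include <omp.h>

    void
    example (void)
    {
      /* Proposed GNU extension from the patch linked above.  */
      void *p = omp_alloc (1024, ompx_pinned_mem_alloc);
      omp_free (p, ompx_pinned_mem_alloc);

      /* Portable OpenMP 5.x equivalent: an allocator with the pinned trait.  */
      omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
      omp_allocator_handle_t al
        = omp_init_allocator (omp_default_mem_space, 1, traits);
      void *q = omp_alloc (1024, al);
      omp_free (q, al);
      omp_destroy_allocator (al);
    }
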
Using Cuda might be trickier to implement because there's a layering
violation inherent in routing target-independent allocations through
the nvptx plugin, but benchmarking shows that that's the only way to
get the faster path through the Cuda black box; being pinned is good
because it avoids page faults, but apparently if Cuda *knows* it is
pinned then you get a speed boost even when there would be *no* faults
(i.e. on a quiet machine). Additionally, Cuda somehow ignores the
OS-defined limits.
I wonder whether, on a NUMA machine (and for non-offloading access),
using memkind_create_fixed will have an advantage over cuMemHostAlloc or
not. (BTW, I find cuMemHostAlloc vs. cuMemAllocHost confusing.) And if
so, whether we should provide a means (a GOMP_... env var?) to toggle
the preference.
My feeling is that, on most systems, it does not matter, except (a)
possibly for large NUMA systems, where the memkind tuning will probably
make a difference, and (b) we know that CUDA's cuMem(HostAlloc/AllocHost)
is faster with nvptx offloading. (cuMem(HostAlloc/AllocHost) also permits
DMA from the device, provided unified addressing is supported, which is
the case [cf. comment + assert in plugin-nvptx.c].)
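
(For reference, the CUDA side of such a scheme would presumably boil
down to driver-API calls like the ones sketched below; this is only an
illustration with made-up helper names, not what the nvptx plugin
actually does.)

    #include <stddef.h>
    #include <cuda.h>

    /* Allocate pinned host memory via the CUDA driver API; assumes an
       initialized context, as the nvptx plugin already has.  */
    static void *
    cuda_pinned_alloc (size_t size)
    {
      void *p;
      /* CU_MEMHOSTALLOC_PORTABLE makes the buffer pinned for all contexts.  */
      if (cuMemHostAlloc (&p, size, CU_MEMHOSTALLOC_PORTABLE) != CUDA_SUCCESS)
        return NULL;
      return p;
    }

    static void
    cuda_pinned_free (void *p)
    {
      cuMemFreeHost (p);
    }
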
Tobias