On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
> Following some feedback from users of the OG11 branch I think I need to
> withdraw this patch, for now.
>
> The memory pinned via the mlock call does not give the expected performance
> boost. I had not expected that it would do much in my test setup, given that
> the machine has a lot of RAM and my benchmarks are small, but others have
> tried more and on varying machines and architectures.

I don't understand why there should be any expected performance boost (at
least not unless the machine starts swapping out pages); { omp_atk_pinned,
true } is solely about the requirement that the memory can't be swapped out.
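Just for reference, this is roughly all that { omp_atk_pinned, true } promises
from the user's side (a minimal sketch, assuming a libgomp with allocator
trait support; the only thing asked for is that the allocation stays resident,
nothing about offloading):

#include <omp.h>
#include <stdlib.h>

int
main ()
{
  /* Ask for an allocator whose allocations must not be swapped out.  */
  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  if (al == omp_null_allocator)
    return 1;

  double *p = (double *) omp_alloc (1024 * sizeof (double), al);
  if (p == NULL)
    return 1;
  /* Use p as ordinary host memory; nothing here implies NVPTX offloading.  */
  omp_free (p, al);
  omp_destroy_allocator (al);
  return 0;
}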
> It seems that it isn't enough for the memory to be pinned, it has to be
> pinned using the Cuda API to get the performance boost. I had not done this

For performance boost of what kind of code?
I don't understand how the Cuda API could be useful (or can be used at all)
if offloading to NVPTX isn't involved.  The fact that somebody asks for host
memory allocation with omp_atk_pinned set to true doesn't mean it will be in
any way related to NVPTX offloading (unless it is in an NVPTX target region
obviously, but then mlock isn't available, so sure, if there is something
CUDA can provide for that case, nice).

> I don't think libmemkind will resolve this performance issue, although
> certainly it can be used for host implementations of low-latency memories,
> etc.

The reason for libmemkind is primarily its support of HBW memory (but
admittedly I need to find out what kinds of such memory it does support), or
the various interleaving etc. the library has.
Plus, when we have such support, as it has its own customizable allocator, it
could be used to allocate larger chunks of memory that can be mlocked and
then just allocate from that pinned memory if the user asks for small
allocations from that memory.

	Jakub
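P.S. To illustrate that last point, a rough sketch of what I mean by mlocking
one larger chunk up front and serving small pinned requests from it (the pool
size, names and the trivial bump allocation are made up for illustration; no
locking, freeing or pool growth):

#include <sys/mman.h>
#include <stddef.h>

#define POOL_SIZE (64 * 1024 * 1024)	/* arbitrary example size */

static char *pool;
static size_t pool_off;

static int
pool_init (void)
{
  pool = mmap (NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
	       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (pool == MAP_FAILED)
    return -1;
  /* One mlock call for the whole pool instead of one per allocation.  */
  if (mlock (pool, POOL_SIZE))
    return -1;
  return 0;
}

static void *
pool_alloc (size_t size)
{
  size_t aligned = (size + 15) & ~(size_t) 15;
  if (pool_off + aligned > POOL_SIZE)
    return NULL;	/* pool exhausted; a real allocator would grow it */
  void *p = pool + pool_off;
  pool_off += aligned;
  return p;
}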