> For example, it's documented that 'cuMemHostAlloc',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".

OK, interesting. I had not seen this, but I think it confirms that the
performance difference is within CUDA, and that regular locked memory is
not so great.

> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?

Yes, this is the case.

> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case, too; presumably again using this
> interface likely circumvents any "annoying" 'ulimit' limitations?)
>
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
>
> ..., but maybe I'm missing some crucial "detail" here?

I'm investigating this stuff for the AMD USM implementation as well
right now. It might be a good way to handle static and stack data too.
Or not.

Andrew
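
For illustration only, the *register* abstraction discussed above might
be sketched in C roughly as follows. Everything here is hypothetical:
the 'offload_plugin' struct, the hook names, and the mock backend are
made up for this sketch and are not libgomp's actual plugin interface.
A real CUDA plugin would presumably call 'cuMemHostRegister' and
'cuMemHostUnregister' where the mock below only counts bytes, so that
the sketch builds and runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-plugin hook table (NOT libgomp's real interface).
   Each hook returns 0 on success and nonzero on failure.  */
struct offload_plugin
{
  const char *name;
  int (*register_host_mem) (void *ptr, size_t size);
  int (*unregister_host_mem) (void *ptr, size_t size);
};

/* Mock backend state: number of bytes currently "registered".  */
size_t mock_registered_bytes = 0;

/* Stand-in for a device plugin's register hook.  Assumption: a CUDA
   plugin would call cuMemHostRegister (ptr, size, 0) here instead.  */
int
mock_register (void *ptr, size_t size)
{
  if (ptr == NULL || size == 0)
    return 1;
  mock_registered_bytes += size;
  return 0;
}

/* Stand-in for the unregister hook.  Assumption: a CUDA plugin would
   call cuMemHostUnregister (ptr) here instead.  */
int
mock_unregister (void *ptr, size_t size)
{
  if (ptr == NULL || size > mock_registered_bytes)
    return 1;
  mock_registered_bytes -= size;
  return 0;
}

const struct offload_plugin mock_plugin
  = { "mock", mock_register, mock_unregister };

/* Register existing (e.g. malloc'ed, static, or stack) memory with one
   plugin.  No manual mlock is done here; that is left to the plugin.  */
int
register_existing (const struct offload_plugin *plugin, void *ptr,
		   size_t size)
{
  assert (plugin != NULL);
  return plugin->register_host_mem (ptr, size);
}

/* Undo a previous register_existing call for the same range.  */
int
unregister_existing (const struct offload_plugin *plugin, void *ptr,
		     size_t size)
{
  assert (plugin != NULL);
  return plugin->unregister_host_mem (ptr, size);
}
```

A caller would pass an already-allocated buffer to 'register_existing'
before the first transfer and to 'unregister_existing' before freeing
it. Keeping the hooks behind a table is just one possible design; it
lets each plugin decide how (or whether) to pin the range.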