On 04/01/2022 18:47, Jakub Jelinek wrote:
On Tue, Jan 04, 2022 at 07:28:29PM +0100, Jakub Jelinek via Gcc-patches wrote:
Other issues in the patch are that it doesn't munlock on deallocation and
that because of that deallocation we need to figure out what to do on page
boundaries. As documented, mlock can be passed address and/or address +
size that aren't at page boundaries and pinning happens even just for
partially touched pages. But munlock also unpins even the partially
overlapping pages, and at that point we don't know whether other pinned
allocations live in those pages.
Right, it doesn't munlock because of these issues. I don't know of any way
to solve this that wouldn't involve building tables of locked ranges (and
knowing what the page size is).
I considered using mmap with the lock flag instead, but the failure mode
looked unhelpful. I guess we could mmap with the regular flags, then mlock
after. That should bypass the regular heap and ensure each allocation has
its own page. I'm not sure what the unintended side-effects of that might
be.
But the munlock is even more important because of the low ulimit -l: if
munlock isn't done on deallocation, the limit (64KB by default, I think)
will be reached much earlier. If most users have just a 64KB limit on
pinned memory per process, then that most likely calls for grabbing such
memory in whole pages and doing memory management on that resource.
Because wasting that precious memory on partial pages, which will most
likely also get non-pinned allocations, is a big waste when we have just 16
such pages.
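
For reference, the limit in question can be queried at run time with
getrlimit (RLIMIT_MEMLOCK). The sketch below uses hypothetical helper names
(pages_for_limit, lockable_pages); the "16 pages" figure above corresponds
to a 64KB limit with 4KB pages, and the page size is something the code
has to ask for rather than assume.

```c
#include <limits.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: how many whole pages fit in LIMIT_BYTES.  */
long
pages_for_limit (unsigned long long limit_bytes, long page_size)
{
  if (page_size <= 0)
    return -1;
  unsigned long long n = limit_bytes / (unsigned long long) page_size;
  return n > (unsigned long long) LONG_MAX ? LONG_MAX : (long) n;
}

/* Hypothetical helper: number of whole pages the current soft
   RLIMIT_MEMLOCK allows, LONG_MAX if unlimited, -1 on error.  */
long
lockable_pages (void)
{
  struct rlimit rl;
  long ps = sysconf (_SC_PAGESIZE);

  if (ps <= 0 || getrlimit (RLIMIT_MEMLOCK, &rl) != 0)
    return -1;
  if (rl.rlim_cur == RLIM_INFINITY)
    return LONG_MAX;
  return pages_for_limit ((unsigned long long) rl.rlim_cur, ps);
}
```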
E.g. if we start using (dynamically, using dlopen/dlsym etc.) the memkind
library for some of the allocators, for the pinned memory we could use
e.g. the memkind_create_fixed API - on the first pinned allocation, check
what is the ulimit -l and if it is fairly small, mmap PROT_NONE the whole
pinned size (but don't pin it whole at start, just whatever we need as we
go).
I don't believe 64KB will be anything like enough for any real HPC
application. Is it really worth optimizing for this case?
Anyway, I'm working on an implementation using mmap instead of malloc
for pinned allocations. I figure that will simplify the unpin algorithm
(because it'll be munmap) and optimize for large allocations such as I
imagine HPC applications will use. It won't fix the ulimit issue.
Andrew