OK, so relating to the freelist, I have an idea for uvm_fpageqlock. This would be a mid-term solution that doesn't take direct account of NUMA and so forth.
Looking at all of the accounting data that decides how the system behaves, such as uvmexp.free and so on (currently protected by uvm_fpageqlock), we don't really need locked access when reading it, because we continually check and re-check those values. So the system will eventually sort itself out even if we get a bad picture of things from time to time. All that really matters is that we update the values consistently, using atomics or locks. So uvm_fpageqlock is not needed there.

uvm_fpageqlock also protects one set of data that is not directly related to free memory: the pagedaemon wakeup and pageout state. We could put in a new low-traffic mutex there, say uvm_pageout_lock. (Incidentally, it looks like updates to uvmexp.pdpending might be racy; just noting it here so I remember.)

So that leaves only the page allocator needing uvm_fpageqlock. Currently the page allocator maintains per-CPU and global lists of free pages. Pages reside on both lists. We prefer to hand out pages from the per-CPU list: on machines with physically indexed caches, it's likely that we'll have lines from those pages in cache on the local CPU, which is beneficial when it comes time to fill the pages. All lists are protected by uvm_fpageqlock.

What I propose is to maintain the global list of pages pretty much as is, but to split off the per-CPU lists so that they have their own locks. With the exception of uvm_pglistalloc(), they would only be accessed by the local CPU, effectively functioning as a local cache of free pages. When allocating, we'd try the local list first and fall back to the global list if no pages are available locally. When freeing, we'd always put the page back on the local list. When allocating from and freeing back to this local list we would not touch any global state, not even uvmexp.free. The idlezero code would only consider the local list of pages.

At some point we'd need to redistribute those cached pages back to the global list of free pages.
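
Before getting to redistribution, here is a rough userland C model of the allocate and free paths described above. It is only a sketch under loud assumptions: struct page, struct cpu_cache, page_alloc(), page_free() and NCOLORS are all made-up names, not the real UVM structures or functions; locking is shown only as comments; and unlike the real allocator it does not try other colors on a miss. An atomic counter stands in for uvmexp.free to show that only the global path updates it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCOLORS 4	/* hypothetical number of page colors */

struct page {
	struct page *next;	/* free list linkage */
	unsigned color;
};

/* One singly linked free list per color. */
struct freelist {
	struct page *head[NCOLORS];
};

static struct freelist global_free;	/* would stay under uvm_fpageqlock */
static atomic_int global_nfree;		/* stands in for uvmexp.free */

struct cpu_cache {
	struct freelist fl;	/* would be covered by its own per-CPU lock */
	int ncached;		/* local count only, not part of global state */
};

static struct page *
list_pop(struct freelist *fl, unsigned color)
{
	struct page *pg;

	assert(color < NCOLORS);
	pg = fl->head[color];
	if (pg != NULL) {
		fl->head[color] = pg->next;
		pg->next = NULL;
	}
	return pg;
}

static void
list_push(struct freelist *fl, struct page *pg)
{
	assert(pg->color < NCOLORS);
	pg->next = fl->head[pg->color];
	fl->head[pg->color] = pg;
}

/* Allocate: local cache first, global list only on a local miss. */
static struct page *
page_alloc(struct cpu_cache *cc, unsigned color)
{
	struct page *pg = list_pop(&cc->fl, color);

	if (pg != NULL) {
		cc->ncached--;		/* no global state touched */
		return pg;
	}
	/* The global lock would be taken around this pop. */
	pg = list_pop(&global_free, color);
	if (pg != NULL)
		atomic_fetch_sub(&global_nfree, 1);
	return pg;
}

/* Free: always back to the local cache, never to the global list. */
static void
page_free(struct cpu_cache *cc, struct page *pg)
{
	list_push(&cc->fl, pg);
	cc->ncached++;
}
```

Note that in this model a page freed locally disappears from the global count until it is handed back, which is why the hand-back step matters.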
Redistribution would be a fairly neat and tidy operation: all we'd need to do is go through the color buckets, chop the list of pages out of each and splice it onto the head of the global list, then do some accounting updates (e.g. uvmexp.free). I'm thinking this redistribution should happen fairly regularly, so perhaps we could change the xcall thread on each CPU to wake up once per second: change the cv_wait() in xc_thread() into a cv_timedwait(), and have it hand back cached pages if (a) that has not been done recently or (b) the system is struggling. The pagedaemon would get code to directly trigger the redistribution when under pressure, but I am thinking that some sort of rate limiting would be needed.

Thoughts?
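
For concreteness, the hand-back step might look roughly like the userland C model below. Again this is a sketch, not kernel code: struct bucket, cache_drain(), cache_tick(), NCOLORS and the 5-second DRAIN_INTERVAL are all invented for illustration, locking is left out, and a plain int stands in for uvmexp.free. Each bucket keeps a tail pointer so the splice costs the same no matter how many pages are cached. Rate limiting of a pagedaemon-triggered drain is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCOLORS 4		/* hypothetical number of page colors */
#define DRAIN_INTERVAL 5	/* seconds; hypothetical value for "recently" */

struct page {
	struct page *next;
};

struct bucket {
	struct page *head;
	struct page *tail;	/* kept so a splice needs no list walk */
	int count;
};

struct cpu_cache {
	struct bucket color[NCOLORS];
	long last_drain;	/* time of the last hand-back, in seconds */
};

static struct bucket global_color[NCOLORS];	/* global free list */
static int global_nfree;			/* stands in for uvmexp.free */

/* Add one page at the head of a bucket (what a local free would do). */
static void
bucket_push(struct bucket *b, struct page *pg)
{
	pg->next = b->head;
	b->head = pg;
	if (b->tail == NULL)
		b->tail = pg;
	b->count++;
}

/*
 * Hand every cached page back: per color, chop the local chain out and
 * splice it onto the head of the global bucket, then do the accounting
 * once at the end. Returns the number of pages moved.
 */
static int
cache_drain(struct cpu_cache *cc, long now)
{
	int moved = 0;

	for (int c = 0; c < NCOLORS; c++) {
		struct bucket *l = &cc->color[c];
		struct bucket *g = &global_color[c];

		if (l->head == NULL)
			continue;
		assert(l->tail != NULL);
		l->tail->next = g->head;	/* local chain goes in front */
		if (g->tail == NULL)
			g->tail = l->tail;
		g->head = l->head;
		g->count += l->count;
		moved += l->count;
		l->head = l->tail = NULL;
		l->count = 0;
	}
	global_nfree += moved;		/* single accounting update */
	cc->last_drain = now;
	return moved;
}

/* What the periodic wakeup would call: drain if overdue or struggling. */
static int
cache_tick(struct cpu_cache *cc, long now, bool struggling)
{
	if (!struggling && now - cc->last_drain < DRAIN_INTERVAL)
		return 0;
	return cache_drain(cc, now);
}
```

In the real thing, cache_tick() would correspond to whatever xc_thread() does when its cv_timedwait() times out, with "struggling" derived from the usual free page targets.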
