Thanks for the detailed answer, Mark!

On Sunday, December 29, 2019, Mark Johnston <ma...@freebsd.org> wrote:
> On Sun, Dec 29, 2019 at 03:39:55AM +0100, Oliver Pinter wrote:
> > Are there any performance measurements from before and after? It
> > would be nice to see them.
>
> I did not do extensive benchmarking.  The aim of the patch set was
> simply to remove the use of the hashed page lock, since it shows up
> prominently in lock profiles of some workloads.  The problem is that we
> acquire these locks any time a page's LRU state is updated, and the use
> of the hash lock means that we get false sharing.  The solution is to
> implement these state updates using atomic operations on the page
> structure itself, making data contention much less likely.  Another
> option was to embed a mutex in the vm_page structure, but that would
> bloat a structure which is already too large.
>
> A secondary goal was to reduce the number of locks held during page
> queue scans.  Such scans frequently call pmap_ts_referenced() to
> collect information about recent references to the page.  This
> operation can be expensive, since it may require a TLB shootdown, and
> it can block for a long time on the pmap lock, for example if the lock
> holder is copying the page tables as part of a fork().  Now the active
> queue scan body is executed without any locks held, so a page daemon
> thread blocked on a pmap lock no longer has the potential to block
> other threads by holding on to a shared page lock.  Before, the page
> daemon could block faulting threads for a long time, hurting latency.
> I don't have any benchmarks that capture this, but it's something I
> have observed in production workloads.
>
> I used some microbenchmarks to verify that the change did not penalize
> the single-threaded case.  Here are some results on a 64-core arm64
> system I have been playing with:
> https://people.freebsd.org/~markj/arm64_page_lock/
>
> The benchmark from will-it-scale simply maps 128MB of anonymous
> memory, faults on each page, and unmaps it, in a loop.
> In the fault handler we allocate a page and insert it into the active
> queue, and the unmap operation removes all of those pages from the
> queue.  I collected the throughput for 1, 2, 4, 8, 16 and 32
> concurrent processes.
>
> With my patches we see some modest gains at low concurrency.  At
> higher levels of concurrency we actually get lower throughput than
> before, as contention moves from the page locks and the page queue
> lock to just the page queue lock.  I don't believe this is a real
> regression: first, the benchmark is quite extreme relative to any
> useful workload, and second, arm64 suffers from using a much smaller
> batch size than amd64 for batched page queue operations.  Changing
> that pushes the results out somewhat.  Some earlier testing on a
> 2-socket Xeon system showed a similar pattern with smaller
> differences.

_______________________________________________
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"