On Sun, Dec 29, 2019 at 03:39:55AM +0100, Oliver Pinter wrote:
> Is there any performance measurement from before and after. It would be
> nice to see them.
I did not do extensive benchmarking. The aim of the patch set was simply to remove the use of the hashed page lock, since it shows up prominently in lock profiles of some workloads. The problem is that we acquire these locks any time a page's LRU state is updated, and the use of the hash lock means that we get false sharing. The solution is to implement these state updates using atomic operations on the page structure itself, making data contention much less likely. Another option was to embed a mutex into the vm_page structure, but this would bloat a structure which is already too large.

A secondary goal was to reduce the number of locks held during page queue scans. Such scans frequently call pmap_ts_referenced() to collect information about recent references to the page. This operation can be expensive since it may require a TLB shootdown, and it can block for a long time on the pmap lock, for example if the lock holder is copying the page tables as part of a fork(). Now the active queue scan body is executed without any locks held, so a page daemon thread blocked on a pmap lock no longer has the potential to block other threads by holding on to a shared page lock. Before, the page daemon could block faulting threads for a long time, hurting latency. I don't have any benchmarks that capture this, but it's something that I've observed in production workloads.

I used some microbenchmarks to verify that the change did not penalize the single-threaded case. Here are some results on a 64-core arm64 system I have been playing with:

https://people.freebsd.org/~markj/arm64_page_lock/

The benchmark from will-it-scale simply maps 128MB of anonymous memory, faults on each page, and unmaps it, in a loop. In the fault handler we allocate a page and insert it into the active queue, and the unmap operation removes all of those pages from the queue. I collected the throughput for 1, 2, 4, 8, 16 and 32 concurrent processes.

With my patches we see some modest gains at low concurrency. At higher levels of concurrency we actually get lower throughput than before, as contention moves from the page locks and the page queue lock to just the page queue lock. I don't believe this is a real regression: first, the benchmark is quite extreme relative to any useful workload, and second, arm64 suffers from using a much smaller batch size than amd64 for batched page queue operations. Changing that pushes the results out somewhat. Some earlier testing on a 2-socket Xeon system showed a similar pattern with smaller differences.
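
To make the "atomic operations on the page structure" point a bit more concrete, here is a rough sketch of the idea, assuming the page's queue index and queue-state flags are packed into a single 32-bit word. This is only an illustration, not the committed code; the names (struct page_astate, PGA_*, page_activate()) are made up for the example.

/*
 * Illustrative sketch only: pack a page's queue index and queue-state
 * flags into one 32-bit word so that LRU state transitions can be done
 * with compare-and-swap instead of acquiring a hashed page lock.
 */
#include <stdatomic.h>
#include <stdint.h>

#define	PGA_ENQUEUED	0x01	/* page is physically on its queue */
#define	PGA_REQUEUE	0x02	/* deferred (batched) enqueue requested */

#define	PQ_NONE		0xff
#define	PQ_ACTIVE	0x01

struct page_astate {
	uint8_t	flags;		/* PGA_* queue-state flags */
	uint8_t	queue;		/* queue index, or PQ_NONE */
};

struct page {
	_Atomic uint32_t astate;	/* packed struct page_astate */
	/* ... other fields ... */
};

static inline uint32_t
astate_pack(struct page_astate as)
{
	return (((uint32_t)as.flags << 8) | as.queue);
}

static inline struct page_astate
astate_unpack(uint32_t val)
{
	struct page_astate as;

	as.flags = (val >> 8) & 0xff;
	as.queue = val & 0xff;
	return (as);
}

/*
 * Move a page to the active queue.  The logical transition is a
 * lock-free CAS loop on the page itself; the physical queue insertion
 * is deferred and batched under the page queue lock.
 */
static void
page_activate(struct page *m)
{
	struct page_astate as;
	uint32_t old, new;

	old = atomic_load_explicit(&m->astate, memory_order_relaxed);
	do {
		as = astate_unpack(old);
		if (as.queue == PQ_ACTIVE && (as.flags & PGA_ENQUEUED) != 0)
			return;		/* already where we want it */
		as.queue = PQ_ACTIVE;
		as.flags |= PGA_REQUEUE;	/* request a deferred enqueue */
		new = astate_pack(as);
	} while (!atomic_compare_exchange_weak_explicit(&m->astate, &old,
	    new, memory_order_acq_rel, memory_order_relaxed));
}

The point is that a logical queue transition only touches the page's own state word; the physical insertion or removal is batched and performed later under the page queue lock, which is what the batch size mentioned above refers to.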
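For reference, the benchmark loop is roughly the following. This is a self-contained approximation of the will-it-scale test described above, not the actual harness (which runs the loop in N processes and reports iterations per second); the iteration count here is arbitrary.

#include <sys/mman.h>

#include <err.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

#define	MAPSIZE	(128UL * 1024 * 1024)	/* 128MB of anonymous memory */
#define	NITERS	1000

int
main(void)
{
	size_t off, pagesize;
	char *p;
	int i;

	pagesize = (size_t)sysconf(_SC_PAGESIZE);
	for (i = 0; i < NITERS; i++) {
		p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");
		/*
		 * Touch every page; each fault allocates a page and
		 * inserts it into the active queue.
		 */
		for (off = 0; off < MAPSIZE; off += pagesize)
			p[off] = 1;
		/* Unmapping removes all of those pages from the queue. */
		if (munmap(p, MAPSIZE) != 0)
			err(1, "munmap");
	}
	printf("%d iterations\n", NITERS);
	return (0);
}

Each iteration exercises exactly the paths discussed above: the faults enqueue pages into the active queue and the munmap() dequeues and frees them, so the page queue state transitions dominate the profile.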