Hi, On Thu, Jan 05, 2017 at 09:33:55AM +0800, Huang, Ying wrote: > Hi, Minchan, > > Minchan Kim <minc...@kernel.org> writes: > [snip] > > > > The patchset has used several techniqueus to reduce lock contention, for > > example, > > batching alloc/free, fine-grained lock and cluster distribution to avoid > > cache > > false-sharing. Each items has different complexity and benefits so could you > > show the number for each step of pathchset? It would be better to include > > the > > nubmer in each description. It helps how the patch is important when we > > consider > > complexitiy of the patch. > > Here is the test data.
Thanks! > > We test the vm-scalability swap-w-seq test case with 32 processes on a > Xeon E5 v3 system. The swap device used is a RAM simulated PMEM > (persistent memory) device. To test the sequential swapping out, the > test case created 32 processes, which sequentially allocate and write to > the anonymous pages until the RAM and part of the swap device is used > up. > > The patchset is rebased on v4.9-rc8. So the baseline performance is as > follow, > > "vmstat.swap.so": 1428002, What does it mean? vmstat.pswpout? > > "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": > 13.94, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": > 13.75, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": > 7.05, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": > 7.03, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": > 7.02, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": > 6.83, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": > 0.81, Numbers mean overhead percentage reported by perf? > > >> Patch 1 is a clean up patch. > > > > Could it be separated patch? > > > >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock > >> that can be used for accessing swap_map, and not lock the whole > >> swap device > > After patch 2, the result is as follow, > > "vmstat.swap.so": 1481704, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": > 27.53, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": > 27.01, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": > 1.03, > > The swap out throughput is at the same level, but the lock contention on > swap_info_struct->lock is eliminated. > > >> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing > >> the rate that we have to contende for the radix tree. > > > > After patch 3, > > "vmstat.swap.so": 2050097, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": > 43.27, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": > 4.84, > > The swap out throughput is improved about ~43% compared with baseline. > The lock contention on swap cache radix tree lock is eliminated. > swap_info_struct->lock in get_swap_page() becomes the most heavy > contended lock. The numbers are great! Please include those into each patchset. And I ask one more thing I said earlier about patch 2. "" I hope you make three steps to review easier. You can create some functions like swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock. It doesn't change anything performance pov but it clearly shows what kinds of lock we should use in specific context. Then, you can introduce more fine-graind lock in next patch and apply it into those wrapper functions. And last patch, you can adjust cluster distribution to avoid false-sharing. And the description should include how it's bad in testing so it's worth. "" It makes review more easier, I believe. > > > > >> Patch 4 eliminates unnecessary page allocation for read ahead. > > > > Could it be separated patch? > > > >> Patch 5-9 create a per cpu cache of the swap slots, so we don't have > >> to contend on the swap device to get a swap slot or to release > >> a swap slot. And we allocate and release the swap slots > >> in batches for better efficiency. > > After patch 9, > > "vmstat.swap.so": 4170746, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": > 13.91, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": > 8.56, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma": > 2.56, > > "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": > 2.47, > > The swap out throughput is improved about 192% compared with the > baseline. There are still some lock contention for > swap_info_struct->lock, but the pressure begins to shift to buddy system > now. > > Best Regards, > Huang, Ying > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>