Hi,

On Thu, Jan 05, 2017 at 09:33:55AM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim <minc...@kernel.org> writes:
> [snip]
> >
> > The patchset uses several techniques to reduce lock contention, for
> > example batching alloc/free, fine-grained locking, and cluster
> > distribution to avoid cache false sharing. Each item has different
> > complexity and benefit, so could you show the numbers for each step of
> > the patchset? It would be better to include the numbers in each patch
> > description. That helps show how important each patch is when we weigh
> > it against its complexity.
> 
> Here is the test data.

Thanks!

> 
> We test the vm-scalability swap-w-seq test case with 32 processes on a
> Xeon E5 v3 system.  The swap device used is a RAM simulated PMEM
> (persistent memory) device.  To test the sequential swapping out, the
> test case created 32 processes, which sequentially allocate and write to
> the anonymous pages until the RAM and part of the swap device is used
> up.
> 
> The patchset is rebased on v4.9-rc8.  So the baseline performance is as
> follows:
> 
>   "vmstat.swap.so": 1428002,

What does this mean? Is it vmstat.pswpout?

>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 13.94,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 13.75,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": 7.05,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": 7.03,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": 7.02,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 6.83,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": 0.81,

Do the numbers mean the overhead percentage reported by perf?

> 
> >> Patch 1 is a clean up patch.
> >
> > Could it be a separate patch?
> >
> >> Patch 2 creates a lock per cluster, which gives us a more fine-grained lock
> >>         that can be used for accessing swap_map without locking the whole
> >>         swap device.
> 
> After patch 2, the result is as follows:
> 
>   "vmstat.swap.so": 1481704,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 27.53,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 27.01,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": 1.03,
> 
> The swap out throughput is at the same level, but the lock contention on
> swap_info_struct->lock is eliminated.
> 
> >> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> >>         the rate at which we have to contend for the radix tree.
> >
> 
> After patch 3,
> 
>   "vmstat.swap.so": 2050097,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 43.27,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 4.84,
> 
> The swap out throughput is improved by about 43% compared with the baseline.
> The lock contention on the swap cache radix tree lock is eliminated.
> swap_info_struct->lock in get_swap_page() becomes the most heavily
> contended lock.
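
To make the 64MB split concrete, here is a minimal sketch of how a swap offset could map to its chunk. It assumes 4KB pages, so one 64MB chunk covers 2^14 swap slots. The names TOY_CHUNK_SHIFT, toy_chunk_index and toy_nr_chunks are made up for this illustration and are not taken from the patch.

```c
/* Assumption: 4KB pages, so a 64MB chunk holds 64MB / 4KB = 2^14 slots. */
#define TOY_PAGE_SHIFT   12
#define TOY_CHUNK_SHIFT  (26 - TOY_PAGE_SHIFT)   /* 64MB = 2^26 bytes */
#define TOY_CHUNK_SLOTS  (1UL << TOY_CHUNK_SHIFT)

/* Which chunk (and so which tree and lock) a swap offset falls into. */
static inline unsigned long toy_chunk_index(unsigned long offset)
{
	return offset >> TOY_CHUNK_SHIFT;
}

/* Number of chunks needed to cover a device with nr_slots swap slots. */
static inline unsigned long toy_nr_chunks(unsigned long nr_slots)
{
	return (nr_slots + TOY_CHUNK_SLOTS - 1) >> TOY_CHUNK_SHIFT;
}
```

With this mapping, two tasks only contend on the same tree lock when their swap offsets fall within the same 64MB range.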

The numbers are great! Please include them in the description of each patch.
I'd also like to repeat one request I made earlier about patch 2.

""
I hope you split it into three steps to make review easier. First, you can
create functions like swap_map_lock and cluster_lock, which are wrappers that
just hold swap_lock. That doesn't change anything from a performance point of
view, but it clearly shows which kind of lock we should use in each context.

Then you can introduce the more fine-grained lock in the next patch and apply
it inside those wrapper functions.

In the last patch, you can adjust the cluster distribution to avoid false sharing.
The description should include how bad the problem is in testing, to show the
change is worth it.
""

I believe that would make review easier.
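
A rough userspace sketch of the first step might look like the code below. It is only an illustration: the toy_* types stand in for spinlock_t and struct swap_info_struct, and the offset argument to cluster_lock is my assumption about what a later per-cluster version would need.

```c
#include <stdatomic.h>
#include <string.h>

/* Toy stand-in for the kernel's spinlock_t so the sketch builds in userspace. */
struct toy_spinlock { atomic_flag held; };

static inline void toy_spin_lock(struct toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		; /* spin until the holder releases the lock */
}

static inline void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical, heavily reduced stand-in for struct swap_info_struct. */
struct toy_swap_info {
	struct toy_spinlock lock;   /* the single device-wide lock */
	unsigned char swap_map[8];
};

static inline void toy_swap_info_init(struct toy_swap_info *si)
{
	atomic_flag_clear(&si->lock.held);
	memset(si->swap_map, 0, sizeof(si->swap_map));
}

/* Step 1: wrappers name the intent, but both still take the device lock. */
static inline void swap_map_lock(struct toy_swap_info *si)   { toy_spin_lock(&si->lock); }
static inline void swap_map_unlock(struct toy_swap_info *si) { toy_spin_unlock(&si->lock); }

static inline void cluster_lock(struct toy_swap_info *si, unsigned long offset)
{
	(void)offset;               /* unused until a per-cluster lock exists */
	toy_spin_lock(&si->lock);
}

static inline void cluster_unlock(struct toy_swap_info *si, unsigned long offset)
{
	(void)offset;
	toy_spin_unlock(&si->lock);
}

/* Example caller: the call site already documents which lock it needs. */
static inline unsigned char toy_swap_duplicate(struct toy_swap_info *si,
					       unsigned long offset)
{
	unsigned char count;

	cluster_lock(si, offset);
	count = ++si->swap_map[offset];
	cluster_unlock(si, offset);
	return count;
}
```

A follow-up patch would then change only the bodies of cluster_lock and cluster_unlock to take a finer-grained lock, leaving the callers untouched.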

> 
> >
> >> Patch 4 eliminates unnecessary page allocation for read ahead.
> >
> > Could it be a separate patch?
> >
> >> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
> >>         to contend on the swap device to get a swap slot or to release
> >>         a swap slot.  And we allocate and release the swap slots
> >>         in batches for better efficiency.
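
The batching idea can be sketched roughly as follows: a local cache is refilled with a whole batch under one lock acquisition, so most allocations never touch the shared lock. Everything here (the toy_* names, the batch size of 16, the counter standing in for taking the lock) is assumed for illustration and is not the patch's actual code. Only the allocation side is shown.

```c
#include <stddef.h>

#define TOY_BATCH 16

/* Hypothetical shared pool standing in for the swap device. */
struct toy_device {
	unsigned long next_free;          /* next unallocated slot */
	unsigned long nr_slots;           /* total slots on the device */
	unsigned long lock_acquisitions;  /* how often the "lock" was taken */
};

/* Hypothetical per-CPU cache of preallocated slots. */
struct toy_slot_cache {
	unsigned long slots[TOY_BATCH];
	size_t nr;
};

/* Slow path: take the device lock once and grab up to TOY_BATCH slots. */
static inline size_t toy_refill(struct toy_device *dev, struct toy_slot_cache *c)
{
	dev->lock_acquisitions++;   /* stands in for taking the device lock */
	c->nr = 0;
	while (c->nr < TOY_BATCH && dev->next_free < dev->nr_slots)
		c->slots[c->nr++] = dev->next_free++;
	return c->nr;
}

/* Fast path: serve from the local cache, refilling only when it is empty.
 * Returns 0 on success, -1 when the device has no free slots left. */
static inline int toy_get_slot(struct toy_device *dev, struct toy_slot_cache *c,
			       unsigned long *slot)
{
	if (c->nr == 0 && toy_refill(dev, c) == 0)
		return -1;
	*slot = c->slots[--c->nr];
	return 0;
}
```

In this sketch the shared lock is taken once per TOY_BATCH allocations rather than once per allocation.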
> 
> After patch 9,
> 
>   "vmstat.swap.so": 4170746,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": 13.91,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 8.56,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma": 2.56,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": 2.47,
> 
> The swap out throughput is improved by about 192% compared with the
> baseline.  There is still some lock contention on
> swap_info_struct->lock, but the pressure begins to shift to the buddy
> system now.
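
For reference, the quoted improvements can be rechecked from the vmstat.swap.so values with a small helper like the one below. The function name is made up for this sketch.

```c
/* Relative improvement of `after` over `base`, in percent. */
static inline double toy_improvement_pct(double base, double after)
{
	return (after - base) / base * 100.0;
}
```

Against the baseline of 1428002, this gives about 3.8% for 1481704 (after patch 2), about 43.6% for 2050097 (after patch 3), and about 192.1% for 4170746 (after patch 9).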
> 
> Best Regards,
> Huang, Ying
> 
