Re: [PATCH 2/3 v2] Optimize CRC32C calculation with PCLMULQDQ instruction
On Tue, 2013-02-26 at 17:54 +0800, Herbert Xu wrote:
> On Thu, Sep 27, 2012 at 03:44:22PM -0700, Tim Chen wrote:
> > This patch adds the crc_pcl function that calculates CRC32C checksum using the
> > PCLMULQDQ instruction on processors that support this feature. This will
> > provide speedup over using CRC32 instruction only.
> > The usage of PCLMULQDQ necessitate the invocation of kernel_fpu_begin and
> > kernel_fpu_end and incur some overhead.  So the new crc_pcl function is only
> > invoked for buffer size of 512 bytes or more.  Larger sized
> > buffers will expect to see greater speedup.  This feature is best used coupled
> > with eager_fpu which reduces the kernel_fpu_begin/end overhead.  For
> > buffer size of 1K the speedup is around 1.6x and for buffer size greater than
> > 4K, the speedup is around 3x compared to original implementation in crc32c-intel
> > module.  Test was performed on Sandy Bridge based platform with constant
> > frequency set for cpu.
> >
> > A white paper detailing the algorithm can be found here:
> > http://download.intel.com/design/intarch/papers/323405.pdf
> >
> > Signed-off-by: Tim Chen

Acked.

BTW, Herbert, I've also sent you a patch a few days ago to update the link
to the whitepaper on the CRC32C algorithm in the code.  Wonder if you have
received it.

Thanks.

Tim
[PATCH] Update the links to the white papers on CRC32C calculations with PCLMULQDQ instructions.
Herbert,

The following patch updates the stale link to the CRC32C white paper that
was referenced.

Tim

Signed-off-by: Tim Chen
---
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 93c6d39..f3f1e42 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -1,9 +1,10 @@
 /*
  * Implement fast CRC32C with PCLMULQDQ instructions. (x86_64)
  *
- * The white paper on CRC32C calculations with PCLMULQDQ instruction can be
+ * The white papers on CRC32C calculations with PCLMULQDQ instruction can be
  * downloaded from:
- * http://download.intel.com/design/intarch/papers/323405.pdf
+ * http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
+ * http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-paper.pdf
  *
  * Copyright (C) 2012 Intel Corporation.
  *
-- 
1.7.6.5
Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex
On Tue, 2012-08-21 at 17:48 -0700, Tim Chen wrote: > > Thanks to Matthew's suggestions on improving the patch. Here's the > updated version. It seems to be sane when I booted my machine up. I > will put it through more testing when I get a chance. > > Tim > Matthew, The new patch seems to be causing some of the workloads with mmaped file read to seg fault. Will need to dig further to find out why. Tim > --- > Signed-off-by: Tim Chen > --- > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index fd07c45..f1320b1 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -156,8 +156,11 @@ static inline void page_dup_rmap(struct page *page) > /* > * Called from mm/vmscan.c to handle paging out > */ > -int page_referenced(struct page *, int is_locked, > - struct mem_cgroup *memcg, unsigned long *vm_flags); > +int needs_page_mmap_mutex(struct page *page); > +int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, > + unsigned long *vm_flags); > +int __page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, > + unsigned long *vm_flags); > int page_referenced_one(struct page *, struct vm_area_struct *, > unsigned long address, unsigned int *mapcount, unsigned long *vm_flags); > > @@ -176,6 +179,7 @@ enum ttu_flags { > bool is_vma_temporary_stack(struct vm_area_struct *vma); > > int try_to_unmap(struct page *, enum ttu_flags flags); > +int __try_to_unmap(struct page *, enum ttu_flags flags); > int try_to_unmap_one(struct page *, struct vm_area_struct *, > unsigned long address, enum ttu_flags flags); > > diff --git a/mm/rmap.c b/mm/rmap.c > index 5b5ad58..ca8cd21 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -843,8 +843,7 @@ static int page_referenced_file(struct page *page, >* so we can safely take mapping->i_mmap_mutex. >*/ > BUG_ON(!PageLocked(page)); > - > - mutex_lock(&mapping->i_mmap_mutex); > + BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex)); > > /* >* i_mmap_mutex does not stabilize mapcount at all, but mapcount > @@ -869,21 +868,16 @@ static int page_referenced_file(struct page *page, > break; > } > > - mutex_unlock(&mapping->i_mmap_mutex); > return referenced; > } > > -/** > - * page_referenced - test if the page was referenced > - * @page: the page to test > - * @is_locked: caller holds lock on the page > - * @memcg: target memory cgroup > - * @vm_flags: collect encountered vma->vm_flags who actually referenced the > page > - * > - * Quick test_and_clear_referenced for all mappings to a page, > - * returns the number of ptes which referenced the page. > - */ > -int page_referenced(struct page *page, > +int needs_page_mmap_mutex(struct page *page) > +{ > + return page->mapping && page_mapped(page) && page_rmapping(page) && > + !PageKsm(page) && !PageAnon(page); > +} > + > +int __page_referenced(struct page *page, > int is_locked, > struct mem_cgroup *memcg, > unsigned long *vm_flags) > @@ -919,6 +913,32 @@ out: > return referenced; > } > > +/** > + * page_referenced - test if the page was referenced > + * @page: the page to test > + * @is_locked: caller holds lock on the page > + * @memcg: target memory cgroup > + * @vm_flags: collect encountered vma->vm_flags who actually referenced the > page > + * > + * Quick test_and_clear_referenced for all mappings to a page, > + * returns the number of ptes which referenced the page. 
> + */ > +int page_referenced(struct page *page, > + int is_locked, > + struct mem_cgroup *memcg, > + unsigned long *vm_flags) > +{ > + int result, needs_lock; > + > + needs_lock = needs_page_mmap_mutex(page); > + if (needs_lock) > + mutex_lock(&page->mapping->i_mmap_mutex); > + result = __page_referenced(page, is_locked, memcg, vm_flags); > + if (needs_lock) > + mutex_unlock(&page->mapping->i_mmap_mutex); > + return result; > +} > + > static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, > unsigned long address) > { > @@ -1560,7 +1580,7 @@ static int try_to_unmap_file(struct page *page, enum > ttu_flags flags) > unsigned long max_nl_size = 0; > unsigned int mapcount; > > - mutex_lock(&mapping->i_mmap_mutex)
Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex
On Tue, 2012-09-04 at 08:21 -0700, Tim Chen wrote: > On Tue, 2012-08-21 at 17:48 -0700, Tim Chen wrote: > > > > > Thanks to Matthew's suggestions on improving the patch. Here's the > > updated version. It seems to be sane when I booted my machine up. I > > will put it through more testing when I get a chance. > > > > Tim > > > > Matthew, > > The new patch seems to be causing some of the workloads with mmaped file > read to seg fault. Will need to dig further to find out why. > > Tim > Okay, the problem seems to be the code below. It is too restrictive and causes some cases where the mutex needs to be taken in try_to_unmap_file to be missed. > > +int needs_page_mmap_mutex(struct page *page) > > +{ > > + return page->mapping && page_mapped(page) && page_rmapping(page) && > > + !PageKsm(page) && !PageAnon(page); > > +} > > + Changing the check to the following fixes the problem: @@ -873,8 +873,7 @@ static int page_referenced_file(struct page *page, int needs_page_mmap_mutex(struct page *page) { - return page->mapping && page_mapped(page) && page_rmapping(page) && - !PageKsm(page) && !PageAnon(page); + return page->mapping && !PageKsm(page) && !PageAnon(page); } I'll do more testing and generate a second version of the patch set with the fixes. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3 v2] mm: Batch page reclamation under shrink_page_list
On Wed, 2012-09-12 at 12:27 -0700, Andrew Morton wrote: > > That sounds good, although more details on the performance changes > would be appreciated - after all, that's the entire point of the > patchset. > > And we shouldn't only test for improvements - we should also test for > degradation. What workloads might be harmed by this change? I'd suggest > > - a single process which opens N files and reads one page from each > one, then repeats. So there are no contiguous LRU pages which share > the same ->mapping. Get some page reclaim happening, measure the > impact. > > - The batching means that we now do multiple passes over pageframes > where we used to do things in a single pass. Walking all those new > page lists will be expensive if they are lengthy enough to cause L1 > cache evictions. I need to address both your concerns and Mel's concerns about the downside of prolonging the holding page locks for the pages to be unmmaped for patch 1 in the series. I'll try to do some testing to see what kind of benefit I get by only batching operations under the i_mmap_mutex (i.e. patch 2 and 3 only) and not do batch unmap. Those other changes don't have the downsides of prolonged page locking and we can incorporate them with less risks. > > What would be a test for this? A simple, single-threaded walk > through a file, I guess? Thanks for your test suggestions. I will do tests along your suggestions when I generate the next iterations of the patch. I've been playing with these patches for a while and they are based on 3.4 kernel. I'll move them to 3.6 kernel in my next iteration. > > Mel's review comments were useful, thanks. Very much appreciate comments from you, Mel and Minchan. I'll try to incorporate them in my changes. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
On Tue, 2012-09-11 at 12:05 +0100, Mel Gorman wrote: > > One *massive* change here that is not called out in the changelog is that > the reclaim path now holds the page lock on multiple pages at the same > time waiting for them to be batch unlocked in __remove_mapping_batch. > This is suspicious for two reasons. > > The first suspicion is that it is expected that there are filesystems > that lock multiple pages in page->index order and page reclaim tries to > lock pages in a random order. You are "ok" because you trylock the pages > but there should be a comment explaining the situation and why you're > ok. > > My *far* greater concern is that the hold time for a locked page is > now potentially much longer. You could lock a bunch of filesystem pages > and then call pageout() on an swapcache page that takes a long time to > write. This potentially causes a filesystem (or flusher threads etc) > to stall on lock_page and that could cause all sorts of latency trouble. > It will be hard to hit this bug and diagnose it but I believe it's > there. > > That second risk *really* must be commented upon and ideally reviewed by > the filesystem people. However, I very strongly suspect that the outcome > of such a review will be a suggestion to unlock the pages and reacquire > the lock in __remove_mapping_batch(). Bear in mind that if you take this > approach that you *must* use trylock when reacquiring the page lock and > handle being unable to lock the page. > Mel, Thanks for your detailed comments and analysis. If I unlock the pages, will flusher threads be the only things that will touch them? Or do I have to worry about potentially other things done to the pages that will make it invalid for me to unmap the pages later and put them on free list? Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3 v2] mm: Batch page reclamation under shrink_page_list
On Tue, 2012-09-11 at 14:36 +0900, Minchan Kim wrote:
> 
> If you send next versions, please use git-format-patch --thread style.
> Quote from man
> " If given --thread, git-format-patch will generate In-Reply-To and References
>   headers to make the second and subsequent patch mails appear as replies to the
>   first mail; this also generates a Message-Id header to reference.
> "
> 
> It helps making your all patches in this series thread type in mailbox
> so reviewers can find all patches related to a subject easily.
> 

Thanks for your detailed review of the patch series.  Will incorporate your
suggestions on the next version.

Tim
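For illustration, a series prepared the way Minchan suggests can be generated
with something along the lines of the command below (the branch and output
directory names are only placeholders, not taken from this thread):

	git format-patch --thread --cover-letter -o outgoing/ v3.6..batch-reclaim

With --thread, every mail after the first carries In-Reply-To and References
headers pointing back at the first one, so the whole series shows up as a
single thread in reviewers' mailboxes.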
Re: [PATCH 0/3 v2] Optimize CRC32C calculation using PCLMULQDQ in crc32c-intel module
On Fri, 2012-09-28 at 10:57 +0800, Herbert Xu wrote:
> On 2012-9-28 10:54 AM, "H. Peter Anvin" wrote:
> >
> > On 09/27/2012 03:44 PM, Tim Chen wrote:
> >>
> >> Version 2
> >> This version of the patch series fixes compilation errors for
> >> 32 bit x86 targets.
> >>
> >> Version 1
> >> This patch series optimized CRC32C calculations with PCLMULQDQ
> >> instruction for crc32c-intel module.  It speeds up the original
> >> implementation by 1.6x for 1K buffer and by 3x for buffer 4k or
> >> more.  The tcrypt module was enhanced for doing speed test
> >> on crc32c calculations.
> >>
> >
> > Herbert - you are handling this one, right?
> >
> Yes thanks
> 

Hi Herbert,

Are there any other changes I should do on my end?  Is the patchset ready to
be picked up in crypto-dev?

Thanks.

Tim
[PATCH 1/3] Rename crc32c-intel.c to crc32c-intel_glue.c
This patch renames the crc32c-intel.c file to crc32c-intel_glue.c in
preparation for linking with the new crc32c-pcl-intel-asm.S file, which
contains an optimized crc32c calculation based on the PCLMULQDQ instruction.

Tim

Signed-off-by: Tim Chen
---
 arch/x86/crypto/Makefile                           |    1 +
 .../crypto/{crc32c-intel.c => crc32c-intel_glue.c} |    0
 2 files changed, 1 insertions(+), 0 deletions(-)
 rename arch/x86/crypto/{crc32c-intel.c => crc32c-intel_glue.c} (100%)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index e908e5d..edd2268 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,3 +43,4 @@ serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
+crc32c-intel-y := crc32c-intel_glue.o
diff --git a/arch/x86/crypto/crc32c-intel.c b/arch/x86/crypto/crc32c-intel_glue.c
similarity index 100%
rename from arch/x86/crypto/crc32c-intel.c
rename to arch/x86/crypto/crc32c-intel_glue.c
-- 
1.7.7.6
[PATCH 2/3] Optimize CRC32C calculation with PCLMULQDQ instruction
This patch adds the crc_pcl function that calculates CRC32C checksum using the PCLMULQDQ instruction on processors that support this feature. This will provide speedup over using CRC32 instruction only. The usage of PCLMULQDQ necessitate the invocation of kernel_fpu_begin and kernel_fpu_end and incur some overhead. So the new crc_pcl function is only invoked for buffer size of 512 bytes or more. Larger sized buffers will expect to see greater speedup. This feature is best used coupled with eager_fpu which reduces the kernel_fpu_begin/end overhead. For buffer size of 1K the speedup is around 1.6x and for buffer size greater than 4K, the speedup is around 3x compared to original implementation in crc32c-intel module. Test was performed on Sandy Bridge based platform with constant frequency set for cpu. A white paper detailing the algorithm can be found here: http://download.intel.com/design/intarch/papers/323405.pdf Tim Signed-off-by: Tim Chen --- arch/x86/crypto/Makefile |2 +- arch/x86/crypto/crc32c-intel_glue.c| 75 + arch/x86/crypto/crc32c-pcl-intel-asm.S | 460 3 files changed, 536 insertions(+), 1 deletions(-) create mode 100644 arch/x86/crypto/crc32c-pcl-intel-asm.S diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index edd2268..0babdcb 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -43,4 +43,4 @@ serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o -crc32c-intel-y := crc32c-intel_glue.o +crc32c-intel-y := crc32c-pcl-intel-asm.o crc32c-intel_glue.o diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c index 493f959..432af71 100644 --- a/arch/x86/crypto/crc32c-intel_glue.c +++ b/arch/x86/crypto/crc32c-intel_glue.c @@ -32,9 +32,31 @@ #include #include +#include +#include #define CHKSUM_BLOCK_SIZE 1 #define CHKSUM_DIGEST_SIZE 4 +/* + * use carryless multiply version of crc32c when buffer + * size is >= 512 (when eager fpu is enabled) or + * >= 1024 (when eager fpu is disabled) to account + * for fpu state save/restore overhead. 
+ */ +#define CRC32C_PCL_BREAKEVEN_EAGERFPU 512 +#define CRC32C_PCL_BREAKEVEN_NOEAGERFPU1024 + +static int crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_EAGERFPU; +#if defined(X86_FEATURE_EAGER_FPU) +#define set_pcl_breakeven_point() \ +do { \ + if (!use_eager_fpu()) \ + crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU; \ +} while (0) +#else +#define set_pcl_breakeven_point() \ + (crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU) +#endif #define SCALE_Fsizeof(unsigned long) @@ -44,6 +66,9 @@ #define REX_PRE #endif +asmlinkage unsigned int crc_pcl(const u8 *buffer, int len, + unsigned int crc_init); + static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t length) { while (length--) { @@ -117,6 +142,24 @@ static int crc32c_intel_update(struct shash_desc *desc, const u8 *data, return 0; } +static int crc32c_pcl_intel_update(struct shash_desc *desc, const u8 *data, + unsigned int len) +{ + u32 *crcp = shash_desc_ctx(desc); + + /* +* use faster PCL version if datasize is large enough to +* overcome kernel fpu state save/restore overhead +*/ + if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) { + kernel_fpu_begin(); + *crcp = crc_pcl(data, len, *crcp); + kernel_fpu_end(); + } else + *crcp = crc32c_intel_le_hw(*crcp, data, len); + return 0; +} + static int __crc32c_intel_finup(u32 *crcp, const u8 *data, unsigned int len, u8 *out) { @@ -124,12 +167,31 @@ static int __crc32c_intel_finup(u32 *crcp, const u8 *data, unsigned int len, return 0; } +static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len, + u8 *out) +{ + if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) { + kernel_fpu_begin(); + *(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp)); + kernel_fpu_end(); + } else + *(__le32 *)out = + ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len)); + return 0; +} + static int crc32c_intel_finup(struct shash_desc *desc, const u8 *data, unsigned int len, u8 *out) { return __crc32c_intel_finup(shash_de
[PATCH 3/3] Added speed test in tcrypt for crc32c
This patch adds a test case in tcrypt to perform a speed test for crc32c
checksum calculation.

Tim

Signed-off-by: Tim Chen
---
 crypto/tcrypt.c |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 581081d..6deb77f 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1437,6 +1437,10 @@ static int do_test(int m)
 		test_hash_speed("ghash-generic", sec, hash_speed_template_16);
 		if (mode > 300 && mode < 400) break;
 
+	case 319:
+		test_hash_speed("crc32c", sec, generic_hash_speed_template);
+		if (mode > 300 && mode < 400) break;
+
 	case 399:
 		break;
 
-- 
1.7.7.6
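As a usage note, tcrypt runs the selected tests at module load time and takes
the test number and measurement interval as module parameters, so the new
case above can be exercised with something along the lines of:

	modprobe tcrypt mode=319 sec=1

The timings are reported in the kernel log; tcrypt deliberately does not stay
loaded after running its tests, so an error returned by modprobe afterwards
is expected.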
[PATCH 0/3] Optimize CRC32C calculation using PCLMULQDQ in crc32c-intel module
This patch series optimizes CRC32C calculations with the PCLMULQDQ
instruction for the crc32c-intel module.  It speeds up the original
implementation by 1.6x for a 1K buffer and by 3x for buffers of 4k or more.
The tcrypt module was enhanced for doing speed tests on crc32c calculations.

Tim

Signed-off-by: Tim Chen
---

Tim Chen (3):
  Rename crc32c-intel.c to crc32c-intel_glue.c
  Optimize CRC32C calculation with PCLMULQDQ instruction
  Added speed test in tcrypt for crc32c

 arch/x86/crypto/Makefile                           |    1 +
 .../crypto/{crc32c-intel.c => crc32c-intel_glue.c} |   75
 arch/x86/crypto/crc32c-pcl-intel-asm.S             |  460
 crypto/tcrypt.c                                    |    4 +
 4 files changed, 540 insertions(+), 0 deletions(-)
 rename arch/x86/crypto/{crc32c-intel.c => crc32c-intel_glue.c} (70%)
 create mode 100644 arch/x86/crypto/crc32c-pcl-intel-asm.S

-- 
1.7.7.6
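For completeness, kernel users do not call crc_pcl() or the crc32c-intel
driver directly; they go through the crypto API and ask for "crc32c", which
resolves to the highest-priority registered implementation.  A rough sketch
of such a caller follows; the function name is made up, error handling is
trimmed, and this is just the usual shash boilerplate of that era, not code
from this series:

	#include <crypto/hash.h>
	#include <linux/err.h>
	#include <linux/slab.h>

	static u32 example_crc32c(const u8 *data, unsigned int len)
	{
		struct crypto_shash *tfm;
		struct shash_desc *desc;
		u32 crc = 0;

		tfm = crypto_alloc_shash("crc32c", 0, 0);
		if (IS_ERR(tfm))
			return 0;

		/* shash needs a descriptor with room for the transform state */
		desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm),
			       GFP_KERNEL);
		if (desc) {
			desc->tfm = tfm;
			desc->flags = 0;
			/* init/update/final in one shot */
			crypto_shash_digest(desc, data, len, (u8 *)&crc);
			kfree(desc);
		}

		crypto_free_shash(tfm);
		return crc;
	}

Both the generic crc32c and crc32c-intel register under the same "crc32c"
name; the crypto core picks the one with the higher cra_priority, which is
how existing callers transparently benefit from this series.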
[PATCH] Avoid useless inodes and dentries reclamation
This patch detects when the counts of free inodes and dentries are really low
and skips their reclamation, so that we do not have to contend on the global
sb_lock uselessly under memory pressure.  Otherwise we create a log jam
trying to acquire the sb_lock in prune_super(), with little or no freed
memory to show for the effort.

The profile below shows a multi-threaded large file read exerting pressure on
memory with page cache usage.  It is dominated by the sb_lock contention in
the cpu cycles profile.  The patch eliminates the sb_lock contention almost
entirely for prune_super().

    43.94%    usemem  [kernel.kallsyms]  [k] _raw_spin_lock
              |
              --- _raw_spin_lock
                 |
                 |--32.44%-- grab_super_passive
                 |          prune_super
                 |          shrink_slab
                 |          do_try_to_free_pages
                 |          try_to_free_pages
                 |          __alloc_pages_nodemask
                 |          alloc_pages_current
                 |
                 |--32.18%-- put_super
                 |          drop_super
                 |          prune_super
                 |          shrink_slab
                 |          do_try_to_free_pages
                 |          try_to_free_pages
                 |          __alloc_pages_nodemask
                 |          alloc_pages_current

Signed-off-by: Tim Chen
---
 fs/super.c | 8
 1 file changed, 8 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 68307c0..70fa26c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
@@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
 		return -1;
 
+	/*
+	 * Don't prune if we have few cached objects to reclaim to
+	 * avoid useless sb_lock contention
+	 */
+	if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
+		return -1;
+
 	if (!grab_super_passive(sb))
 		return -1;
 
-- 
1.7.11.7
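For context, prune_super() is the per-superblock shrinker callback that
shrink_slab() invokes, as the call chains in the profile above show.  Below
is a bare-bones sketch of the shrinker interface of this era that such a
callback implements, with made-up helper names, purely to illustrate the
calling convention that makes the early return of -1 above cheap and safe:

	#include <linux/gfp.h>
	#include <linux/shrinker.h>

	/* Hypothetical cache-specific helpers, named only for illustration. */
	static int example_count_objects(void);
	static void example_prune_objects(unsigned long nr);

	static int example_shrink(struct shrinker *shrink,
				  struct shrink_control *sc)
	{
		/* Allocation context that cannot recurse into the FS: back off. */
		if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
			return -1;

		/* nr_to_scan == 0 means "just report how much could be freed". */
		if (sc->nr_to_scan)
			example_prune_objects(sc->nr_to_scan);

		return example_count_objects();
	}

	static struct shrinker example_shrinker = {
		.shrink	= example_shrink,
		.seeks	= DEFAULT_SEEKS,
	};

	/* Registered once with register_shrinker(&example_shrinker); the VM
	 * then calls ->shrink under memory pressure, which is how
	 * prune_super() ends up on the shrink_slab() path seen above. */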
Re: [PATCH] Avoid useless inodes and dentries reclamation
On Thu, 2013-08-29 at 00:19 +0300, Kirill A. Shutemov wrote: > > --- > > fs/super.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/fs/super.c b/fs/super.c > > index 68307c0..70fa26c 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = { > > * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we > > * take a passive reference to the superblock to avoid this from occurring. > > */ > > +#define SB_CACHE_LOW 5 > > static int prune_super(struct shrinker *shrink, struct shrink_control *sc) > > { > > struct super_block *sb; > > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct > > shrink_control *sc) > > if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS)) > > return -1; > > > > + /* > > +* Don't prune if we have few cached objects to reclaim to > > +* avoid useless sb_lock contention > > +*/ > > + if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW) > > + return -1; > > I don't think it's correct: you don't account fs_objects here and > prune_icache_sb() calls invalidate_mapping_pages() which can free a lot of > memory. It's too naive approach. You can miss a memory hog easily this > way. Is it safe to compute sb->s_op->nr_cached_objects(sb), assuming non null s_op without holding sb_lock to increment ref count on sb? I think it is safe as we hold the shrinker_rwsem so we cannot unregister the shrinker and the s_op and sb structure should still be there. However, I'm not totally sure. Tim Signed-off-by: Tim Chen --- diff --git a/fs/super.c b/fs/super.c index 68307c0..173d0d9 100644 --- a/fs/super.c +++ b/fs/super.c @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = { * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we * take a passive reference to the superblock to avoid this from occurring. */ +#define SB_CACHE_LOW 5 static int prune_super(struct shrinker *shrink, struct shrink_control *sc) { struct super_block *sb; @@ -68,6 +69,17 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc) if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS)) return -1; + total_objects = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused; + if (sb->s_op && sb->s_op->nr_cached_objects) + total_objects += sb->s_op->nr_cached_objects(sb); + + /* +* Don't prune if we have few cached objects to reclaim to +* avoid useless sb_lock contention +*/ + if (total_objects <= SB_CACHE_LOW) + return -1; + if (!grab_super_passive(sb)) return -1; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Mon, 2013-09-30 at 12:36 -0400, Waiman Long wrote: > On 09/30/2013 12:10 PM, Jason Low wrote: > > On Mon, 2013-09-30 at 11:51 -0400, Waiman Long wrote: > >> On 09/28/2013 12:34 AM, Jason Low wrote: > Also, below is what the mcs_spin_lock() and mcs_spin_unlock() > functions would look like after applying the proposed changes. > > static noinline > void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node > *node) > { > struct mcs_spin_node *prev; > > /* Init node */ > node->locked = 0; > node->next = NULL; > > prev = xchg(lock, node); > if (likely(prev == NULL)) { > /* Lock acquired. No need to set node->locked since it > won't be used */ > return; > } > ACCESS_ONCE(prev->next) = node; > /* Wait until the lock holder passes the lock down */ > while (!ACCESS_ONCE(node->locked)) > arch_mutex_cpu_relax(); > smp_mb(); > >> I wonder if a memory barrier is really needed here. > > If the compiler can reorder the while (!ACCESS_ONCE(node->locked)) check > > so that the check occurs after an instruction in the critical section, > > then the barrier may be necessary. > > > > In that case, just a barrier() call should be enough. The cpu could still be executing out of order load instruction from the critical section before checking node->locked? Probably smp_mb() is still needed. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Tue, 2013-10-01 at 16:01 -0400, Waiman Long wrote: > On 10/01/2013 12:48 PM, Tim Chen wrote: > > On Mon, 2013-09-30 at 12:36 -0400, Waiman Long wrote: > >> On 09/30/2013 12:10 PM, Jason Low wrote: > >>> On Mon, 2013-09-30 at 11:51 -0400, Waiman Long wrote: > >>>> On 09/28/2013 12:34 AM, Jason Low wrote: > >>>>>> Also, below is what the mcs_spin_lock() and mcs_spin_unlock() > >>>>>> functions would look like after applying the proposed changes. > >>>>>> > >>>>>> static noinline > >>>>>> void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node > >>>>>> *node) > >>>>>> { > >>>>>>struct mcs_spin_node *prev; > >>>>>> > >>>>>>/* Init node */ > >>>>>>node->locked = 0; > >>>>>>node->next = NULL; > >>>>>> > >>>>>>prev = xchg(lock, node); > >>>>>>if (likely(prev == NULL)) { > >>>>>>/* Lock acquired. No need to set node->locked since > >>>>>> it > >>>>>> won't be used */ > >>>>>>return; > >>>>>>} > >>>>>>ACCESS_ONCE(prev->next) = node; > >>>>>>/* Wait until the lock holder passes the lock down */ > >>>>>>while (!ACCESS_ONCE(node->locked)) > >>>>>>arch_mutex_cpu_relax(); > >>>>>>smp_mb(); > >>>> I wonder if a memory barrier is really needed here. > >>> If the compiler can reorder the while (!ACCESS_ONCE(node->locked)) check > >>> so that the check occurs after an instruction in the critical section, > >>> then the barrier may be necessary. > >>> > >> In that case, just a barrier() call should be enough. > > The cpu could still be executing out of order load instruction from the > > critical section before checking node->locked? Probably smp_mb() is > > still needed. > > > > Tim > > But this is the lock function, a barrier() call should be enough to > prevent the critical section from creeping up there. We certainly need > some kind of memory barrier at the end of the unlock function. I may be missing something. My understanding is that barrier only prevents the compiler from rearranging instructions, but not for cpu out of order execution (as in smp_mb). So cpu could read memory in the next critical section, before node->locked is true, (i.e. unlock has been completed). If we only have a simple barrier at end of mcs_lock, then say the code on CPU1 is mcs_lock x = 1; ... x = 2; mcs_unlock and CPU 2 is mcs_lock y = x; ... mcs_unlock We expect y to be 2 after the "y = x" assignment. But we we may execute the code as CPU1CPU2 x = 1; ... y = x; ( y=1, out of order load) x = 2 mcs_unlock Check node->locked==true continue executing critical section (y=1 when we expect y=2) So we get y to be 1 when we expect that it should be 2. Adding smp_mb after the node->locked check in lock code ACCESS_ONCE(prev->next) = node; /* Wait until the lock holder passes the lock down */ while (!ACCESS_ONCE(node->locked)) arch_mutex_cpu_relax(); smp_mb(); should prevent this scenario. Thanks. Tim > > -Longman > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Tue, 2013-10-01 at 21:25 -0400, Waiman Long wrote: > On 10/01/2013 05:16 PM, Tim Chen wrote: > > On Tue, 2013-10-01 at 16:01 -0400, Waiman Long wrote: > >>> > >>> The cpu could still be executing out of order load instruction from the > >>> critical section before checking node->locked? Probably smp_mb() is > >>> still needed. > >>> > >>> Tim > >> But this is the lock function, a barrier() call should be enough to > >> prevent the critical section from creeping up there. We certainly need > >> some kind of memory barrier at the end of the unlock function. > > I may be missing something. My understanding is that barrier only > > prevents the compiler from rearranging instructions, but not for cpu out > > of order execution (as in smp_mb). So cpu could read memory in the next > > critical section, before node->locked is true, (i.e. unlock has been > > completed). If we only have a simple barrier at end of mcs_lock, then > > say the code on CPU1 is > > > > mcs_lock > > x = 1; > > ... > > x = 2; > > mcs_unlock > > > > and CPU 2 is > > > > mcs_lock > > y = x; > > ... > > mcs_unlock > > > > We expect y to be 2 after the "y = x" assignment. But we > > we may execute the code as > > > > CPU1CPU2 > > > > x = 1; > > ... y = x; ( y=1, out of order load) > > x = 2 > > mcs_unlock > > Check node->locked==true > > continue executing critical section (y=1 when we expect > > y=2) > > > > So we get y to be 1 when we expect that it should be 2. Adding smp_mb > > after the node->locked check in lock code > > > > ACCESS_ONCE(prev->next) = node; > > /* Wait until the lock holder passes the lock down */ > > while (!ACCESS_ONCE(node->locked)) > > arch_mutex_cpu_relax(); > > smp_mb(); > > > > should prevent this scenario. > > > > Thanks. > > Tim > > If the lock and unlock functions are done right, there should be no > overlap of critical section. So it is job of the lock/unlock functions > to make sure that critical section code won't leak out. There should be > some kind of memory barrier at the beginning of the lock function and > the end of the unlock function. > > The critical section also likely to have branches. The CPU may > speculatively execute code on the 2 branches, but one of them will be > discarded once the branch condition is known. Also > arch_mutex_cpu_relax() is a compiler barrier by itself. So we may not > need a barrier() after all. The while statement is a branch instruction, > any code after that can only be speculatively executed and cannot be > committed until the branch is done. But the condition code may be checked after speculative execution? The condition may not be true during speculative execution and only turns true when we check the condition, and take that branch? The thing that bothers me is without memory barrier after the while statement, we could speculatively execute before affirming the lock is in acquired state. Then when we check the lock, the lock is set to acquired state in the mean time. We could be loading some memory entry *before* the node->locked has been set true. I think a smp_rmb (if not a smp_mb) should be set after the while statement. At first I was also thinking that the memory barrier is not necessary but Paul convinced me otherwise in a previous email. https://lkml.org/lkml/2013/9/27/523 > > In x86, the smp_mb() function translated to a mfence instruction which > cost time. That is why I try to get rid of it if it is not necessary. > I also hope that the memory barrier is not necessary and I am missing something obvious. 
But I haven't been able to persuade myself.

Tim
[PATCH v8 7/9] MCS Lock: Barrier corrections
This patch corrects the way memory barriers are used in the MCS lock and removes ones that are not needed. Also add comments on all barriers. Signed-off-by: Jason Low --- include/linux/mcs_spinlock.h | 13 +++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h index 96f14299..93d445d 100644 --- a/include/linux/mcs_spinlock.h +++ b/include/linux/mcs_spinlock.h @@ -36,16 +36,19 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) node->locked = 0; node->next = NULL; + /* xchg() provides a memory barrier */ prev = xchg(lock, node); if (likely(prev == NULL)) { /* Lock acquired */ return; } ACCESS_ONCE(prev->next) = node; - smp_wmb(); /* Wait until the lock holder passes the lock down */ while (!ACCESS_ONCE(node->locked)) arch_mutex_cpu_relax(); + + /* Make sure subsequent operations happen after the lock is acquired */ + smp_rmb(); } /* @@ -58,6 +61,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod if (likely(!next)) { /* +* cmpxchg() provides a memory barrier. * Release the lock by setting it to NULL */ if (likely(cmpxchg(lock, node, NULL) == node)) @@ -65,9 +69,14 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod /* Wait until the next pointer is set */ while (!(next = ACCESS_ONCE(node->next))) arch_mutex_cpu_relax(); + } else { + /* +* Make sure all operations within the critical section +* happen before the lock is released. +*/ + smp_wmb(); } ACCESS_ONCE(next->locked) = 1; - smp_wmb(); } #endif /* __LINUX_MCS_SPINLOCK_H */ -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
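For reference, the intended calling pattern for this API is that each
contender supplies its own queue node, usually on its stack, and hands the
same node back at unlock time, roughly as follows (the lock variable name is
only a placeholder):

	struct mcs_spinlock *some_lock;	/* shared tail pointer, NULL when unlocked */
	struct mcs_spinlock node;	/* this CPU's local queue node */

	mcs_spin_lock(&some_lock, &node);
	/* critical section */
	mcs_spin_unlock(&some_lock, &node);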
[PATCH v8 1/9] rwsem: check the lock before cmpxchg in down_write_trylock
Cmpxchg causes cacheline bouncing when doing the value checking, and that
causes a scalability issue on a large machine (like an 80 core box).  So a
pre-read of the lock can relieve this contention.

Signed-off-by: Alex Shi
---
 include/asm-generic/rwsem.h |    8
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+		return 0;
 
-	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
-	return tmp == RWSEM_UNLOCKED_VALUE;
+	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+		       RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4
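The same check-before-cmpxchg idea can be sketched with plain C11 atomics to
make the reasoning concrete: the early relaxed load fails cheaply while the
cache line stays in a shared state, and only a caller that actually sees the
lock free pays for the exclusive cache-line access that cmpxchg requires.
This is only an illustration of the pattern, with made-up names and values,
not kernel code:

	#include <stdatomic.h>
	#include <stdbool.h>

	#define UNLOCKED	0L
	#define WRITE_BIAS	(-1L)

	/* Try-lock in the spirit of __down_write_trylock() above: read first,
	 * attempt the atomic update only when the lock looks free. */
	static bool try_write_lock(_Atomic long *count)
	{
		long expected = UNLOCKED;

		/* A relaxed load keeps the cache line shared, so contenders
		 * that see "busy" never dirty it. */
		if (atomic_load_explicit(count, memory_order_relaxed) != UNLOCKED)
			return false;

		/* Only a likely winner pays for the exclusive access. */
		return atomic_compare_exchange_strong(count, &expected, WRITE_BIAS);
	}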
[PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
We will need the MCS lock code for doing optimistic spinning for rwsem. Extracting the MCS code from mutex.c and put into its own file allow us to reuse this code easily for rwsem. Reviewed-by: Ingo Molnar Reviewed-by: Peter Zijlstra Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/mcs_spinlock.h | 64 ++ include/linux/mutex.h|5 ++- kernel/mutex.c | 60 -- 3 files changed, 74 insertions(+), 55 deletions(-) create mode 100644 include/linux/mcs_spinlock.h diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h new file mode 100644 index 000..b5de3b0 --- /dev/null +++ b/include/linux/mcs_spinlock.h @@ -0,0 +1,64 @@ +/* + * MCS lock defines + * + * This file contains the main data structure and API definitions of MCS lock. + * + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock + * with the desirable properties of being fair, and with each cpu trying + * to acquire the lock spinning on a local variable. + * It avoids expensive cache bouncings that common test-and-set spin-lock + * implementations incur. + */ +#ifndef __LINUX_MCS_SPINLOCK_H +#define __LINUX_MCS_SPINLOCK_H + +struct mcs_spinlock { + struct mcs_spinlock *next; + int locked; /* 1 if lock acquired */ +}; + +/* + * We don't inline mcs_spin_lock() so that perf can correctly account for the + * time spent in this lock function. + */ +static noinline +void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) +{ + struct mcs_spinlock *prev; + + /* Init node */ + node->locked = 0; + node->next = NULL; + + prev = xchg(lock, node); + if (likely(prev == NULL)) { + /* Lock acquired */ + node->locked = 1; + return; + } + ACCESS_ONCE(prev->next) = node; + smp_wmb(); + /* Wait until the lock holder passes the lock down */ + while (!ACCESS_ONCE(node->locked)) + arch_mutex_cpu_relax(); +} + +static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node) +{ + struct mcs_spinlock *next = ACCESS_ONCE(node->next); + + if (likely(!next)) { + /* +* Release the lock by setting it to NULL +*/ + if (cmpxchg(lock, node, NULL) == node) + return; + /* Wait until the next pointer is set */ + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + } + ACCESS_ONCE(next->locked) = 1; + smp_wmb(); +} + +#endif /* __LINUX_MCS_SPINLOCK_H */ diff --git a/include/linux/mutex.h b/include/linux/mutex.h index ccd4260..e6eaeea 100644 --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -46,6 +46,7 @@ * - detects multi-task circular deadlocks and prints out all affected * locks and tasks (and only those tasks) */ +struct mcs_spinlock; struct mutex { /* 1: unlocked, 0: locked, negative: locked, possible waiters */ atomic_tcount; @@ -55,7 +56,7 @@ struct mutex { struct task_struct *owner; #endif #ifdef CONFIG_MUTEX_SPIN_ON_OWNER - void*spin_mlock;/* Spinner MCS lock */ + struct mcs_spinlock *mcs_lock; /* Spinner MCS lock */ #endif #ifdef CONFIG_DEBUG_MUTEXES const char *name; @@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); #define arch_mutex_cpu_relax() cpu_relax() #endif -#endif +#endif /* __LINUX_MUTEX_H */ diff --git a/kernel/mutex.c b/kernel/mutex.c index 6d647ae..4640731 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -25,6 +25,7 @@ #include #include #include +#include /* * In the DEBUG case we are using the "NULL fastpath" for mutexes, @@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key) INIT_LIST_HEAD(&lock->wait_list); mutex_clear_owner(lock); #ifdef 
CONFIG_MUTEX_SPIN_ON_OWNER - lock->spin_mlock = NULL; + lock->mcs_lock = NULL; #endif debug_mutex_init(lock, name, key); @@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock); * more or less simultaneously, the spinners need to acquire a MCS lock * first before spinning on the owner field. * - * We don't inline mspin_lock() so that perf can correctly account for the - * time spent in this lock function. */ -struct mspin_node { - struct mspin_node *next ; - int locked; /* 1 if lock acquired */ -}; -#defineMLOCK(mutex)((struct mspin_node **)&((mutex)->spin_mlock)) - -static noinline -void mspin_lock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *prev; - - /* Init node */ - node->
[PATCH v8 4/9] rwsem/wake: check lock before do atomic update
Doing the atomic update of the lock and then rolling it back causes cache
bouncing on a large machine.  A pre-read of the lock status can relieve this
problem.

Suggested-by: Davidlohr Bueso
Suggested-by: Tim Chen
Signed-off-by: Alex Shi
---
 lib/rwsem.c |    8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
 	struct list_head *next;
-	long oldcount, woken, loop, adjustment;
+	long woken, loop, adjustment;
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
 		while (1) {
+			long oldcount;
+
+			/* A writer stole the lock. */
+			if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+				return sem;
+
 			oldcount = rwsem_atomic_update(adjustment, sem)
 								- adjustment;
 			if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4
[PATCH v8 3/9] rwsem: remove try_reader_grant label do_wake
That makes the code simpler and more readable.

Signed-off-by: Alex Shi
---
 lib/rwsem.c | 12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	adjustment = 0;
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/* A writer stole the lock. Undo our reader grant. */
+		while (1) {
+			oldcount = rwsem_atomic_update(adjustment, sem)
+								- adjustment;
+			if (likely(oldcount >= RWSEM_WAITING_BIAS))
+				break;
+
+			/* A writer stole the lock. Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
 				return sem;
 			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
 		}
 	}
 
-- 
1.7.4.4
[PATCH v8 0/9] rwsem performance optimizations
For version 8 of the patchset, we included the patch from Waiman to streamline wakeup operations and also optimize the MCS lock used in rwsem and mutex. In this patchset, we introduce three categories of optimizations to read write semaphore. The first four patches from Alex Shi reduce cache bouncing of the sem->count field by doing a pre-read of the sem->count and avoid cmpxchg if possible. The next four patches from Tim, Davidlohr and Jason introduce optimistic spinning logic similar to that in the mutex code for the writer lock acquisition of rwsem. This addresses the general 'mutexes out perform writer-rwsems' situations that has been seen in more than one case. Users now need not worry about performance issues when choosing between these two locking mechanisms. We have also factored out the MCS lock originally in the mutex code into its own file, and performed micro optimizations and corrected the memory barriers so it could be used for general lock/unlock of critical sections. The last patch from Waiman help to streamline the wake up operation by avoiding multiple threads all doing wakeup operations when only one wakeup thread is enough. This significantly reduced lock contentions from multiple wakeup threads. Tim got the following improvement for exim mail server workload on 40 core system: Alex+Tim's patchset: +4.8% Alex+Tim+Waiman's patchset:+5.3% Without these optimizations, Davidlohr Bueso saw a -8% regression to aim7's shared and high_systime workloads when he switched i_mmap_mutex to rwsem. Tests were on 8 socket 80 cores system. With Alex and Tim's patches, he got significant improvements to the aim7 suite instead of regressions: alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%), shared (+18.4%) and short (+6.3%). More Aim7 numbers will be posted when Davidlohr has a chance to test the complete patchset including Waiman's patch. Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney for helping to review this patchset. Tim Changelog: v8: 1. Added Waiman's patch to avoid multiple wakeup thread lock contention. 2. Micro-optimizations of MCS lock. 3. Correct the barriers of MCS lock to prevent critical sections from leaking. v7: 1. Rename mcslock.h to mcs_spinlock.h and also rename mcs related fields with mcs prefix. 2. Properly define type of *mcs_lock field instead of leaving it as *void. 3. Added breif explanation of mcs lock. v6: 1. Fix missing mcslock.h file. 2. Fix various code style issues. v5: 1. Try optimistic spinning before we put the writer on the wait queue to avoid bottlenecking at wait queue. This provides 5% boost to exim workload and between 2% to 8% boost to aim7. 2. Put MCS locking code into its own mcslock.h file for better reuse between mutex.c and rwsem.c 3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the operations default per Ingo's suggestions. v4: 1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner 2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option v3: 1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner. 2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig v2: 1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed a bug referencing &sem->count when sem->count is intended. 2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner. the option to be on for more seasoning but can be turned off should it be detrimental. 3. 
Various patch comments update Alex Shi (4): rwsem: check the lock before cpmxchg in down_write_trylock rwsem: remove 'out' label in do_wake rwsem: remove try_reader_grant label do_wake rwsem/wake: check lock before do atomic update Jason Low (2): MCS Lock: optimizations and extra comments MCS Lock: Barrier corrections Tim Chen (2): MCS Lock: Restructure the MCS lock defines and locking code into its own file rwsem: do optimistic spinning for writer lock acquisition Waiman Long (1): rwsem: reduce spinlock contention in wakeup code path include/asm-generic/rwsem.h |8 +- include/linux/mcs_spinlock.h | 82 ++ include/linux/mutex.h|5 +- include/linux/rwsem.h|9 ++- kernel/mutex.c | 60 +- kernel/rwsem.c | 19 +++- lib/rwsem.c | 255 +- 7 files changed, 349 insertions(+), 89 deletions(-) create mode 100644 include/linux/mcs_spinlock.h -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v8 6/9] MCS Lock: optimizations and extra comments
Remove unnecessary operation and make the cmpxchg(lock, node, NULL) == node check in mcs_spin_unlock() likely() as it is likely that a race did not occur most of the time. Also add in more comments describing how the local node is used in MCS locks. Signed-off-by: Jason Low --- include/linux/mcs_spinlock.h | 13 +++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h index b5de3b0..96f14299 100644 --- a/include/linux/mcs_spinlock.h +++ b/include/linux/mcs_spinlock.h @@ -18,6 +18,12 @@ struct mcs_spinlock { }; /* + * In order to acquire the lock, the caller should declare a local node and + * pass a reference of the node to this function in addition to the lock. + * If the lock has already been acquired, then this will proceed to spin + * on this node->locked until the previous lock holder sets the node->locked + * in mcs_spin_unlock(). + * * We don't inline mcs_spin_lock() so that perf can correctly account for the * time spent in this lock function. */ @@ -33,7 +39,6 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) prev = xchg(lock, node); if (likely(prev == NULL)) { /* Lock acquired */ - node->locked = 1; return; } ACCESS_ONCE(prev->next) = node; @@ -43,6 +48,10 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) arch_mutex_cpu_relax(); } +/* + * Releases the lock. The caller should pass in the corresponding node that + * was used to acquire the lock. + */ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node) { struct mcs_spinlock *next = ACCESS_ONCE(node->next); @@ -51,7 +60,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *nod /* * Release the lock by setting it to NULL */ - if (cmpxchg(lock, node, NULL) == node) + if (likely(cmpxchg(lock, node, NULL) == node)) return; /* Wait until the next pointer is set */ while (!(next = ACCESS_ONCE(node->next))) -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v8 8/9] rwsem: do optimistic spinning for writer lock acquisition
We want to add optimistic spinning to rwsems because the writer rwsem does not perform as well as mutexes. Tim noticed that for exim (mail server) workloads, when reverting commit 4fc3f1d6 and Davidlohr noticed it when converting the i_mmap_mutex to a rwsem in some aim7 workloads. We've noticed that the biggest difference is when we fail to acquire a mutex in the fastpath, optimistic spinning comes in to play and we can avoid a large amount of unnecessary sleeping and overhead of moving tasks in and out of wait queue. Allowing optimistic spinning before putting the writer on the wait queue reduces wait queue contention and provided greater chance for the rwsem to get acquired. With these changes, rwsem is on par with mutex. Reviewed-by: Ingo Molnar Reviewed-by: Peter Zijlstra Reviewed-by: Peter Hurley Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/rwsem.h |7 ++- kernel/rwsem.c| 19 +- lib/rwsem.c | 201 - 3 files changed, 206 insertions(+), 21 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 0616ffe..aba7920 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -22,10 +22,13 @@ struct rw_semaphore; #include /* use a generic implementation */ #else /* All arch specific implementations share the same struct */ +struct mcs_spinlock; struct rw_semaphore { longcount; raw_spinlock_t wait_lock; struct list_headwait_list; + struct task_struct *owner; /* write owner */ + struct mcs_spinlock *mcs_lock; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif @@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #define __RWSEM_INITIALIZER(name) \ { RWSEM_UNLOCKED_VALUE, \ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ - LIST_HEAD_INIT((name).wait_list) \ + LIST_HEAD_INIT((name).wait_list), \ + NULL, \ + NULL \ __RWSEM_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(name) \ diff --git a/kernel/rwsem.c b/kernel/rwsem.c index cfff143..d74d1c9 100644 --- a/kernel/rwsem.c +++ b/kernel/rwsem.c @@ -12,6 +12,16 @@ #include +static inline void rwsem_set_owner(struct rw_semaphore *sem) +{ + sem->owner = current; +} + +static inline void rwsem_clear_owner(struct rw_semaphore *sem) +{ + sem->owner = NULL; +} + /* * lock for reading */ @@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem) rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write); @@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem) { int ret = __down_write_trylock(sem); - if (ret == 1) + if (ret == 1) { rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_); + rwsem_set_owner(sem); + } return ret; } @@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem) rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(up_write); @@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem) * dependency. 
*/ __downgrade_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(downgrade_write); @@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest) rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(_down_write_nest_lock); @@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass) rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write_nested); diff --git a/lib/rwsem.c b/lib/rwsem.c index 1d6e6e8..cc3b33e 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include /* * Initialize an rwsem: @@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, sem->count = RWSEM_UNLOCKED_VALUE; raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); + sem->owner = NULL; + sem->mcs_lock = NULL; } EXPORT_SYMBOL(__init_rwsem); @@ -194,14 +198,177 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem) return sem; } +static inline int rwsem_try_write_lock(long count,
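The heart of the optimistic spinning described in the changelog is spinning only while the writer recorded in the sem->owner field added above is still running on a CPU, and giving up (and queueing) as soon as that owner sleeps. A simplified sketch of that loop follows; the function name is illustrative, and the corresponding code in the diff above (truncated in this archive) additionally protects the owner dereference and has further bail-out checks:

	static bool spin_on_writer(struct rw_semaphore *sem, struct task_struct *owner)
	{
		while (ACCESS_ONCE(sem->owner) == owner) {
			if (!owner->on_cpu || need_resched())
				return false;		/* owner slept or we must yield: stop spinning, queue up */
			arch_mutex_cpu_relax();
		}
		return true;				/* owner changed or released the lock: retry the trylock */
	}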
[PATCH v8 2/9] rwsem: remove 'out' label in do_wake
That make code simple and more readable. Signed-off-by: Alex Shi --- lib/rwsem.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 19c5fa9..42f1b1a 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * will block as they will notice the queued writer. */ wake_up_process(waiter->task); - goto out; + return sem; } /* Writers might steal the lock before we grant it to the next reader. @@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) /* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) - goto out; + return sem; /* Last active locker left. Retry waking readers. */ goto try_reader_grant; } @@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) sem->wait_list.next = next; next->prev = &sem->wait_list; - out: return sem; } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v8 9/9] rwsem: reduce spinlock contention in wakeup code path
With the 3.12-rc2 kernel, there is sizable spinlock contention on the rwsem wakeup code path when running AIM7's high_systime workload on a 8-socket 80-core DL980 (HT off) as reported by perf: 7.64% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--41.77%-- rwsem_wake 1.61% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq |--92.37%-- rwsem_down_write_failed That was 4.7% of recorded CPU cycles. On a large NUMA machine, it is entirely possible that a fairly large number of threads are queuing up in the ticket spinlock queue to do the wakeup operation. In fact, only one will be needed. This patch tries to reduce spinlock contention by doing just that. A new wakeup field is added to the rwsem structure. This field is set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a thread is pending to do the wakeup call. It is cleared on exit from those functions. There is no size increase in 64-bit systems and a 4 bytes size increase in 32-bit systems. By checking if the wakeup flag is set, a thread can exit rwsem_wake() immediately if another thread is pending to do the wakeup instead of waiting to get the spinlock and find out that nothing need to be done. The setting of the wakeup flag may not be visible on all processors in some architectures. However, this won't affect program correctness. The clearing of the wakeup flag before spin_unlock and other barrier-type atomic instructions will ensure that it is visible to all processors. With this patch alone, the performance improvement on jobs per minute (JPM) of an sample run of the high_systime workload (at 1500 users) on DL980 was as follows: HT JPM w/o patch JPM with patch % Change -- - -- off148265 170896+15.3% on 140078 159319+13.7% The new perf profile (HT off) was as follows: 2.96% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave |--0.94%-- rwsem_wake 1.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq |--88.70%-- rwsem_down_write_failed Together with the rest of rwsem patches in the series, the JPM number (HT off) jumps to 195041 which is 32% better. 
Signed-off-by: Waiman Long --- include/linux/rwsem.h |2 ++ lib/rwsem.c | 29 + 2 files changed, 31 insertions(+), 0 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index aba7920..29314d3 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -26,6 +26,7 @@ struct mcs_spinlock; struct rw_semaphore { longcount; raw_spinlock_t wait_lock; + int wakeup; /* Waking-up in progress flag */ struct list_headwait_list; struct task_struct *owner; /* write owner */ struct mcs_spinlock *mcs_lock; @@ -61,6 +62,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #define __RWSEM_INITIALIZER(name) \ { RWSEM_UNLOCKED_VALUE, \ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ + 0,\ LIST_HEAD_INIT((name).wait_list), \ NULL, \ NULL \ diff --git a/lib/rwsem.c b/lib/rwsem.c index cc3b33e..1adee01 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -27,6 +27,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, lockdep_init_map(&sem->dep_map, name, key, 0); #endif sem->count = RWSEM_UNLOCKED_VALUE; + sem->wakeup = 0; raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); sem->owner = NULL; @@ -70,6 +71,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) struct list_head *next; long woken, loop, adjustment; + sem->wakeup = 1;/* Waking up in progress */ waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); if (waiter->type == RWSEM_WAITING_FOR_WRITE) { if (wake_type == RWSEM_WAKE_ANY) @@ -79,6 +81,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * will block as they will notice the queued writer. */ wake_up_process(waiter->task); + sem->wakeup = 0;/* Wakeup done */ return sem; } @@ -87,6 +90,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * so we can bail out early if a writer stole the lock. */ adjustment = 0; + sem->wakeup = 0; if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; while (1) { @@ -426,11 +430,36 @@ struct rw_semaphore *rwsem_wake(st
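The part of the change that realizes the saving is the early-exit check in rwsem_wake(), whose hunk is truncated above. The shape of the check the changelog describes is roughly the following sketch, not the verbatim diff:

	struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
	{
		unsigned long flags;

		if (ACCESS_ONCE(sem->wakeup))
			return sem;	/* another thread is already doing the wakeup: skip the wait_lock */

		raw_spin_lock_irqsave(&sem->wait_lock, flags);
		if (!list_empty(&sem->wait_list))
			sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

		return sem;
	}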
[PATCH 1/2] fs/superblock: Unregister sb shrinker before ->kill_sb()
Dave, I've created the two patches we've discussed on the mail thread regarding avoiding grab_super_passive during memory reclaim. Tim From: Dave Chinner We will like to unregister the sb shrinker before ->kill_sb(). This will allow cached objects to be counted without call to grab_super_passive() to update ref count on sb. We want to avoid locking during memory reclamation especially when we are skipping the memory reclaim when we are out of cached objects. This is safe because grab_super_passive does a try-lock on the sb->s_umount now, and so if we are in the unmount process, it won't ever block. That means what used to be a deadlock and races we were avoiding by using grab_super_passive() is now: shrinkerumount down_read(shrinker_rwsem) down_write(sb->s_umount) shrinker_unregister down_write(shrinker_rwsem) grab_super_passive(sb) down_read_trylock(sb->s_umount) up_read(shrinker_rwsem) up_write(shrinker_rwsem) ->kill_sb() So it is safe to deregister the shrinker before ->kill_sb(). Signed-off-by: Tim Chen --- fs/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/super.c b/fs/super.c index 73d0952..b724f35 100644 --- a/fs/super.c +++ b/fs/super.c @@ -324,10 +324,10 @@ void deactivate_locked_super(struct super_block *s) struct file_system_type *fs = s->s_type; if (atomic_dec_and_test(&s->s_active)) { cleancache_invalidate_fs(s); + unregister_shrinker(&s->s_shrink); fs->kill_sb(s); /* caches are now gone, we can safely kill the shrinker now */ - unregister_shrinker(&s->s_shrink); list_lru_destroy(&s->s_dentry_lru); list_lru_destroy(&s->s_inode_lru); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
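The property the reasoning above relies on is that grab_super_passive() only ever try-locks the superblock, so a shrinker running concurrently with unmount backs off instead of blocking. A sketch of the relevant part of that existing helper (not code from this patch; the reference-count handling around it is omitted and approximate):

	/* inside grab_super_passive(sb), after taking a passive s_count reference: */
	if (down_read_trylock(&sb->s_umount)) {
		if (sb->s_root)
			return true;		/* superblock still alive, safe to use */
		up_read(&sb->s_umount);
	}
	return false;				/* unmount holds s_umount for write: back off, never block */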
Re: [PATCH] Avoid useless inodes and dentries reclamation
On Fri, 2013-09-06 at 10:55 +1000, Dave Chinner wrote: > On Tue, Sep 03, 2013 at 11:38:27AM -0700, Tim Chen wrote: > > On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote: > > > On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote: > > > > > > > > > > > > Signed-off-by: Tim Chen > > > > --- > > > > diff --git a/fs/super.c b/fs/super.c > > > > index 73d0952..4df1fab 100644 > > > > --- a/fs/super.c > > > > +++ b/fs/super.c > > > > @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct > > > > shrinker *shrink, > > > > > > > > sb = container_of(shrink, struct super_block, s_shrink); > > > > > > > > - if (!grab_super_passive(sb)) > > > > - return 0; > > > > - > > > > > > I think the function needs a comment explaining why we aren't > > > grabbing the sb here, otherwise people are going to read the code > > > and ask why it's different to the scanning callout. > > > > > > > if (sb->s_op && sb->s_op->nr_cached_objects) > > > > total_objects = sb->s_op->nr_cached_objects(sb, > > > > sc->nid); > > > > > > > Yes, those comments are needed. > > I also need to remove the corresponding > > drop_super(sb); > > > > So probably something like: > > > > --- > > diff --git a/fs/super.c b/fs/super.c > > index 73d0952..7b5a6e5 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker > > *shrink, > > > > sb = container_of(shrink, struct super_block, s_shrink); > > > > - if (!grab_super_passive(sb)) > > - return 0; > > - > > + /* > > +* Don't call grab_super_passive as it is a potential > > +* scalability bottleneck. The counts could get updated > > +* between super_cache_count and super_cache_scan anyway. > > +* Call to super_cache_count with shrinker_rwsem held > > +* ensures the safety of call to list_lru_count_node() and > > +* s_op->nr_cached_objects(). > > +*/ > > Well, that's not true of s_op->nr_cached_objects() right now. It's > only going to be true if the shrinker deregistration is moved before > ->kill_sb() > > > > Let me have a bit more of a think about this - the solution may > > > simply be unregistering the shrinker before we call ->kill_sb() so > > > the shrinker can't get called while we are tearing down the fs. > > > First, though, I need to go back and remind myself of why I put that > > > after ->kill_sb() in the first place. > > > > Seems very reasonable as I haven't found a case where the shrinker > > is touched in ->kill_sb() yet. It looks like unregistering the > > shrinker before ->kill_sb() should be okay. > > Having looked at it some more, I have to agree. I think the original > reason for unregistering the shrinker there was to avoid problems > with locking - the shrinker callouts are run holding the > shrinker_rwsem in read mode, and then we lock the sb->s_umount in > read mount. In the unmount case, we currently take the sb->s_umount > lock in write mode (thereby locking out the shrinker) but we drop it > before deregistering the shrinker and so there is no inverted > locking order. > > The thing is, grab_super_passive does a try-lock on the sb->s_umount > now, and so if we are in the unmount process, it won't ever block. 
> That means what used to be a deadlock and races we were avoiding > by using grab_super_passive() is now: > > shrinkerumount > > down_read(shrinker_rwsem) > down_write(sb->s_umount) > shrinker_unregister > down_write(shrinker_rwsem) > > grab_super_passive(sb) > down_read_trylock(sb->s_umount) > > > > > up_read(shrinker_rwsem) > > > up_write(shrinker_rwsem) > ->kill_sb() > > > And so it appears to be safe to deregister the shrinker before > ->kill_sb(). > > Can you do this as two patches? The first moves the shrinker > deregistration to before ->kill_sb(), then second is the above patch > that drops the grab-super_passive() calls from the ->count_objects > function? I've sent the patches on a separate mail thread. Please add your sign-off if the patches look okay. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] fs/superblock: Avoid locking counting inodes and dentries before reclaiming them
We remove the call to grab_super_passive in call to super_cache_count. This becomes a scalability bottleneck as multiple threads are trying to do memory reclamation, e.g. when we are doing large amount of file read and page cache is under pressure. The cached objects quickly got reclaimed down to 0 and we are aborting the cache_scan() reclaim. But counting creates a log jam acquiring the sb_lock. We are holding the shrinker_rwsem which ensures the safety of call to list_lru_count_node() and s_op->nr_cached_objects. The shrinker is unregistered now before ->kill_sb() so the operation is safe when we are doing unmount. Signed-off-by: Tim Chen --- fs/super.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/super.c b/fs/super.c index b724f35..b5c9fdf 100644 --- a/fs/super.c +++ b/fs/super.c @@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker *shrink, sb = container_of(shrink, struct super_block, s_shrink); - if (!grab_super_passive(sb)) - return 0; - + /* +* Don't call grab_super_passive as it is a potential +* scalability bottleneck. The counts could get updated +* between super_cache_count and super_cache_scan anyway. +* Call to super_cache_count with shrinker_rwsem held +* ensures the safety of call to list_lru_count_node() and +* s_op->nr_cached_objects(). +*/ if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc->nid); @@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker *shrink, sc->nid); total_objects = vfs_pressure_ratio(total_objects); - drop_super(sb); return total_objects; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 0/9] rwsem performance optimizations
On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote: > * Tim Chen wrote: > > > For version 8 of the patchset, we included the patch from Waiman to > > streamline wakeup operations and also optimize the MCS lock used in > > rwsem and mutex. > > I'd be feeling a lot easier about this patch series if you also had > performance figures that show how mmap_sem is affected. > > These: > > > Tim got the following improvement for exim mail server > > workload on 40 core system: > > > > Alex+Tim's patchset: +4.8% > > Alex+Tim+Waiman's patchset:+5.3% > > appear to be mostly related to the anon_vma->rwsem. But once that lock is > changed to an rwlock_t, this measurement falls away. > > Peter Zijlstra suggested the following testcase: > > ===> > In fact, try something like this from userspace: > > n-threads: > > pthread_mutex_lock(&mutex); > foo = mmap(); > pthread_mutex_lock(&mutex); > > /* work */ > > pthread_mutex_unlock(&mutex); > munma(foo); > pthread_mutex_unlock(&mutex); > > vs > > n-threads: > > foo = mmap(); > /* work */ > munmap(foo); Ingo, I ran the vanilla kernel, the kernel with all rwsem patches and the kernel with all patches except the optimistic spin one. I am listing two presentations of the data. Please note that there is about 5% run-run variation. % change in performance vs vanilla kernel #threadsall without optspin mmap only 1 1.9%1.6% 5 43.8% 2.6% 10 22.7% -3.0% 20 -12.0% -4.5% 40 -26.9% -2.0% mmap with mutex acquisition 1 -2.1% -3.0% 5 -1.9% 1.0% 10 4.2%12.5% 20 -4.1% 0.6% 40 -2.8% -1.9% The optimistic spin case does very well at low to moderate contentions, but worse when there are very heavy contentions for the pure mmap case. For the case with pthread mutex, there's not much change from vanilla kernel. % change in performance of the mmap with pthread-mutex vs pure mmap #threadsvanilla all without optspin 1 3.0%-1.0% -1.7% 5 7.2%-26.8% 5.5% 10 5.2%-10.6% 22.1% 20 6.8%16.4% 12.5% 40 -0.2% 32.7% 0.0% In general, vanilla and no-optspin case perform better with pthread-mutex. For the case with optspin, mmap with pthread-mutex is worse at low to moderate contention and better at high contention. Tim > > I've had reports that the former was significantly faster than the > latter. > <=== > > this could be put into a standalone testcase, or you could add it as a new > subcommand of 'perf bench', which already has some pthread code, see for > example in tools/perf/bench/sched-messaging.c. Adding: > >perf bench mm threads > > or so would be a natural thing to have. > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Tue, 2013-10-08 at 16:51 -0300, Rafael Aquini wrote: > On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote: > > We will need the MCS lock code for doing optimistic spinning for rwsem. > > Extracting the MCS code from mutex.c and put into its own file allow us > > to reuse this code easily for rwsem. > > > > Reviewed-by: Ingo Molnar > > Reviewed-by: Peter Zijlstra > > Signed-off-by: Tim Chen > > Signed-off-by: Davidlohr Bueso > > --- > > include/linux/mcs_spinlock.h | 64 > > ++ > > include/linux/mutex.h|5 ++- > > kernel/mutex.c | 60 -- > > 3 files changed, 74 insertions(+), 55 deletions(-) > > create mode 100644 include/linux/mcs_spinlock.h > > > > diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h > > new file mode 100644 > > index 000..b5de3b0 > > --- /dev/null > > +++ b/include/linux/mcs_spinlock.h > > @@ -0,0 +1,64 @@ > > +/* > > + * MCS lock defines > > + * > > + * This file contains the main data structure and API definitions of MCS > > lock. > > + * > > + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple > > spin-lock > > + * with the desirable properties of being fair, and with each cpu trying > > + * to acquire the lock spinning on a local variable. > > + * It avoids expensive cache bouncings that common test-and-set spin-lock > > + * implementations incur. > > + */ > > nitpick: > > I believe you need > > +#include > > here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined > (arch/s390 is one case) Probably +#include should be added instead? It defines arch_mutex_cpu_relax when there's no architecture specific version. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
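For reference, the generic fallback Tim is pointing at is the one visible at the bottom of the mutex.h context in patch 5/6: when an architecture does not provide its own arch_mutex_cpu_relax(), linux/mutex.h defines it in terms of cpu_relax(), roughly as below (the guard condition is from the contemporaneous tree; the diff above only shows the define itself):

	/* include/linux/mutex.h -- fallback when no arch-specific version exists */
	#ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
	#define arch_mutex_cpu_relax()	cpu_relax()
	#endif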
Re: [PATCH v8 0/9] rwsem performance optimizations
On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote: > * Tim Chen wrote: > > > Ingo, > > > > I ran the vanilla kernel, the kernel with all rwsem patches and the > > kernel with all patches except the optimistic spin one. I am listing > > two presentations of the data. Please note that there is about 5% > > run-run variation. > > > > % change in performance vs vanilla kernel > > #threadsall without optspin > > mmap only > > 1 1.9%1.6% > > 5 43.8% 2.6% > > 10 22.7% -3.0% > > 20 -12.0% -4.5% > > 40 -26.9% -2.0% > > mmap with mutex acquisition > > 1 -2.1% -3.0% > > 5 -1.9% 1.0% > > 10 4.2%12.5% > > 20 -4.1% 0.6% > > 40 -2.8% -1.9% > > Silly question: how do the two methods of starting N threads compare to > each other? They both started N pthreads and run for a fixed time. The throughput of pure mmap with mutex is below vs pure mmap is below: % change in performance of the mmap with pthread-mutex vs pure mmap #threadsvanilla all rwsem without optspin patches 1 3.0%-1.0% -1.7% 5 7.2%-26.8% 5.5% 10 5.2%-10.6% 22.1% 20 6.8%16.4% 12.5% 40 -0.2% 32.7% 0.0% So with mutex, the vanilla kernel and the one without optspin both run faster. This is consistent with what Peter reported. With optspin, the picture is more mixed, with lower throughput at low to moderate number of threads and higher throughput with high number of threads. > Do they have identical runtimes? Yes, they both have identical runtimes. I look at the number of mmap and munmap operations I could push through. > I think PeterZ's point was > that the pthread_mutex case, despite adding extra serialization, actually > runs faster in some circumstances. Yes, I also see the pthread mutex run faster for the vanilla kernel from the data above. > > Also, mind posting the testcase? What 'work' do the threads do - clear > some memory area? The test case do simple mmap and munmap 1MB memory per iteration. > How big is the memory area? 1MB The two cases are created as: #define MEMSIZE (1 * 1024 * 1024) char *testcase_description = "Anonymous memory mmap/munmap of 1MB"; void testcase(unsigned long long *iterations) { while (1) { char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); assert(c != MAP_FAILED); munmap(c, MEMSIZE); (*iterations)++; } } and adding mutex to serialize: #define MEMSIZE (1 * 1024 * 1024) char *testcase_description = "Anonymous memory mmap/munmap of 1MB with mutex"; pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; void testcase(unsigned long long *iterations) { while (1) { pthread_mutex_lock(&mutex); char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); assert(c != MAP_FAILED); munmap(c, MEMSIZE); pthread_mutex_unlock(&mutex); (*iterations)++; } } and run as a pthread. > > I'd expect this to be about large enough mmap()s showing page fault > processing to be mmap_sem bound and the serialization via pthread_mutex() > sets up a 'train' of threads in one case, while the parallel startup would > run into the mmap_sem in the regular case. > > So I'd expect this to be a rather sensitive workload and you'd have to > actively engineer it to hit the effect PeterZ mentioned. I could imagine > MPI workloads to run into such patterns - but not deterministically. > > Only once you've convinced yourself that you are hitting that kind of > effect reliably on the vanilla kernel, could/should the effects of an > improved rwsem implementation be measured. 
> > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoid useless inodes and dentries reclamation
> > Signed-off-by: Tim Chen > > --- > > fs/super.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/fs/super.c b/fs/super.c > > index 68307c0..70fa26c 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = { > > * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we > > * take a passive reference to the superblock to avoid this from occurring. > > */ > > +#define SB_CACHE_LOW 5 > > static int prune_super(struct shrinker *shrink, struct shrink_control *sc) > > { > > struct super_block *sb; > > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct > > shrink_control *sc) > > if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS)) > > return -1; > > > > + /* > > +* Don't prune if we have few cached objects to reclaim to > > +* avoid useless sb_lock contention > > +*/ > > + if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW) > > + return -1; > > Those counters no longer exist in the current mmotm tree and the > shrinker infrastructure is somewhat different, so this patch isn't > the right way to solve this problem. These changes in mmotm tree do complicate solutions for this scalability issue. > > Given that superblock LRUs and shrinkers in mmotm are node aware, > there may even be more pressure on the sblock in such a workload. I > think the right way to deal with this is to give the shrinker itself > a "minimum call count" so that we can avoid even attempting to > shrink caches that does have enough entries in them to be worthwhile > shrinking. By "minimum call count", do you mean tracking the number of free entries per node in the shrinker, and invoking shrinker only when the number of free entries exceed "minimum call count"? There is some cost in list_lru_count_node to get the free entries, as we need to acquire the node's lru lock. Alternatively, we can set a special flag/node by list_add or list_del when count goes above/below a threshold and invoke shrinker based on this flag. Or do you mean that if we do not reap any memory in a shrink operation, we do a certain number of backoffs of shrink operation till the "minimum call count" is reached? > > That said, the memcg guys have been saying that even small numbers > of items per cache can be meaningful in terms of memory reclaim > (e.g. when there are lots of memcgs) then such a threshold might > only be appropriate for caches that are not memcg controlled. I've done some experiment with the CACHE thresholds. Even setting the threshold at 0 (i.e. there're no free entries) remove almost all the needless contentions. That should make the memcg guys happy by not holding any extra free entries. > In > that case, handling it in the shrinker infrastructure itself is a > much better idea than hacking thresholds into individual shrinker > callouts. Currently the problem is mostly with the sb shrinker due to the sb_lock. If we can have a general solution, that will be even better. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Avoid useless inodes and dentries reclamation
On Fri, 2013-08-30 at 11:40 +1000, Dave Chinner wrote: > > The new shrinker infrastructure has a ->count_objects() callout to > specifically return the number of objects in the cache. > shrink_slab_node() can check that return value against the "minimum > call count" and determine whether it needs to call ->scan_objects() > at all. > > Actually, the shrinker already behaves like this with the batch_size > variable - the shrinker has to be asking for more items to be > scanned than the batch size. That means the problem is that counting > callouts are causing the problem, not the scanning callouts. > > With the new code in the mmotm tree, for counting purposes we > probably don't need to grab a passive superblock reference at all - > the superblock and LRUs are guaranteed to be valid if the shrinker > is currently running, but we don't really care if the superblock is > being shutdown and the values that come back are invalid because the > ->scan_objects() callout will fail to grab the superblock to do > anything with the calculated values. If that's the case, then we should remove grab_super_passive from the super_cache_count code. That should remove the bottleneck in reclamation. Thanks for your detailed explanation. Tim Signed-off-by: Tim Chen --- diff --git a/fs/super.c b/fs/super.c index 73d0952..4df1fab 100644 --- a/fs/super.c +++ b/fs/super.c @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker *shrink, sb = container_of(shrink, struct super_block, s_shrink); - if (!grab_super_passive(sb)) - return 0; - if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc->nid); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
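The mechanism Dave describes above -- skipping the scan callout when the count callout does not report enough objects -- can be sketched at the shrink_slab_node() level as follows. Names and structure are simplified; the real loop also scales total_scan by the reclaim pressure and carries deferred work between calls:

	/* inside a shrink_slab_node()-style loop */
	unsigned long freed = 0, total_scan, batch_size;
	unsigned long freeable = shrinker->count_objects(shrinker, shrinkctl);

	if (freeable == 0)
		return 0;			/* empty cache: never call ->scan_objects() at all */

	batch_size = shrinker->batch ? : SHRINK_BATCH;
	total_scan = freeable;			/* in reality scaled by scan pressure and deferral */

	while (total_scan >= batch_size) {
		shrinkctl->nr_to_scan = batch_size;
		freed += shrinker->scan_objects(shrinker, shrinkctl);
		total_scan -= batch_size;
	}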
Re: [PATCH] Avoid useless inodes and dentries reclamation
On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote: > On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote: > > > > > > Signed-off-by: Tim Chen > > --- > > diff --git a/fs/super.c b/fs/super.c > > index 73d0952..4df1fab 100644 > > --- a/fs/super.c > > +++ b/fs/super.c > > @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker > > *shrink, > > > > sb = container_of(shrink, struct super_block, s_shrink); > > > > - if (!grab_super_passive(sb)) > > - return 0; > > - > > I think the function needs a comment explaining why we aren't > grabbing the sb here, otherwise people are going to read the code > and ask why it's different to the scanning callout. > > > if (sb->s_op && sb->s_op->nr_cached_objects) > > total_objects = sb->s_op->nr_cached_objects(sb, > > sc->nid); > Yes, those comments are needed. I also need to remove the corresponding drop_super(sb); So probably something like: --- diff --git a/fs/super.c b/fs/super.c index 73d0952..7b5a6e5 100644 --- a/fs/super.c +++ b/fs/super.c @@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker *shrink, sb = container_of(shrink, struct super_block, s_shrink); - if (!grab_super_passive(sb)) - return 0; - + /* +* Don't call grab_super_passive as it is a potential +* scalability bottleneck. The counts could get updated +* between super_cache_count and super_cache_scan anyway. +* Call to super_cache_count with shrinker_rwsem held +* ensures the safety of call to list_lru_count_node() and +* s_op->nr_cached_objects(). +*/ if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc->nid); @@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker *shrink, sc->nid); total_objects = vfs_pressure_ratio(total_objects); - drop_super(sb); return total_objects; } > But seeing this triggered further thought on my part. Being called > during unmount means that ->nr_cached_objects implementations need > to be robust against unmount tearing down private filesystem > structures. Right now, grab_super_passive() protects us from that > because it won't be able to get the sb->s_umount lock while > generic_shutdown_super() is doing it's work. > > IOWs, the superblock based shrinker operations are safe because the > structures don't get torn down until after the shrinker is > unregistered. That's not true for the structures that > ->nr_cached_objects() use: ->put_super() tears them down before the > shrinker is unregistered and only grab_super_passive() protects us > from thay. > > Let me have a bit more of a think about this - the solution may > simply be unregistering the shrinker before we call ->kill_sb() so > the shrinker can't get called while we are tearing down the fs. > First, though, I need to go back and remind myself of why I put that > after ->kill_sb() in the first place. Seems very reasonable as I haven't found a case where the shrinker is touched in ->kill_sb() yet. It looks like unregistering the shrinker before ->kill_sb() should be okay. > If we unregister the shrinker > before ->kill_sb is called, then we can probably get rid of > grab_super_passive() in both shrinker callouts because they will no > longer need to handle running concurrently with ->kill_sb() > Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Thu, 2013-09-26 at 10:40 +0200, Peter Zijlstra wrote: > On Thu, Sep 26, 2013 at 08:46:29AM +0200, Ingo Molnar wrote: > > > +/* > > > + * MCS lock defines > > > + * > > > + * This file contains the main data structure and API definitions of MCS > > > lock. > > > > A (very) short blurb about what an MCS lock is would be nice here. > > A while back I suggested including a link to something like: > > http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf > > Its a fairly concise write-up of the idea; only 6 pages. The sad part > about linking to the web is that links tend to go dead after a while. Link rot is a problem. If I provide a few details about MCS lock I think people should be able to google for it. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Thu, 2013-09-26 at 14:09 -0700, Jason Low wrote: > On Thu, 2013-09-26 at 13:40 -0700, Davidlohr Bueso wrote: > > On Thu, 2013-09-26 at 13:23 -0700, Jason Low wrote: > > > On Thu, 2013-09-26 at 13:06 -0700, Davidlohr Bueso wrote: > > > > On Thu, 2013-09-26 at 12:27 -0700, Jason Low wrote: > > > > > On Wed, Sep 25, 2013 at 3:10 PM, Tim Chen > > > > > wrote: > > > > > > We will need the MCS lock code for doing optimistic spinning for > > > > > > rwsem. > > > > > > Extracting the MCS code from mutex.c and put into its own file > > > > > > allow us > > > > > > to reuse this code easily for rwsem. > > > > > > > > > > > > Signed-off-by: Tim Chen > > > > > > Signed-off-by: Davidlohr Bueso > > > > > > --- > > > > > > include/linux/mcslock.h | 58 > > > > > > +++ > > > > > > kernel/mutex.c | 58 > > > > > > +- > > > > > > 2 files changed, 65 insertions(+), 51 deletions(-) > > > > > > create mode 100644 include/linux/mcslock.h > > > > > > > > > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > > > > > new file mode 100644 > > > > > > index 000..20fd3f0 > > > > > > --- /dev/null > > > > > > +++ b/include/linux/mcslock.h > > > > > > @@ -0,0 +1,58 @@ > > > > > > +/* > > > > > > + * MCS lock defines > > > > > > + * > > > > > > + * This file contains the main data structure and API definitions > > > > > > of MCS lock. > > > > > > + */ > > > > > > +#ifndef __LINUX_MCSLOCK_H > > > > > > +#define __LINUX_MCSLOCK_H > > > > > > + > > > > > > +struct mcs_spin_node { > > > > > > + struct mcs_spin_node *next; > > > > > > + int locked; /* 1 if lock acquired */ > > > > > > +}; > > > > > > + > > > > > > +/* > > > > > > + * We don't inline mcs_spin_lock() so that perf can correctly > > > > > > account for the > > > > > > + * time spent in this lock function. > > > > > > + */ > > > > > > +static noinline > > > > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct > > > > > > mcs_spin_node *node) > > > > > > +{ > > > > > > + struct mcs_spin_node *prev; > > > > > > + > > > > > > + /* Init node */ > > > > > > + node->locked = 0; > > > > > > + node->next = NULL; > > > > > > + > > > > > > + prev = xchg(lock, node); > > > > > > + if (likely(prev == NULL)) { > > > > > > + /* Lock acquired */ > > > > > > + node->locked = 1; > > > > > > > > > > If we don't spin on the local node, is it necessary to set this > > > > > variable? > > > > > > > > I don't follow, the whole idea is to spin on the local variable. > > > > > > If prev == NULL, doesn't that mean it won't proceed to spin on the > > > variable because the lock is already free and we call return? In that > > > case where we directly acquire the lock, I was wondering if it is > > > necessary to set node->locked = 1. > > > > Yes, that's true, but we need to flag the lock as acquired (the node's > > lock is initially set to unlocked), otherwise others trying to acquire > > the lock can spin forever: > > > > /* Wait until the lock holder passes the lock down */ > > while (!ACCESS_ONCE(node->locked)) > > arch_mutex_cpu_relax(); > > > > The ->locked variable in this implementation refers to if the lock is > > acquired, and *not* to if busy-waiting is necessary. > > hmm, others threads acquiring the lock will be spinning on their own > local nodes, not this node's node->locked. And if prev == NULL, the > current thread won't be reading it's node->lock either since we return. > So what other thread is going to be reading this node's node->lock? 
> > Thanks, > Jason I think setting node->locked = 1 for the prev==NULL case is not necessary functionally, but was done for semantics consistency. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
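The conclusion of this sub-thread, annotated directly on the fast path for clarity (the comment is added here for explanation only; Jason's follow-up patch later in the thread simply drops the store and moves the node->locked = 0 initialization after the check):

	prev = xchg(lock, node);
	if (likely(prev == NULL)) {
		/*
		 * Uncontended case: nobody is queued behind us, and any later
		 * contender will spin on its *own* node, which the previous
		 * holder sets.  Nothing ever reads this node->locked, so the
		 * node->locked = 1 store here is only for semantic
		 * consistency, not for correctness.
		 */
		return;
	}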
[PATCH v7 6/6] rwsem: do optimistic spinning for writer lock acquisition
We want to add optimistic spinning to rwsems because the writer rwsem does not perform as well as mutexes. Tim noticed that for exim (mail server) workloads, when reverting commit 4fc3f1d6 and Davidlohr noticed it when converting the i_mmap_mutex to a rwsem in some aim7 workloads. We've noticed that the biggest difference is when we fail to acquire a mutex in the fastpath, optimistic spinning comes in to play and we can avoid a large amount of unnecessary sleeping and overhead of moving tasks in and out of wait queue. Allowing optimistic spinning before putting the writer on the wait queue reduces wait queue contention and provided greater chance for the rwsem to get acquired. With these changes, rwsem is on par with mutex. Reviewed-by: Ingo Molnar Reviewed-by: Peter Zijlstra Reviewed-by: Peter Hurley Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/rwsem.h |7 ++- kernel/rwsem.c| 19 +- lib/rwsem.c | 201 - 3 files changed, 206 insertions(+), 21 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 0616ffe..aba7920 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -22,10 +22,13 @@ struct rw_semaphore; #include /* use a generic implementation */ #else /* All arch specific implementations share the same struct */ +struct mcs_spinlock; struct rw_semaphore { longcount; raw_spinlock_t wait_lock; struct list_headwait_list; + struct task_struct *owner; /* write owner */ + struct mcs_spinlock *mcs_lock; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif @@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #define __RWSEM_INITIALIZER(name) \ { RWSEM_UNLOCKED_VALUE, \ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ - LIST_HEAD_INIT((name).wait_list) \ + LIST_HEAD_INIT((name).wait_list), \ + NULL, \ + NULL \ __RWSEM_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(name) \ diff --git a/kernel/rwsem.c b/kernel/rwsem.c index cfff143..d74d1c9 100644 --- a/kernel/rwsem.c +++ b/kernel/rwsem.c @@ -12,6 +12,16 @@ #include +static inline void rwsem_set_owner(struct rw_semaphore *sem) +{ + sem->owner = current; +} + +static inline void rwsem_clear_owner(struct rw_semaphore *sem) +{ + sem->owner = NULL; +} + /* * lock for reading */ @@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem) rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write); @@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem) { int ret = __down_write_trylock(sem); - if (ret == 1) + if (ret == 1) { rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_); + rwsem_set_owner(sem); + } return ret; } @@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem) rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(up_write); @@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem) * dependency. 
*/ __downgrade_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(downgrade_write); @@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest) rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(_down_write_nest_lock); @@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass) rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write_nested); diff --git a/lib/rwsem.c b/lib/rwsem.c index 1d6e6e8..cc3b33e 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include /* * Initialize an rwsem: @@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, sem->count = RWSEM_UNLOCKED_VALUE; raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); + sem->owner = NULL; + sem->mcs_lock = NULL; } EXPORT_SYMBOL(__init_rwsem); @@ -194,14 +198,177 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem) return sem; } +static inline int rwsem_try_write_lock(long count,
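The changelog entries in the cover letter below refer to rwsem_can_spin_on_owner(), the gate that decides whether spinning is worth attempting at all. In simplified form it amounts to the sketch below; the function in the truncated diff above also guards the owner dereference and bails out when a reschedule is pending:

	static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
	{
		struct task_struct *owner = ACCESS_ONCE(sem->owner);

		if (!owner)
			return true;	/* no tracked writer: lock is free or reader-owned, spinning may still pay off */

		return owner->on_cpu;	/* only spin if the owning writer is actually running */
	}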
[PATCH v7 0/6] rwsem: performance optimizations
We fixed various style issues for version 7 of this patchset and added brief explanations of the MCS lock code that we put in a separate file. We will like to have it merged if there are no objections. In this patchset, we introduce two categories of optimizations to read write semaphore. The first four patches from Alex Shi reduce cache bouncing of the sem->count field by doing a pre-read of the sem->count and avoid cmpxchg if possible. The last two patches introduce similar optimistic spinning logic as the mutex code for the writer lock acquisition of rwsem. This addresses the general 'mutexes out perform writer-rwsems' situations that has been seen in more than one case. Users now need not worry about performance issues when choosing between these two locking mechanisms. Without these optimizations, Davidlohr Bueso saw a -8% regression to aim7's shared and high_systime workloads when he switched i_mmap_mutex to rwsem. Tests were on 8 socket 80 cores system. With the patchset, he got significant improvements to the aim7 suite instead of regressions: alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%), shared (+18.4%) and short (+6.3%). Tim Chen also got a +5% improvements to exim mail server workload on a 40 core system. Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing this patchset. Regards, Tim Chen Changelog: v7: 1. Rename mcslock.h to mcs_spinlock.h and also rename mcs related fields with mcs prefix. 2. Properly define type of *mcs_lock field instead of leaving it as *void. 3. Added breif explanation of mcs lock. v6: 1. Fix missing mcslock.h file. 2. Fix various code style issues. v5: 1. Try optimistic spinning before we put the writer on the wait queue to avoid bottlenecking at wait queue. This provides 5% boost to exim workload and between 2% to 8% boost to aim7. 2. Put MCS locking code into its own mcslock.h file for better reuse between mutex.c and rwsem.c 3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the operations default per Ingo's suggestions. v4: 1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner 2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option v3: 1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner. 2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig v2: 1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed a bug referencing &sem->count when sem->count is intended. 2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner. the option to be on for more seasoning but can be turned off should it be detrimental. 3. Various patch comments update Alex Shi (4): rwsem: check the lock before cpmxchg in down_write_trylock rwsem: remove 'out' label in do_wake rwsem: remove try_reader_grant label do_wake rwsem/wake: check lock before do atomic update Tim Chen (2): MCS Lock: Restructure the MCS lock defines and locking code into its own file rwsem: do optimistic spinning for writer lock acquisition include/asm-generic/rwsem.h |8 +- include/linux/mcs_spinlock.h | 64 include/linux/mutex.h|5 +- include/linux/rwsem.h|7 +- kernel/mutex.c | 60 ++-- kernel/rwsem.c | 19 - lib/rwsem.c | 226 - 7 files changed, 300 insertions(+), 89 deletions(-) create mode 100644 include/linux/mcs_spinlock.h -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v7 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
We will need the MCS lock code for doing optimistic spinning for rwsem. Extracting the MCS code from mutex.c and put into its own file allow us to reuse this code easily for rwsem. Reviewed-by: Ingo Molnar Reviewed-by: Peter Zijlstra Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/mcs_spinlock.h | 64 ++ include/linux/mutex.h|5 ++- kernel/mutex.c | 60 -- 3 files changed, 74 insertions(+), 55 deletions(-) create mode 100644 include/linux/mcs_spinlock.h diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h new file mode 100644 index 000..b5de3b0 --- /dev/null +++ b/include/linux/mcs_spinlock.h @@ -0,0 +1,64 @@ +/* + * MCS lock defines + * + * This file contains the main data structure and API definitions of MCS lock. + * + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock + * with the desirable properties of being fair, and with each cpu trying + * to acquire the lock spinning on a local variable. + * It avoids expensive cache bouncings that common test-and-set spin-lock + * implementations incur. + */ +#ifndef __LINUX_MCS_SPINLOCK_H +#define __LINUX_MCS_SPINLOCK_H + +struct mcs_spinlock { + struct mcs_spinlock *next; + int locked; /* 1 if lock acquired */ +}; + +/* + * We don't inline mcs_spin_lock() so that perf can correctly account for the + * time spent in this lock function. + */ +static noinline +void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node) +{ + struct mcs_spinlock *prev; + + /* Init node */ + node->locked = 0; + node->next = NULL; + + prev = xchg(lock, node); + if (likely(prev == NULL)) { + /* Lock acquired */ + node->locked = 1; + return; + } + ACCESS_ONCE(prev->next) = node; + smp_wmb(); + /* Wait until the lock holder passes the lock down */ + while (!ACCESS_ONCE(node->locked)) + arch_mutex_cpu_relax(); +} + +static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node) +{ + struct mcs_spinlock *next = ACCESS_ONCE(node->next); + + if (likely(!next)) { + /* +* Release the lock by setting it to NULL +*/ + if (cmpxchg(lock, node, NULL) == node) + return; + /* Wait until the next pointer is set */ + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + } + ACCESS_ONCE(next->locked) = 1; + smp_wmb(); +} + +#endif /* __LINUX_MCS_SPINLOCK_H */ diff --git a/include/linux/mutex.h b/include/linux/mutex.h index ccd4260..e6eaeea 100644 --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -46,6 +46,7 @@ * - detects multi-task circular deadlocks and prints out all affected * locks and tasks (and only those tasks) */ +struct mcs_spinlock; struct mutex { /* 1: unlocked, 0: locked, negative: locked, possible waiters */ atomic_tcount; @@ -55,7 +56,7 @@ struct mutex { struct task_struct *owner; #endif #ifdef CONFIG_MUTEX_SPIN_ON_OWNER - void*spin_mlock;/* Spinner MCS lock */ + struct mcs_spinlock *mcs_lock; /* Spinner MCS lock */ #endif #ifdef CONFIG_DEBUG_MUTEXES const char *name; @@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); #define arch_mutex_cpu_relax() cpu_relax() #endif -#endif +#endif /* __LINUX_MUTEX_H */ diff --git a/kernel/mutex.c b/kernel/mutex.c index 6d647ae..4640731 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -25,6 +25,7 @@ #include #include #include +#include /* * In the DEBUG case we are using the "NULL fastpath" for mutexes, @@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key) INIT_LIST_HEAD(&lock->wait_list); mutex_clear_owner(lock); #ifdef 
CONFIG_MUTEX_SPIN_ON_OWNER - lock->spin_mlock = NULL; + lock->mcs_lock = NULL; #endif debug_mutex_init(lock, name, key); @@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock); * more or less simultaneously, the spinners need to acquire a MCS lock * first before spinning on the owner field. * - * We don't inline mspin_lock() so that perf can correctly account for the - * time spent in this lock function. */ -struct mspin_node { - struct mspin_node *next ; - int locked; /* 1 if lock acquired */ -}; -#defineMLOCK(mutex)((struct mspin_node **)&((mutex)->spin_mlock)) - -static noinline -void mspin_lock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *prev; - - /* Init node */ - node->
[PATCH v7 2/6] rwsem: remove 'out' label in do_wake
That make code simple and more readable. Signed-off-by: Alex Shi --- lib/rwsem.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 19c5fa9..42f1b1a 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * will block as they will notice the queued writer. */ wake_up_process(waiter->task); - goto out; + return sem; } /* Writers might steal the lock before we grant it to the next reader. @@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) /* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) - goto out; + return sem; /* Last active locker left. Retry waking readers. */ goto try_reader_grant; } @@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) sem->wait_list.next = next; next->prev = &sem->wait_list; - out: return sem; } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v7 1/6] rwsem: check the lock before cpmxchg in down_write_trylock
The cmpxchg causes cacheline bouncing when doing the value check, which creates a scalability issue on a large machine (like an 80-core box). A pre-read of the lock can relieve this contention. Signed-off-by: Alex Shi --- include/asm-generic/rwsem.h |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h index bb1e2cd..5ba80e7 100644 --- a/include/asm-generic/rwsem.h +++ b/include/asm-generic/rwsem.h @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem) static inline int __down_write_trylock(struct rw_semaphore *sem) { - long tmp; + if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE)) + return 0; - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, - RWSEM_ACTIVE_WRITE_BIAS); - return tmp == RWSEM_UNLOCKED_VALUE; + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE; } /* -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
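The same check-before-cmpxchg idea in generic form, for readers who want the pattern outside the rwsem context. The helper below is illustrative and not part of the patch: a plain read keeps the cacheline in shared state while the lock is busy, and the exclusive-ownership cmpxchg is only attempted when it can plausibly succeed.

	static inline int try_claim(atomic_long_t *v, long free, long claimed)
	{
		/* Cheap shared read first: avoids bouncing the line on every failed attempt. */
		if (atomic_long_read(v) != free)
			return 0;

		/* Only now pay for exclusive ownership of the cacheline. */
		return atomic_long_cmpxchg(v, free, claimed) == free;
	}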
[PATCH v7 4/6] rwsem/wake: check lock before doing atomic update
Atomically updating the lock and rolling it back causes cache bouncing on a large machine. A pre-read of the lock status can relieve this problem. Suggested-by: Davidlohr Bueso Suggested-by: Tim Chen Signed-off-by: Alex Shi --- lib/rwsem.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index a8055cf..1d6e6e8 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) struct rwsem_waiter *waiter; struct task_struct *tsk; struct list_head *next; - long oldcount, woken, loop, adjustment; + long woken, loop, adjustment; waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); if (waiter->type == RWSEM_WAITING_FOR_WRITE) { @@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; while (1) { + long oldcount; + + /* A writer stole the lock. */ + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) + return sem; + oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; if (likely(oldcount >= RWSEM_WAITING_BIAS)) -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v7 3/6] rwsem: remove try_reader_grant label in do_wake
That make code simple and more readable Signed-off-by: Alex Shi --- lib/rwsem.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 42f1b1a..a8055cf 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) adjustment = 0; if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; - try_reader_grant: - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) { - /* A writer stole the lock. Undo our reader grant. */ + while (1) { + oldcount = rwsem_atomic_update(adjustment, sem) + - adjustment; + if (likely(oldcount >= RWSEM_WAITING_BIAS)) + break; + +/* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) return sem; /* Last active locker left. Retry waking readers. */ - goto try_reader_grant; } } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Thu, 2013-09-26 at 15:42 -0700, Jason Low wrote: > On Thu, 2013-09-26 at 14:41 -0700, Tim Chen wrote: > > On Thu, 2013-09-26 at 14:09 -0700, Jason Low wrote: > > > On Thu, 2013-09-26 at 13:40 -0700, Davidlohr Bueso wrote: > > > > On Thu, 2013-09-26 at 13:23 -0700, Jason Low wrote: > > > > > On Thu, 2013-09-26 at 13:06 -0700, Davidlohr Bueso wrote: > > > > > > On Thu, 2013-09-26 at 12:27 -0700, Jason Low wrote: > > > > > > > On Wed, Sep 25, 2013 at 3:10 PM, Tim Chen > > > > > > > wrote: > > > > > > > > We will need the MCS lock code for doing optimistic spinning > > > > > > > > for rwsem. > > > > > > > > Extracting the MCS code from mutex.c and put into its own file > > > > > > > > allow us > > > > > > > > to reuse this code easily for rwsem. > > > > > > > > > > > > > > > > Signed-off-by: Tim Chen > > > > > > > > Signed-off-by: Davidlohr Bueso > > > > > > > > --- > > > > > > > > include/linux/mcslock.h | 58 > > > > > > > > +++ > > > > > > > > kernel/mutex.c | 58 > > > > > > > > +- > > > > > > > > 2 files changed, 65 insertions(+), 51 deletions(-) > > > > > > > > create mode 100644 include/linux/mcslock.h > > > > > > > > > > > > > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > > > > > > > new file mode 100644 > > > > > > > > index 000..20fd3f0 > > > > > > > > --- /dev/null > > > > > > > > +++ b/include/linux/mcslock.h > > > > > > > > @@ -0,0 +1,58 @@ > > > > > > > > +/* > > > > > > > > + * MCS lock defines > > > > > > > > + * > > > > > > > > + * This file contains the main data structure and API > > > > > > > > definitions of MCS lock. > > > > > > > > + */ > > > > > > > > +#ifndef __LINUX_MCSLOCK_H > > > > > > > > +#define __LINUX_MCSLOCK_H > > > > > > > > + > > > > > > > > +struct mcs_spin_node { > > > > > > > > + struct mcs_spin_node *next; > > > > > > > > + int locked; /* 1 if lock acquired */ > > > > > > > > +}; > > > > > > > > + > > > > > > > > +/* > > > > > > > > + * We don't inline mcs_spin_lock() so that perf can correctly > > > > > > > > account for the > > > > > > > > + * time spent in this lock function. > > > > > > > > + */ > > > > > > > > +static noinline > > > > > > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct > > > > > > > > mcs_spin_node *node) > > > > > > > > +{ > > > > > > > > + struct mcs_spin_node *prev; > > > > > > > > + > > > > > > > > + /* Init node */ > > > > > > > > + node->locked = 0; > > > > > > > > + node->next = NULL; > > > > > > > > + > > > > > > > > + prev = xchg(lock, node); > > > > > > > > + if (likely(prev == NULL)) { > > > > > > > > + /* Lock acquired */ > > > > > > > > + node->locked = 1; > > > > > > > > > > > > > > If we don't spin on the local node, is it necessary to set this > > > > > > > variable? > > > > > > > > > > > > I don't follow, the whole idea is to spin on the local variable. > > > > > > > > > > If prev == NULL, doesn't that mean it won't proceed to spin on the > > > > > variable because the lock is already free and we call return? In that > > > > > case where we directly acquire the lock, I was wondering if it is > > > > > necessary to set node->locked = 1. > > > > > > >
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Fri, 2013-09-27 at 09:12 -0700, Jason Low wrote: > On Fri, 2013-09-27 at 08:02 +0200, Ingo Molnar wrote: > > Would be nice to have this as a separate, add-on patch. Every single > > instruction removal that has no downside is an upside! > > Okay, so here is a patch. Tim, would you like to add this to v7? Okay. Will do. Tim > > ... > Subject: MCS lock: Remove and reorder unnecessary assignments in > mcs_spin_lock() > > In mcs_spin_lock(), if (likely(prev == NULL)) is true, then the lock is free > and we won't spin on the local node. In that case, we don't have to assign > node->locked because it won't be used. We can also move the node->locked = 0 > assignment so that it occurs after the if (likely(prev == NULL)) check. > > This might also help make it clearer as to how the node->locked variable > is used in MCS locks. > > Signed-off-by: Jason Low > --- > include/linux/mcslock.h |3 +-- > 1 files changed, 1 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > index 20fd3f0..1167d57 100644 > --- a/include/linux/mcslock.h > +++ b/include/linux/mcslock.h > @@ -21,15 +21,14 @@ void mcs_spin_lock(struct mcs_spin_node **lock, struct > mcs_spin_node *node) > struct mcs_spin_node *prev; > > /* Init node */ > - node->locked = 0; > node->next = NULL; > > prev = xchg(lock, node); > if (likely(prev == NULL)) { > /* Lock acquired */ > - node->locked = 1; > return; > } > + node->locked = 0; > ACCESS_ONCE(prev->next) = node; > smp_wmb(); > /* Wait until the lock holder passes the lock down */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote: > On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote: > > We will need the MCS lock code for doing optimistic spinning for rwsem. > > Extracting the MCS code from mutex.c and put into its own file allow us > > to reuse this code easily for rwsem. > > > > Signed-off-by: Tim Chen > > Signed-off-by: Davidlohr Bueso > > --- > > include/linux/mcslock.h | 58 > > +++ > > kernel/mutex.c | 58 > > +- > > 2 files changed, 65 insertions(+), 51 deletions(-) > > create mode 100644 include/linux/mcslock.h > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > new file mode 100644 > > index 000..20fd3f0 > > --- /dev/null > > +++ b/include/linux/mcslock.h > > @@ -0,0 +1,58 @@ > > +/* > > + * MCS lock defines > > + * > > + * This file contains the main data structure and API definitions of MCS > > lock. > > + */ > > +#ifndef __LINUX_MCSLOCK_H > > +#define __LINUX_MCSLOCK_H > > + > > +struct mcs_spin_node { > > + struct mcs_spin_node *next; > > + int locked; /* 1 if lock acquired */ > > +}; > > + > > +/* > > + * We don't inline mcs_spin_lock() so that perf can correctly account for > > the > > + * time spent in this lock function. > > + */ > > +static noinline > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node) > > +{ > > + struct mcs_spin_node *prev; > > + > > + /* Init node */ > > + node->locked = 0; > > + node->next = NULL; > > + > > + prev = xchg(lock, node); > > + if (likely(prev == NULL)) { > > + /* Lock acquired */ > > + node->locked = 1; > > + return; > > + } > > + ACCESS_ONCE(prev->next) = node; > > + smp_wmb(); > > + /* Wait until the lock holder passes the lock down */ > > + while (!ACCESS_ONCE(node->locked)) > > + arch_mutex_cpu_relax(); > > +} > > + > > +static void mcs_spin_unlock(struct mcs_spin_node **lock, struct > > mcs_spin_node *node) > > +{ > > + struct mcs_spin_node *next = ACCESS_ONCE(node->next); > > + > > + if (likely(!next)) { > > + /* > > +* Release the lock by setting it to NULL > > +*/ > > + if (cmpxchg(lock, node, NULL) == node) > > + return; > > + /* Wait until the next pointer is set */ > > + while (!(next = ACCESS_ONCE(node->next))) > > + arch_mutex_cpu_relax(); > > + } > > + ACCESS_ONCE(next->locked) = 1; > > + smp_wmb(); > > Shouldn't the memory barrier precede the "ACCESS_ONCE(next->locked) = 1;"? > Maybe in an "else" clause of the prior "if" statement, given that the > cmpxchg() does it otherwise. > > Otherwise, in the case where the "if" conditionn is false, the critical > section could bleed out past the unlock. Yes, I agree with you that the smp_wmb should be moved before ACCESS_ONCE to prevent critical section from bleeding. Copying Waiman who is the original author of the mcs code to see if he has any comments on things we may have missed. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
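To make the "bleed" concern concrete, here is the ordering at issue when node->next is already set (so the cmpxchg path and its implied barrier are never taken). The variable "data" is hypothetical, standing in for anything written inside the critical section; this is an illustration of the argument, not code from the patch.

/*
 *   current owner (unlock)                 next owner (lock slowpath)
 *   ----------------------                 --------------------------
 *   data = 42;       <- critical section   while (!ACCESS_ONCE(node->locked))
 *   ACCESS_ONCE(next->locked) = 1;                 arch_mutex_cpu_relax();
 *   smp_wmb();       <- too late           r = data;
 *
 * With the wmb issued after the handoff store, nothing orders the
 * critical-section store before "next->locked = 1", so the next owner can
 * observe locked == 1 while "data" is still stale.  Moving the barrier in
 * front of the store (the uncontended path is already covered by the
 * cmpxchg) closes the window, which is what a later patch in this thread
 * does.
 */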
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote: > On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote: > > We will need the MCS lock code for doing optimistic spinning for rwsem. > > Extracting the MCS code from mutex.c and put into its own file allow us > > to reuse this code easily for rwsem. > > > > Signed-off-by: Tim Chen > > Signed-off-by: Davidlohr Bueso > > --- > > include/linux/mcslock.h | 58 > > +++ > > kernel/mutex.c | 58 > > +- > > 2 files changed, 65 insertions(+), 51 deletions(-) > > create mode 100644 include/linux/mcslock.h > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > new file mode 100644 > > index 000..20fd3f0 > > --- /dev/null > > +++ b/include/linux/mcslock.h > > @@ -0,0 +1,58 @@ > > +/* > > + * MCS lock defines > > + * > > + * This file contains the main data structure and API definitions of MCS > > lock. > > + */ > > +#ifndef __LINUX_MCSLOCK_H > > +#define __LINUX_MCSLOCK_H > > + > > +struct mcs_spin_node { > > + struct mcs_spin_node *next; > > + int locked; /* 1 if lock acquired */ > > +}; > > + > > +/* > > + * We don't inline mcs_spin_lock() so that perf can correctly account for > > the > > + * time spent in this lock function. > > + */ > > +static noinline > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node) > > +{ > > + struct mcs_spin_node *prev; > > + > > + /* Init node */ > > + node->locked = 0; > > + node->next = NULL; > > + > > + prev = xchg(lock, node); > > + if (likely(prev == NULL)) { > > + /* Lock acquired */ > > + node->locked = 1; > > + return; > > + } > > + ACCESS_ONCE(prev->next) = node; > > + smp_wmb(); BTW, is the above memory barrier necessary? It seems like the xchg instruction already provided a memory barrier. Now if we made the changes that Jason suggested: /* Init node */ - node->locked = 0; node->next = NULL; prev = xchg(lock, node); if (likely(prev == NULL)) { /* Lock acquired */ - node->locked = 1; return; } + node->locked = 0; ACCESS_ONCE(prev->next) = node; smp_wmb(); We are probably still okay as other cpus do not read the value of node->locked, which is a local variable. Tim > > + /* Wait until the lock holder passes the lock down */ > > + while (!ACCESS_ONCE(node->locked)) > > + arch_mutex_cpu_relax(); > > +} > > + > > +static void mcs_spin_unlock(struct mcs_spin_node **lock, struct > > mcs_spin_node *node) > > +{ > > + struct mcs_spin_node *next = ACCESS_ONCE(node->next); > > + > > + if (likely(!next)) { > > + /* > > +* Release the lock by setting it to NULL > > +*/ > > + if (cmpxchg(lock, node, NULL) == node) > > + return; > > + /* Wait until the next pointer is set */ > > + while (!(next = ACCESS_ONCE(node->next))) > > + arch_mutex_cpu_relax(); > > + } > > + ACCESS_ONCE(next->locked) = 1; > > + smp_wmb(); > > Shouldn't the memory barrier precede the "ACCESS_ONCE(next->locked) = 1;"? > Maybe in an "else" clause of the prior "if" statement, given that the > cmpxchg() does it otherwise. > > Otherwise, in the case where the "if" conditionn is false, the critical > section could bleed out past the unlock. > > Thanx, Paul > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
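For reference, the acquisition path as posted, with the implied ordering annotated. The note on xchg() assumes the documented kernel rule that value-returning atomic operations imply a full memory barrier; this annotates the question being asked rather than settling it.

	/* Init node */
	node->locked = 0;
	node->next   = NULL;

	prev = xchg(lock, node);	/* already a full barrier: the node is
					 * initialized before any other CPU can
					 * reach it through *lock */
	if (likely(prev == NULL)) {
		node->locked = 1;	/* uncontended fast path */
		return;
	}
	ACCESS_ONCE(prev->next) = node;	/* publish ourselves to the old tail */
	smp_wmb();			/* only orders stores against later
					 * stores, and all that follows here is
					 * the read-only spin loop */
	while (!ACCESS_ONCE(node->locked))
		arch_mutex_cpu_relax();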
Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path
On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote: > On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote: > > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long wrote: > > > > > > On a large NUMA machine, it is entirely possible that a fairly large > > > number of threads are queuing up in the ticket spinlock queue to do > > > the wakeup operation. In fact, only one will be needed. This patch > > > tries to reduce spinlock contention by doing just that. > > > > > > A new wakeup field is added to the rwsem structure. This field is > > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a > > > thread is pending to do the wakeup call. It is cleared on exit from > > > those functions. > > > > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm > > wondering what the performance difference between the two are. > > Both approaches should be complementary. The idea of optimistic spinning > in rwsems is to avoid putting putting the writer on the wait queue - > reducing contention and giving a greater chance for the rwsem > to get acquired. Waiman's approach is once the blocking actually occurs, > and at this point I'm not sure how this will affect writer stealing > logic. > I agree with the view that the two approaches are complementary to each other. They address different bottleneck areas in the rwsem. Here're the performance numbers for exim workload compared to a vanilla kernel. Waimain's patch:+2.0% Alex+Tim's patchset:+4.8% Waiman+Alex+Tim:+5.3% Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Fri, 2013-09-27 at 13:38 -0700, Paul E. McKenney wrote: > On Fri, Sep 27, 2013 at 12:38:53PM -0700, Tim Chen wrote: > > On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote: > > > On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote: > > > > We will need the MCS lock code for doing optimistic spinning for rwsem. > > > > Extracting the MCS code from mutex.c and put into its own file allow us > > > > to reuse this code easily for rwsem. > > > > > > > > Signed-off-by: Tim Chen > > > > Signed-off-by: Davidlohr Bueso > > > > --- > > > > include/linux/mcslock.h | 58 > > > > +++ > > > > kernel/mutex.c | 58 > > > > +- > > > > 2 files changed, 65 insertions(+), 51 deletions(-) > > > > create mode 100644 include/linux/mcslock.h > > > > > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > > > new file mode 100644 > > > > index 000..20fd3f0 > > > > --- /dev/null > > > > +++ b/include/linux/mcslock.h > > > > @@ -0,0 +1,58 @@ > > > > +/* > > > > + * MCS lock defines > > > > + * > > > > + * This file contains the main data structure and API definitions of > > > > MCS lock. > > > > + */ > > > > +#ifndef __LINUX_MCSLOCK_H > > > > +#define __LINUX_MCSLOCK_H > > > > + > > > > +struct mcs_spin_node { > > > > + struct mcs_spin_node *next; > > > > + int locked; /* 1 if lock acquired */ > > > > +}; > > > > + > > > > +/* > > > > + * We don't inline mcs_spin_lock() so that perf can correctly account > > > > for the > > > > + * time spent in this lock function. > > > > + */ > > > > +static noinline > > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node > > > > *node) > > > > +{ > > > > + struct mcs_spin_node *prev; > > > > + > > > > + /* Init node */ > > > > + node->locked = 0; > > > > + node->next = NULL; > > > > + > > > > + prev = xchg(lock, node); > > > > + if (likely(prev == NULL)) { > > > > + /* Lock acquired */ > > > > + node->locked = 1; > > > > + return; > > > > + } > > > > + ACCESS_ONCE(prev->next) = node; > > > > + smp_wmb(); > > > > BTW, is the above memory barrier necessary? It seems like the xchg > > instruction already provided a memory barrier. > > > > Now if we made the changes that Jason suggested: > > > > > > /* Init node */ > > - node->locked = 0; > > node->next = NULL; > > > > prev = xchg(lock, node); > > if (likely(prev == NULL)) { > > /* Lock acquired */ > > - node->locked = 1; > > return; > > } > > + node->locked = 0; > > ACCESS_ONCE(prev->next) = node; > > smp_wmb(); > > > > We are probably still okay as other cpus do not read the value of > > node->locked, which is a local variable. > > I don't immediately see the need for the smp_wmb() in either case. Thinking a bit more, the following could happen in Jason's initial patch proposal. In this case variable "prev" referenced by CPU1 points to "node" referenced by CPU2 CPU 1 (calling lock)CPU 2 (calling unlock) ACCESS_ONCE(prev->next) = node *next = ACCESS_ONCE(node->next); ACCESS_ONCE(next->locked) = 1; node->locked = 0; Then we will be spinning forever on CPU1 as we overwrite the lock passed from CPU2 before we check it. The original code assign "node->locked = 0" before xchg does not have this issue. 
Doing the following change of moving smp_wmb immediately after node->locked assignment (suggested by Jason) node->locked = 0; smp_wmb(); ACCESS_ONCE(prev->next) = node; could avoid the problem, but will need closer scrutiny to see if there are other pitfalls if wmb happen before ACCESS_ONCE(prev->next) = node; > > > > > > + /* Wait until the lock holder passes the lock down */ > > > > + while (!ACCESS_ONCE(node->locked)) > > > > + arch_mutex_cpu_relax(); > > However, you do need a full memory barrier here in order to ensure that > you see the effects of the previous lock holder's critical section. Is it necessary to add a memory barrier after acquiring the lock if the previous lock holder execute smp_wmb before passing the lock? Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
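The open question at the end can also be laid out as an interleaving. Even if the previous owner issues smp_wmb() before the handoff, that only constrains the previous owner's stores; the new owner's later loads still need ordering against its spin. "shared" is a hypothetical variable protected by the lock.

/*
 *   previous owner                         new owner
 *   --------------                         ---------
 *   shared = 1;                            while (!ACCESS_ONCE(node->locked))
 *   smp_wmb();                                     arch_mutex_cpu_relax();
 *   ACCESS_ONCE(next->locked) = 1;         r = shared;
 *
 * Nothing on the right stops the CPU from speculating the "r = shared" load
 * before the loop's loads of node->locked have completed; a load-to-load
 * control dependency is not an ordering guarantee.  That is why Paul asks
 * for a full barrier on the acquiring side, which a later patch in this
 * thread adds as smp_mb() after the spin loop.
 */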
Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Fri, 2013-09-27 at 19:19 -0700, Paul E. McKenney wrote: > On Fri, Sep 27, 2013 at 04:54:06PM -0700, Jason Low wrote: > > On Fri, Sep 27, 2013 at 4:01 PM, Paul E. McKenney > > wrote: > > > Yep. The previous lock holder's smp_wmb() won't keep either the compiler > > > or the CPU from reordering things for the new lock holder. They could for > > > example reorder the critical section to precede the node->locked check, > > > which would be very bad. > > > > Paul, Tim, Longman, > > > > How would you like the proposed changes below? > > Could you point me at what this applies to? I can find flaws looking > at random pieces, given a little luck, but at some point I need to look > at the whole thing. ;-) > > Thanx, Paul Jason's patch is on top of the following patchset: https://lkml.org/lkml/2013/9/26/674 Thanks. Tim > > > --- > > Subject: [PATCH] MCS: optimizations and barrier corrections > > > > Delete the node->locked = 1 assignment if the lock is free as it won't be > > used. > > > > Delete the smp_wmb() in mcs_spin_lock() and add a full memory barrier at the > > end of the mcs_spin_lock() function. As Paul McKenney suggested, "you do > > need a > > full memory barrier here in order to ensure that you see the effects of the > > previous lock holder's critical section." And in the mcs_spin_unlock(), > > move the > > memory barrier so that it is before the "ACCESS_ONCE(next->locked) = 1;". > > > > Signed-off-by: Jason Low > > Signed-off-by: Paul E. McKenney > > Signed-off-by: Tim Chen > > --- > > include/linux/mcslock.h |7 +++ > > 1 files changed, 3 insertions(+), 4 deletions(-) > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h > > index 20fd3f0..edd57d2 100644 > > --- a/include/linux/mcslock.h > > +++ b/include/linux/mcslock.h > > @@ -26,15 +26,14 @@ void mcs_spin_lock(struct mcs_spin_node **lock, > > struct mcs_spin_node *node) > > > > prev = xchg(lock, node); > > if (likely(prev == NULL)) { > > - /* Lock acquired */ > > - node->locked = 1; > > + /* Lock acquired. No need to set node->locked since it > > won't be used */ > > return; > > } > > ACCESS_ONCE(prev->next) = node; > > - smp_wmb(); > > /* Wait until the lock holder passes the lock down */ > > while (!ACCESS_ONCE(node->locked)) > > arch_mutex_cpu_relax(); > > + smp_mb(); > > } > > > > static void mcs_spin_unlock(struct mcs_spin_node **lock, struct > > mcs_spin_node *node) > > @@ -51,8 +50,8 @@ static void mcs_spin_unlock(struct mcs_spin_node > > **lock, struct mcs_spin_node *n > > while (!(next = ACCESS_ONCE(node->next))) > > arch_mutex_cpu_relax(); > > } > > - ACCESS_ONCE(next->locked) = 1; > > smp_wmb(); > > + ACCESS_ONCE(next->locked) = 1; > > } > > > > #endif > > -- > > 1.7.1 > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t
On Sat, 2013-09-28 at 21:52 +0200, Ingo Molnar wrote: > * Linus Torvalds wrote: > > > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar wrote: > > > > > > - down_write_nest_lock(&anon_vma->root->rwsem, > > > &mm->mmap_sem); > > > + down_write_nest_lock(&anon_vma->root->rwlock, > > > &mm->mmap_sem); > > > > That's just completely bogus, and cannot work. > > > > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just > > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep > > issue. I'm not quite sure what's up with the nesting there. > > > > > - if (rwsem_is_locked(&anon_vma->root->rwsem)) { > > > + if (write_can_lock(&anon_vma->root->rwlock)) { > > > anon_vma_lock_write(anon_vma); > > > anon_vma_unlock_write(anon_vma); > > > } > > > > That's the wrong way around. It should be > > > > if (!write_can_lock(&anon_vma->root->rwlock)) { > > > > so some more testing definitely needed. > > Yeah, that silly API asymmetry has bitten me before as well :-/ > > The attached version booted up fine under 16-way KVM: > > sh-4.2# uptime > 19:50:08 up 0 min, 0 users, load average: 0.00, 0.00, 0.00 > > That's all the testing it will get this evening though. Patch should be > good enough for Tim to try? Here's the exim workload data: rwsem improvment: Waimain's patch:+2.0% Alex+Tim's patchset:+4.8% Waiman+Alex+Tim:+5.3% convert rwsem to rwlock_t for root anon_vma lock Ingo's patch+11.7% Yes, changing the anon-vma root lock to rwlock_t gives the most boost. However, I think that Waiman's patches, Alex's patches and my patches can still be combined together to improve scalability of rwsem, even if we should decide to convert this particular rwsem to rwlock_t. Thanks. Tim > > Thanks, > > Ingo > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t
On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > Here's the exim workload data: > > > > rwsem improvment: > > Waimain's patch:+2.0% > > Alex+Tim's patchset:+4.8% > > Waiman+Alex+Tim:+5.3% > > > > convert rwsem to rwlock_t for root anon_vma lock > > Ingo's patch+11.7% > > > > What happens if you stuff Waiman's qrwlock patches on top of that? > admittedly and oft mentioned in this thread, our current rwlock_t is > somewhat suboptimal under a number of conditions. I've tested with Waiman's qrwlock patches on top of Ingo's patches. It does not affect the throughput for exim and I still get about +11.7% throughput change (same as with Ingo's patch only). Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t
On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > On 09/30/2013 03:23 PM, Tim Chen wrote: > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > >>> Here's the exim workload data: > >>> > >>> rwsem improvment: > >>> Waimain's patch:+2.0% > >>> Alex+Tim's patchset:+4.8% > >>> Waiman+Alex+Tim:+5.3% > >>> > >>> convert rwsem to rwlock_t for root anon_vma lock > >>> Ingo's patch +11.7% > >>> > >> What happens if you stuff Waiman's qrwlock patches on top of that? > >> admittedly and oft mentioned in this thread, our current rwlock_t is > >> somewhat suboptimal under a number of conditions. > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > It does not affect the throughput for exim and I still get > > about +11.7% throughput change (same as with Ingo's patch only). > > > > Tim > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > to explicitly enable it. Have you done that when you build the test > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > lock is converted back to a rwlock. > Yes, I have explicitly enabled it during my testing. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t
On Mon, 2013-09-30 at 12:47 -0700, Tim Chen wrote: > On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote: > > On 09/30/2013 03:23 PM, Tim Chen wrote: > > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote: > > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote: > > >>> Here's the exim workload data: > > >>> > > >>> rwsem improvment: > > >>> Waimain's patch:+2.0% > > >>> Alex+Tim's patchset:+4.8% > > >>> Waiman+Alex+Tim:+5.3% > > >>> > > >>> convert rwsem to rwlock_t for root anon_vma lock > > >>> Ingo's patch+11.7% > > >>> > > >> What happens if you stuff Waiman's qrwlock patches on top of that? > > >> admittedly and oft mentioned in this thread, our current rwlock_t is > > >> somewhat suboptimal under a number of conditions. > > > I've tested with Waiman's qrwlock patches on top of Ingo's patches. > > > It does not affect the throughput for exim and I still get > > > about +11.7% throughput change (same as with Ingo's patch only). > > > > > > Tim > > > > > > > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig > > to explicitly enable it. Have you done that when you build the test > > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma > > lock is converted back to a rwlock. > > > > Yes, I have explicitly enabled it during my testing. > The workload I have is dominated by writer locks, with only very occasional readers. So it may not benefit from the various tweaks you have put in qrwlock. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v5 0/6] rwsem: performance optimizations
We have incorporated various suggestions from Ingo for version 5 of this patchset and will like to have it merged if there are no objections. In this patchset, we introduce two categories of optimizations to read write semaphore. The first four patches from Alex Shi reduce cache bouncing of the sem->count field by doing a pre-read of the sem->count and avoid cmpxchg if possible. The last two patches introduce similar optimistic spinning logic as the mutex code for the writer lock acquisition of rwsem. Without these optimizations, Davidlohr Bueso saw a -8% regression to aim7's shared and high_systime workloads when he switched i_mmap_mutex to rwsem. Tests were on 8 socket 80 cores system. With the patchset, he get significant improvements to the aim7 suite instead of regressions: alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%), shared (+18.4%) and short (+6.3%). Tim Chen also got a +5% improvements to exim mail server workload on a 40 core system. Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing this patchset. Regards, Tim Chen Changelog: v5: 1. Try optimistic spinning before we put the writer on the wait queue to avoid bottlenecking at wait queue. This provides 5% boost to exim workload and between 2% to 8% boost to aim7. 2. Put MCS locking code into its own mcslock.h file for better reuse between mutex.c and rwsem.c 3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the operations default per Ingo's suggestions. v4: 1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner 2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option v3: 1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner. 2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig v2: 1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed a bug referencing &sem->count when sem->count is intended. 2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner. the option to be on for more seasoning but can be turned off should it be detrimental. 3. Various patch comments update Alex Shi (4): rwsem: check the lock before cpmxchg in down_write_trylock rwsem: remove 'out' label in do_wake rwsem: remove try_reader_grant label do_wake rwsem/wake: check lock before do atomic update Tim Chen (2): MCS Lock: Restructure the MCS lock defines and locking code into its own file rwsem: do optimistic spinning for writer lock acquisition include/asm-generic/rwsem.h |8 +- include/linux/rwsem.h |8 +- kernel/mutex.c | 58 ++-- kernel/rwsem.c | 19 - lib/rwsem.c | 218 ++- 5 files changed, 228 insertions(+), 83 deletions(-) -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
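The first four patches all apply the same idea, sketched below in generic form: take a plain, shared read of the word first, and only issue the exclusive (cache-line-stealing) atomic when that read says it can succeed. The names here are illustrative, not from the patchset.

static inline int try_grab(long *count)
{
	/* shared read: on contention the cache line can stay in shared
	 * state instead of bouncing between cores on failed cmpxchg */
	if (*count != 0)
		return 0;

	/* only now pay for exclusive ownership of the line; the cmpxchg
	 * still revalidates, so the unlocked read cannot admit a wrong
	 * result, it merely filters out attempts that would fail anyway */
	return cmpxchg(count, 0, 1) == 0;
}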
[PATCH v5 3/6] rwsem: remove try_reader_grant label in do_wake
That make code simple and more readable Signed-off-by: Alex Shi --- lib/rwsem.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 42f1b1a..a8055cf 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) adjustment = 0; if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; - try_reader_grant: - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) { - /* A writer stole the lock. Undo our reader grant. */ + while (1) { + oldcount = rwsem_atomic_update(adjustment, sem) + - adjustment; + if (likely(oldcount >= RWSEM_WAITING_BIAS)) + break; + +/* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) return sem; /* Last active locker left. Retry waking readers. */ - goto try_reader_grant; } } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v5 4/6] rwsem/wake: check lock before doing atomic update
Atomic update lock and roll back will cause cache bouncing in large machine. A lock status pre-read can relieve this problem Suggested-by: Davidlohr bueso Suggested-by: Tim Chen Signed-off-by: Alex Shi --- lib/rwsem.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index a8055cf..1d6e6e8 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) struct rwsem_waiter *waiter; struct task_struct *tsk; struct list_head *next; - long oldcount, woken, loop, adjustment; + long woken, loop, adjustment; waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); if (waiter->type == RWSEM_WAITING_FOR_WRITE) { @@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; while (1) { + long oldcount; + + /* A writer stole the lock. */ + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) + return sem; + oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; if (likely(oldcount >= RWSEM_WAITING_BIAS)) -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v5 6/6] rwsem: do optimistic spinning for writer lock acquisition
We want to add optimistic spinning to rwsems because the writer rwsem does not perform as well as mutexes. Tim noticed that for exim (mail server) workloads, when reverting commit 4fc3f1d6 and Davidlohr noticed it when converting the i_mmap_mutex to a rwsem in some aim7 workloads. We've noticed that the biggest difference is when we fail to acquire a mutex in the fastpath, optimistic spinning comes in to play and we can avoid a large amount of unnecessary sleeping and overhead of moving tasks in and out of wait queue. Allowing optimistic spinning before putting the writer on the wait queue reduces wait queue contention and provided greater chance for the rwsem to get acquired. With these changes, rwsem is on par with mutex. Reviewed-by: Peter Zijlstra Reviewed-by: Peter Hurley Reviewed-by: Ingo Molnar Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/rwsem.h |8 ++- kernel/rwsem.c| 19 +- lib/rwsem.c | 193 + 3 files changed, 201 insertions(+), 19 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 0616ffe..4dfc6ae 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -29,6 +29,8 @@ struct rw_semaphore { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif + struct task_struct *owner; + void*spin_mlock; }; extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem); @@ -58,8 +60,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #define __RWSEM_INITIALIZER(name) \ { RWSEM_UNLOCKED_VALUE, \ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ - LIST_HEAD_INIT((name).wait_list) \ - __RWSEM_DEP_MAP_INIT(name) } + LIST_HEAD_INIT((name).wait_list), \ + __RWSEM_DEP_MAP_INIT(name)\ + NULL, \ + NULL } #define DECLARE_RWSEM(name) \ struct rw_semaphore name = __RWSEM_INITIALIZER(name) diff --git a/kernel/rwsem.c b/kernel/rwsem.c index cfff143..d74d1c9 100644 --- a/kernel/rwsem.c +++ b/kernel/rwsem.c @@ -12,6 +12,16 @@ #include +static inline void rwsem_set_owner(struct rw_semaphore *sem) +{ + sem->owner = current; +} + +static inline void rwsem_clear_owner(struct rw_semaphore *sem) +{ + sem->owner = NULL; +} + /* * lock for reading */ @@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem) rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write); @@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem) { int ret = __down_write_trylock(sem); - if (ret == 1) + if (ret == 1) { rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_); + rwsem_set_owner(sem); + } return ret; } @@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem) rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(up_write); @@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem) * dependency. 
*/ __downgrade_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(downgrade_write); @@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest) rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(_down_write_nest_lock); @@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass) rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write_nested); diff --git a/lib/rwsem.c b/lib/rwsem.c index 1d6e6e8..1752ec9 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include /* * Initialize an rwsem: @@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, sem->count = RWSEM_UNLOCKED_VALUE; raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); + sem->owner = NULL; + sem->spin_mlock = NULL; } EXPORT_SYMBOL(__init_rwsem); @@ -194,14 +198,177 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem) return sem; } +static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem) +{ + if (!(count & RWSEM_ACTIVE_MASK)) { + /* Try acquiring the write lock. */ + if (se
[PATCH v5 2/6] rwsem: remove 'out' label in do_wake
That make code simple and more readable. Signed-off-by: Alex Shi --- lib/rwsem.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 19c5fa9..42f1b1a 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * will block as they will notice the queued writer. */ wake_up_process(waiter->task); - goto out; + return sem; } /* Writers might steal the lock before we grant it to the next reader. @@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) /* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) - goto out; + return sem; /* Last active locker left. Retry waking readers. */ goto try_reader_grant; } @@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) sem->wait_list.next = next; next->prev = &sem->wait_list; - out: return sem; } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v5 1/6] rwsem: check the lock before cmpxchg in down_write_trylock
Cmpxchg will cause the cacheline bouning when do the value checking, that cause scalability issue in a large machine (like a 80 core box). So a lock pre-read can relief this contention. Signed-off-by: Alex Shi --- include/asm-generic/rwsem.h |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h index bb1e2cd..5ba80e7 100644 --- a/include/asm-generic/rwsem.h +++ b/include/asm-generic/rwsem.h @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem) static inline int __down_write_trylock(struct rw_semaphore *sem) { - long tmp; + if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE)) + return 0; - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, - RWSEM_ACTIVE_WRITE_BIAS); - return tmp == RWSEM_UNLOCKED_VALUE; + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE; } /* -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v5 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
We will need the MCS lock code for doing optimistic spinning for rwsem. Extracting the MCS code from mutex.c and put into its own file allow us to reuse this code easily for rwsem. Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- kernel/mutex.c | 58 ++- 1 files changed, 7 insertions(+), 51 deletions(-) diff --git a/kernel/mutex.c b/kernel/mutex.c index 6d647ae..1b6ba3f 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -25,6 +25,7 @@ #include #include #include +#include /* * In the DEBUG case we are using the "NULL fastpath" for mutexes, @@ -111,54 +112,9 @@ EXPORT_SYMBOL(mutex_lock); * more or less simultaneously, the spinners need to acquire a MCS lock * first before spinning on the owner field. * - * We don't inline mspin_lock() so that perf can correctly account for the - * time spent in this lock function. */ -struct mspin_node { - struct mspin_node *next ; - int locked; /* 1 if lock acquired */ -}; -#defineMLOCK(mutex)((struct mspin_node **)&((mutex)->spin_mlock)) -static noinline -void mspin_lock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *prev; - - /* Init node */ - node->locked = 0; - node->next = NULL; - - prev = xchg(lock, node); - if (likely(prev == NULL)) { - /* Lock acquired */ - node->locked = 1; - return; - } - ACCESS_ONCE(prev->next) = node; - smp_wmb(); - /* Wait until the lock holder passes the lock down */ - while (!ACCESS_ONCE(node->locked)) - arch_mutex_cpu_relax(); -} - -static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *next = ACCESS_ONCE(node->next); - - if (likely(!next)) { - /* -* Release the lock by setting it to NULL -*/ - if (cmpxchg(lock, node, NULL) == node) - return; - /* Wait until the next pointer is set */ - while (!(next = ACCESS_ONCE(node->next))) - arch_mutex_cpu_relax(); - } - ACCESS_ONCE(next->locked) = 1; - smp_wmb(); -} +#defineMLOCK(mutex)((struct mcs_spin_node **)&((mutex)->spin_mlock)) /* * Mutex spinning code migrated from kernel/sched/core.c @@ -448,7 +404,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, for (;;) { struct task_struct *owner; - struct mspin_node node; + struct mcs_spin_node node; if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) { struct ww_mutex *ww; @@ -470,10 +426,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, * If there's an owner, wait for it to either * release the lock or go to sleep. */ - mspin_lock(MLOCK(lock), &node); + mcs_spin_lock(MLOCK(lock), &node); owner = ACCESS_ONCE(lock->owner); if (owner && !mutex_spin_on_owner(lock, owner)) { - mspin_unlock(MLOCK(lock), &node); + mcs_spin_unlock(MLOCK(lock), &node); goto slowpath; } @@ -488,11 +444,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, } mutex_set_owner(lock); - mspin_unlock(MLOCK(lock), &node); + mcs_spin_unlock(MLOCK(lock), &node); preempt_enable(); return 0; } - mspin_unlock(MLOCK(lock), &node); + mcs_spin_unlock(MLOCK(lock), &node); /* * When there's no owner, we might have preempted between the -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 1/6] rwsem: check the lock before cmpxchg in down_write_trylock
On Tue, 2013-09-24 at 16:22 -0700, Jason Low wrote: > Should we do something similar with __down_read_trylock, such as > the following? > > > Signed-off-by: Jason Low > --- > include/asm-generic/rwsem.h |3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h > index bb1e2cd..47990dc 100644 > --- a/include/asm-generic/rwsem.h > +++ b/include/asm-generic/rwsem.h > @@ -42,6 +42,9 @@ static inline int __down_read_trylock(struct > rw_semaphore *sem) > long tmp; > > while ((tmp = sem->count) >= 0) { > + if (sem->count != tmp) > + continue; > + Considering that tmp has just been assigned the value of sem->count, the added if check failure is unlikely and probably not needed. We should proceed to cmpxchg below. > if (tmp == cmpxchg(&sem->count, tmp, >tmp + RWSEM_ACTIVE_READ_BIAS)) { > return 1; Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
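For reference, the generic __down_read_trylock under discussion already does the cheap pre-read in its loop condition, which is the point above: tmp is loaded from sem->count one line earlier, so re-checking sem->count against tmp filters out almost nothing that the cmpxchg would not catch on its own. Annotated excerpt:

	long tmp;

	while ((tmp = sem->count) >= 0) {	/* pre-read: bail out cheaply
						 * while a writer holds it */
		if (tmp == cmpxchg(&sem->count, tmp,
				   tmp + RWSEM_ACTIVE_READ_BIAS))
			return 1;		/* reader bias taken */
	}
	return 0;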
Re: [PATCH v5 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
On Wed, 2013-09-25 at 07:55 +0200, Peter Zijlstra wrote: > On Tue, Sep 24, 2013 at 03:22:46PM -0700, Tim Chen wrote: > > We will need the MCS lock code for doing optimistic spinning for rwsem. > > Extracting the MCS code from mutex.c and put into its own file allow us > > to reuse this code easily for rwsem. > > > > Signed-off-by: Tim Chen > > Signed-off-by: Davidlohr Bueso > > --- > > kernel/mutex.c | 58 > > ++- > > 1 files changed, 7 insertions(+), 51 deletions(-) > > Wasn't this patch supposed to add include/linux/mcslock.h ? Thanks for catching it. I will correct it. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v6 0/6] rwsem: performance optimizations
We fixed a missing file and fixed various style issues for version 6 of this patchset. We will like to have it merged if there are no objections. In this patchset, we introduce two categories of optimizations to read write semaphore. The first four patches from Alex Shi reduce cache bouncing of the sem->count field by doing a pre-read of the sem->count and avoid cmpxchg if possible. The last two patches introduce similar optimistic spinning logic as the mutex code for the writer lock acquisition of rwsem. This addresses the general 'mutexes out perform writer-rwsems' situations that has been seen in more than one case. Users now need not worry about performance issues when choosing between these two locking mechanisms. Without these optimizations, Davidlohr Bueso saw a -8% regression to aim7's shared and high_systime workloads when he switched i_mmap_mutex to rwsem. Tests were on 8 socket 80 cores system. With the patchset, he got significant improvements to the aim7 suite instead of regressions: alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%), shared (+18.4%) and short (+6.3%). Tim Chen also got a +5% improvements to exim mail server workload on a 40 core system. Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing this patchset. Regards, Tim Chen Changelog: v6: 1. Fix missing mcslock.h file. 2. Fix various code style issues. v5: 1. Try optimistic spinning before we put the writer on the wait queue to avoid bottlenecking at wait queue. This provides 5% boost to exim workload and between 2% to 8% boost to aim7. 2. Put MCS locking code into its own mcslock.h file for better reuse between mutex.c and rwsem.c 3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the operations default per Ingo's suggestions. v4: 1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner 2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option v3: 1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner. 2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig v2: 1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed a bug referencing &sem->count when sem->count is intended. 2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner. the option to be on for more seasoning but can be turned off should it be detrimental. 3. Various patch comments update Alex Shi (4): rwsem: check the lock before cpmxchg in down_write_trylock rwsem: remove 'out' label in do_wake rwsem: remove try_reader_grant label do_wake rwsem/wake: check lock before do atomic update Tim Chen (2): MCS Lock: Restructure the MCS lock defines and locking code into its own file rwsem: do optimistic spinning for writer lock acquisition include/asm-generic/rwsem.h |8 +- include/linux/mcslock.h | 58 +++ include/linux/rwsem.h |6 +- kernel/mutex.c | 58 ++-- kernel/rwsem.c | 19 - lib/rwsem.c | 228 +- 6 files changed, 292 insertions(+), 85 deletions(-) create mode 100644 include/linux/mcslock.h -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
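In outline, the writer-side change in patch 6/6 looks like the sketch below. Details such as rwsem_can_spin_on_owner(), preemption handling and the queueing of spinners behind the MCS lock from patch 5/6 are omitted, so treat this as the shape of the idea rather than the patch itself.

/*
 *   down_write() slowpath with optimistic spinning:
 *
 *   while (the lock is write-held and its owner is running on a CPU)
 *           spin;                    the holder is likely to release soon,
 *                                    so don't pay for sleep + wakeup
 *
 *   if (the lock became free and the write trylock succeeds)
 *           return;                  acquired without touching wait_lock
 *                                    or the wait queue at all
 *
 *   otherwise fall back to the existing path: queue on the wait list
 *   and sleep.
 */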
[PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file
We will need the MCS lock code for doing optimistic spinning for rwsem. Extracting the MCS code from mutex.c and put into its own file allow us to reuse this code easily for rwsem. Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/mcslock.h | 58 +++ kernel/mutex.c | 58 +- 2 files changed, 65 insertions(+), 51 deletions(-) create mode 100644 include/linux/mcslock.h diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h new file mode 100644 index 000..20fd3f0 --- /dev/null +++ b/include/linux/mcslock.h @@ -0,0 +1,58 @@ +/* + * MCS lock defines + * + * This file contains the main data structure and API definitions of MCS lock. + */ +#ifndef __LINUX_MCSLOCK_H +#define __LINUX_MCSLOCK_H + +struct mcs_spin_node { + struct mcs_spin_node *next; + int locked; /* 1 if lock acquired */ +}; + +/* + * We don't inline mcs_spin_lock() so that perf can correctly account for the + * time spent in this lock function. + */ +static noinline +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node) +{ + struct mcs_spin_node *prev; + + /* Init node */ + node->locked = 0; + node->next = NULL; + + prev = xchg(lock, node); + if (likely(prev == NULL)) { + /* Lock acquired */ + node->locked = 1; + return; + } + ACCESS_ONCE(prev->next) = node; + smp_wmb(); + /* Wait until the lock holder passes the lock down */ + while (!ACCESS_ONCE(node->locked)) + arch_mutex_cpu_relax(); +} + +static void mcs_spin_unlock(struct mcs_spin_node **lock, struct mcs_spin_node *node) +{ + struct mcs_spin_node *next = ACCESS_ONCE(node->next); + + if (likely(!next)) { + /* +* Release the lock by setting it to NULL +*/ + if (cmpxchg(lock, node, NULL) == node) + return; + /* Wait until the next pointer is set */ + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + } + ACCESS_ONCE(next->locked) = 1; + smp_wmb(); +} + +#endif diff --git a/kernel/mutex.c b/kernel/mutex.c index 6d647ae..1b6ba3f 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -25,6 +25,7 @@ #include #include #include +#include /* * In the DEBUG case we are using the "NULL fastpath" for mutexes, @@ -111,54 +112,9 @@ EXPORT_SYMBOL(mutex_lock); * more or less simultaneously, the spinners need to acquire a MCS lock * first before spinning on the owner field. * - * We don't inline mspin_lock() so that perf can correctly account for the - * time spent in this lock function. 
*/ -struct mspin_node { - struct mspin_node *next ; - int locked; /* 1 if lock acquired */ -}; -#defineMLOCK(mutex)((struct mspin_node **)&((mutex)->spin_mlock)) -static noinline -void mspin_lock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *prev; - - /* Init node */ - node->locked = 0; - node->next = NULL; - - prev = xchg(lock, node); - if (likely(prev == NULL)) { - /* Lock acquired */ - node->locked = 1; - return; - } - ACCESS_ONCE(prev->next) = node; - smp_wmb(); - /* Wait until the lock holder passes the lock down */ - while (!ACCESS_ONCE(node->locked)) - arch_mutex_cpu_relax(); -} - -static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node) -{ - struct mspin_node *next = ACCESS_ONCE(node->next); - - if (likely(!next)) { - /* -* Release the lock by setting it to NULL -*/ - if (cmpxchg(lock, node, NULL) == node) - return; - /* Wait until the next pointer is set */ - while (!(next = ACCESS_ONCE(node->next))) - arch_mutex_cpu_relax(); - } - ACCESS_ONCE(next->locked) = 1; - smp_wmb(); -} +#defineMLOCK(mutex)((struct mcs_spin_node **)&((mutex)->spin_mlock)) /* * Mutex spinning code migrated from kernel/sched/core.c @@ -448,7 +404,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, for (;;) { struct task_struct *owner; - struct mspin_node node; + struct mcs_spin_node node; if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 0) { struct ww_mutex *ww; @@ -470,10 +426,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
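Usage pattern for the new header, as the mutex conversion above shows: the lock word is a single struct mcs_spin_node * tail pointer, and every contender supplies its own node (typically on its stack), so each waiter spins on a cache line it owns rather than on the shared lock word. The names my_lock and my_critical_section are illustrative only.

static struct mcs_spin_node *my_lock;		/* NULL when unlocked */

static void my_critical_section(void)
{
	struct mcs_spin_node node;		/* per-contender queue node,
						 * lives on this task's stack */

	mcs_spin_lock(&my_lock, &node);
	/* ... short protected region, e.g. spinning on an owner field ... */
	mcs_spin_unlock(&my_lock, &node);
}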
[PATCH v6 3/6] rwsem: remove try_reader_grant label in do_wake
That make code simple and more readable Signed-off-by: Alex Shi --- lib/rwsem.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 42f1b1a..a8055cf 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) adjustment = 0; if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; - try_reader_grant: - oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; - if (unlikely(oldcount < RWSEM_WAITING_BIAS)) { - /* A writer stole the lock. Undo our reader grant. */ + while (1) { + oldcount = rwsem_atomic_update(adjustment, sem) + - adjustment; + if (likely(oldcount >= RWSEM_WAITING_BIAS)) + break; + +/* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) return sem; /* Last active locker left. Retry waking readers. */ - goto try_reader_grant; } } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v6 6/6] rwsem: do optimistic spinning for writer lock acquisition
We want to add optimistic spinning to rwsems because the writer rwsem does not perform as well as mutexes. Tim noticed that for exim (mail server) workloads, when reverting commit 4fc3f1d6 and Davidlohr noticed it when converting the i_mmap_mutex to a rwsem in some aim7 workloads. We've noticed that the biggest difference is when we fail to acquire a mutex in the fastpath, optimistic spinning comes in to play and we can avoid a large amount of unnecessary sleeping and overhead of moving tasks in and out of wait queue. Allowing optimistic spinning before putting the writer on the wait queue reduces wait queue contention and provided greater chance for the rwsem to get acquired. With these changes, rwsem is on par with mutex. Reviewed-by: Ingo Molnar Reviewed-by: Peter Zijlstra Reviewed-by: Peter Hurley Signed-off-by: Tim Chen Signed-off-by: Davidlohr Bueso --- include/linux/rwsem.h |6 +- kernel/rwsem.c| 19 +- lib/rwsem.c | 203 - 3 files changed, 207 insertions(+), 21 deletions(-) diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 0616ffe..ef5a83a 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -26,6 +26,8 @@ struct rw_semaphore { longcount; raw_spinlock_t wait_lock; struct list_headwait_list; + struct task_struct *owner; /* write owner */ + void*spin_mlock; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif @@ -58,7 +60,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) #define __RWSEM_INITIALIZER(name) \ { RWSEM_UNLOCKED_VALUE, \ __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ - LIST_HEAD_INIT((name).wait_list) \ + LIST_HEAD_INIT((name).wait_list), \ + NULL, \ + NULL \ __RWSEM_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(name) \ diff --git a/kernel/rwsem.c b/kernel/rwsem.c index cfff143..d74d1c9 100644 --- a/kernel/rwsem.c +++ b/kernel/rwsem.c @@ -12,6 +12,16 @@ #include +static inline void rwsem_set_owner(struct rw_semaphore *sem) +{ + sem->owner = current; +} + +static inline void rwsem_clear_owner(struct rw_semaphore *sem) +{ + sem->owner = NULL; +} + /* * lock for reading */ @@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem) rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write); @@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem) { int ret = __down_write_trylock(sem); - if (ret == 1) + if (ret == 1) { rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_); + rwsem_set_owner(sem); + } return ret; } @@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem) rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(up_write); @@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem) * dependency. 
*/ __downgrade_write(sem); + rwsem_clear_owner(sem); } EXPORT_SYMBOL(downgrade_write); @@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest) rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(_down_write_nest_lock); @@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass) rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(sem, __down_write_trylock, __down_write); + rwsem_set_owner(sem); } EXPORT_SYMBOL(down_write_nested); diff --git a/lib/rwsem.c b/lib/rwsem.c index 1d6e6e8..9535ef7 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include /* * Initialize an rwsem: @@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, sem->count = RWSEM_UNLOCKED_VALUE; raw_spin_lock_init(&sem->wait_lock); INIT_LIST_HEAD(&sem->wait_list); + sem->owner = NULL; + sem->spin_mlock = NULL; } EXPORT_SYMBOL(__init_rwsem); @@ -194,14 +198,179 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem) return sem; } +static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem) +{ + if (!(count & RWSEM_ACTIVE_MASK)) { + /* Try acquiring the write lock. */ + if (sem->count == RWSEM
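The owner-spinning helper itself falls outside the hunk shown above. The sketch below is modeled on the existing mutex_spin_on_owner() logic and is illustrative only; the function name is made up, and the real code additionally has to guard the owner dereference against the task going away (see the v2 changelog note on rwsem_can_spin_on_owner).

static noinline
bool sketch_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
{
	while (ACCESS_ONCE(sem->owner) == owner) {
		if (!owner->on_cpu || need_resched())
			return false;	/* owner not running, or we should
					 * yield: stop spinning and block */
		arch_mutex_cpu_relax();
	}
	return true;			/* owner changed or released the lock:
					 * worth retrying the write trylock */
}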
[PATCH v6 2/6] rwsem: remove 'out' label in do_wake
That make code simple and more readable. Signed-off-by: Alex Shi --- lib/rwsem.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index 19c5fa9..42f1b1a 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) * will block as they will notice the queued writer. */ wake_up_process(waiter->task); - goto out; + return sem; } /* Writers might steal the lock before we grant it to the next reader. @@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) /* A writer stole the lock. Undo our reader grant. */ if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK) - goto out; + return sem; /* Last active locker left. Retry waking readers. */ goto try_reader_grant; } @@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) sem->wait_list.next = next; next->prev = &sem->wait_list; - out: return sem; } -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v6 4/6] rwsem/wake: check lock before doing atomic update
Atomic update lock and roll back will cause cache bouncing in large machine. A lock status pre-read can relieve this problem Suggested-by: Davidlohr bueso Suggested-by: Tim Chen Signed-off-by: Alex Shi --- lib/rwsem.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/lib/rwsem.c b/lib/rwsem.c index a8055cf..1d6e6e8 100644 --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) struct rwsem_waiter *waiter; struct task_struct *tsk; struct list_head *next; - long oldcount, woken, loop, adjustment; + long woken, loop, adjustment; waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list); if (waiter->type == RWSEM_WAITING_FOR_WRITE) { @@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type) if (wake_type != RWSEM_WAKE_READ_OWNED) { adjustment = RWSEM_ACTIVE_READ_BIAS; while (1) { + long oldcount; + + /* A writer stole the lock. */ + if (unlikely(sem->count < RWSEM_WAITING_BIAS)) + return sem; + oldcount = rwsem_atomic_update(adjustment, sem) - adjustment; if (likely(oldcount >= RWSEM_WAITING_BIAS)) -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
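What the pre-read buys in the contended case, spelled out; before this patch the losing path performed two exclusive cache line operations just to discover that a writer had stolen the lock.

/*
 *   before:  rwsem_atomic_update(+adj)    exclusive access to sem->count
 *            notice a writer stole it
 *            rwsem_atomic_update(-adj)    second exclusive access, roll back
 *
 *   after:   sem->count < RWSEM_WAITING_BIAS ?   plain shared read
 *              yes: return immediately, no atomic issued at all
 *              no:  rwsem_atomic_update(+adj) as before, which still
 *                   revalidates and rolls back if the unlocked read raced
 */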
[PATCH v6 1/6] rwsem: check the lock before cmpxchg in down_write_trylock
Cmpxchg causes cacheline bouncing when doing the value check, which causes a scalability issue on a large machine (like an 80 core box). So a lock pre-read can relieve this contention. Signed-off-by: Alex Shi --- include/asm-generic/rwsem.h |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h index bb1e2cd..5ba80e7 100644 --- a/include/asm-generic/rwsem.h +++ b/include/asm-generic/rwsem.h @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem) static inline int __down_write_trylock(struct rw_semaphore *sem) { - long tmp; + if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE)) + return 0; - tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, - RWSEM_ACTIVE_WRITE_BIAS); - return tmp == RWSEM_UNLOCKED_VALUE; + return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, - RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE; } /* -- 1.7.4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
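The same reasoning, in userspace terms, for the trylock case: a plain load only needs the cacheline in shared state, while a failing cmpxchg still takes it exclusive, so contending CPUs keep invalidating each other's copies. A hedged sketch with invented names and constants (not the kernel's asm-generic code):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define UNLOCKED_VALUE 0L
    #define WRITE_BIAS     (-65535L)   /* stand-in for RWSEM_ACTIVE_WRITE_BIAS */

    static bool write_trylock_sketch(atomic_long *count)
    {
            long expected = UNLOCKED_VALUE;

            /* Cheap shared read first: if anyone holds the lock, skip the
             * cmpxchg entirely instead of failing it. */
            if (atomic_load_explicit(count, memory_order_relaxed) != UNLOCKED_VALUE)
                    return false;

            return atomic_compare_exchange_strong(count, &expected, WRITE_BIAS);
    }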
[PATCH 2/3 v2] mm: Reorg code to allow i_mmap_mutex acquisition to be done by caller of page_referenced & try_to_unmap
We reorganize the page_referenced and try_to_unmap code to determine explicitly if mapping->i_mmap_mutex needs to be acquired. This allows us to acquire the mutex for multiple pages in batch. We can call __page_referenced or __try_to_unmap with the mutex already acquired so the mutex doesn't have to be acquired multiple times. Tim --- Signed-off-by: Tim Chen --- diff --git a/include/linux/rmap.h b/include/linux/rmap.h index fd07c45..f1320b1 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -156,8 +156,11 @@ static inline void page_dup_rmap(struct page *page) /* * Called from mm/vmscan.c to handle paging out */ -int page_referenced(struct page *, int is_locked, - struct mem_cgroup *memcg, unsigned long *vm_flags); +int needs_page_mmap_mutex(struct page *page); +int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, + unsigned long *vm_flags); +int __page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, + unsigned long *vm_flags); int page_referenced_one(struct page *, struct vm_area_struct *, unsigned long address, unsigned int *mapcount, unsigned long *vm_flags); @@ -176,6 +179,7 @@ enum ttu_flags { bool is_vma_temporary_stack(struct vm_area_struct *vma); int try_to_unmap(struct page *, enum ttu_flags flags); +int __try_to_unmap(struct page *, enum ttu_flags flags); int try_to_unmap_one(struct page *, struct vm_area_struct *, unsigned long address, enum ttu_flags flags); diff --git a/mm/rmap.c b/mm/rmap.c index 5b5ad58..8be1799 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -843,8 +843,7 @@ static int page_referenced_file(struct page *page, * so we can safely take mapping->i_mmap_mutex. */ BUG_ON(!PageLocked(page)); - - mutex_lock(&mapping->i_mmap_mutex); + BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex)); /* * i_mmap_mutex does not stabilize mapcount at all, but mapcount @@ -869,21 +868,15 @@ static int page_referenced_file(struct page *page, break; } - mutex_unlock(&mapping->i_mmap_mutex); return referenced; } -/** - * page_referenced - test if the page was referenced - * @page: the page to test - * @is_locked: caller holds lock on the page - * @memcg: target memory cgroup - * @vm_flags: collect encountered vma->vm_flags who actually referenced the page - * - * Quick test_and_clear_referenced for all mappings to a page, - * returns the number of ptes which referenced the page. - */ -int page_referenced(struct page *page, +int needs_page_mmap_mutex(struct page *page) +{ + return page->mapping && !PageKsm(page) && !PageAnon(page); +} + +int __page_referenced(struct page *page, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags) @@ -919,6 +912,32 @@ out: return referenced; } +/** + * page_referenced - test if the page was referenced + * @page: the page to test + * @is_locked: caller holds lock on the page + * @memcg: target memory cgroup + * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * + * Quick test_and_clear_referenced for all mappings to a page, + * returns the number of ptes which referenced the page. 
+ */ +int page_referenced(struct page *page, + int is_locked, + struct mem_cgroup *memcg, + unsigned long *vm_flags) +{ + int result, needs_lock; + + needs_lock = needs_page_mmap_mutex(page); + if (needs_lock) + mutex_lock(&page->mapping->i_mmap_mutex); + result = __page_referenced(page, is_locked, memcg, vm_flags); + if (needs_lock) + mutex_unlock(&page->mapping->i_mmap_mutex); + return result; +} + static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, unsigned long address) { @@ -1560,7 +1579,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) unsigned long max_nl_size = 0; unsigned int mapcount; - mutex_lock(&mapping->i_mmap_mutex); + BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex)); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); if (address == -EFAULT) @@ -1640,7 +1659,24 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags) list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - mutex_unlock(&mapping->i_mmap_mutex); + return ret; +} + +int __try_to_unmap(struct page *page, enum ttu_flags flags) +{ + int ret; + + BU
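The shape of the interface change above, reduced to a generic sketch (the struct, field and function names below are invented for illustration; this is not the kernel code): the double-underscore variant assumes the caller already holds the mapping's mutex, while the plain variant keeps the old take-and-release behaviour for existing callers, so a batching caller can take the mutex once and call the __ variant repeatedly.

    #include <pthread.h>

    struct mapping_s {
            pthread_mutex_t mmap_lock;      /* plays the role of i_mmap_mutex */
            /* ... */
    };

    struct item_s {
            struct mapping_s *mapping;
            /* ... */
    };

    /* Caller must hold item->mapping->mmap_lock. */
    static int __check_item(struct item_s *item)
    {
            /* walk the structures protected by the mapping lock ... */
            return 0;
    }

    /* Convenience wrapper preserving the old one-shot locking behaviour. */
    static int check_item(struct item_s *item)
    {
            int ret;

            pthread_mutex_lock(&item->mapping->mmap_lock);
            ret = __check_item(item);
            pthread_mutex_unlock(&item->mapping->mmap_lock);
            return ret;
    }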
[PATCH 3/3 v2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex
In shrink_page_list, call to page_referenced_file and try_to_unmap will cause the acquisition/release of mapping->i_mmap_mutex for each page in the page list. However, it is very likely that successive pages in the list share the same mapping and we can reduce the frequency of i_mmap_mutex acquisition by holding the mutex in shrink_page_list before calling __page_referenced and __try_to_unmap. This improves the performance when the system has a lot page reclamations for file mapped pages if workloads are using a lot of memory for page cache. Tim --- Signed-off-by: Tim Chen Signed-off-by: Matthew Wilcox --- diff --git a/mm/vmscan.c b/mm/vmscan.c index d4ab646..0428639 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -786,7 +786,7 @@ static enum page_references page_check_references(struct page *page, int referenced_ptes, referenced_page; unsigned long vm_flags; - referenced_ptes = page_referenced(page, 1, mz->mem_cgroup, &vm_flags); + referenced_ptes = __page_referenced(page, 1, mz->mem_cgroup, &vm_flags); referenced_page = TestClearPageReferenced(page); /* Lumpy reclaim - ignore references */ @@ -856,6 +856,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, unsigned long nr_congested = 0; unsigned long nr_reclaimed = 0; unsigned long nr_writeback = 0; + struct mutex *i_mmap_mutex = NULL; cond_resched(); @@ -909,7 +910,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, } } + if (needs_page_mmap_mutex(page) && + i_mmap_mutex != &page->mapping->i_mmap_mutex) { + if (i_mmap_mutex) + mutex_unlock(i_mmap_mutex); + i_mmap_mutex = &page->mapping->i_mmap_mutex; + mutex_lock(i_mmap_mutex); + } references = page_check_references(page, mz, sc); + switch (references) { case PAGEREF_ACTIVATE: goto activate_locked; @@ -939,7 +948,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * processes. Try to unmap it here. */ if (page_mapped(page) && mapping) { - switch (try_to_unmap(page, TTU_UNMAP)) { + switch (__try_to_unmap(page, TTU_UNMAP)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: @@ -1090,6 +1099,8 @@ keep_lumpy: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } + if (i_mmap_mutex) + mutex_unlock(i_mmap_mutex); nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages, &free_pages); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
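The batching loop added to shrink_page_list above boils down to the pattern sketched below: remember which mapping's mutex is currently held and only drop/retake it when the next page belongs to a different mapping. (Simplified userspace stand-ins; none of these names are the kernel's.)

    #include <pthread.h>
    #include <stddef.h>

    struct mapping_s { pthread_mutex_t mmap_lock; };
    struct page_s    { struct mapping_s *mapping; struct page_s *next; };

    static void check_references_locked(struct page_s *p) { /* ... */ }

    static void scan_list(struct page_s *head)
    {
            pthread_mutex_t *held = NULL;
            struct page_s *p;

            for (p = head; p; p = p->next) {
                    pthread_mutex_t *want = &p->mapping->mmap_lock;

                    if (held != want) {             /* mapping changed */
                            if (held)
                                    pthread_mutex_unlock(held);
                            held = want;
                            pthread_mutex_lock(held);
                    }
                    check_references_locked(p);     /* mutex already held */
            }
            if (held)
                    pthread_mutex_unlock(held);
    }

Because successive pages on the reclaim list tend to come from the same file, the lock/unlock pair is amortized over many pages.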
[PATCH 0/3 v2] mm: Batch page reclamation under shrink_page_list
This is the second version of the patch series. Thanks to Matthew Wilcox for many valuable suggestions on improving the patches. To do page reclamation in shrink_page_list function, there are two locks taken on a page by page basis. One is the tree lock protecting the radix tree of the page mapping and the other is the mapping->i_mmap_mutex protecting the mapped pages. I try to batch the operations on pages sharing the same lock to reduce lock contentions. The first patch batch the operations protected by tree lock while the second and third patch batch the operations protected by the i_mmap_mutex. I managed to get 14% throughput improvement when with a workload putting heavy pressure of page cache by reading many large mmaped files simultaneously on a 8 socket Westmere server. Tim Signed-off-by: Tim Chen --- Diffstat include/linux/rmap.h |8 +++- mm/rmap.c| 110 ++--- mm/vmscan.c | 113 +- 3 files changed, 185 insertions(+), 46 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list
We gather the pages that need to be unmapped in shrink_page_list. We batch the unmap to reduce the frequency of acquisition of the tree lock protecting the mapping's radix tree. This is possible as successive pages likely share the same mapping in __remove_mapping_batch routine. This avoids excessive cache bouncing of the tree lock when page reclamations are occurring simultaneously. Tim --- Signed-off-by: Tim Chen --- diff --git a/mm/vmscan.c b/mm/vmscan.c index aac5672..d4ab646 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -600,6 +600,85 @@ cannot_free: return 0; } +/* Same as __remove_mapping, but batching operations to minimize locking */ +/* Pages to be unmapped should be locked first */ +static int __remove_mapping_batch(struct list_head *unmap_pages, + struct list_head *ret_pages, + struct list_head *free_pages) +{ + struct address_space *mapping, *next; + LIST_HEAD(swap_pages); + swp_entry_t swap; + struct page *page; + int nr_reclaimed; + + mapping = NULL; + nr_reclaimed = 0; + while (!list_empty(unmap_pages)) { + + page = lru_to_page(unmap_pages); + BUG_ON(!PageLocked(page)); + + list_del(&page->lru); + next = page_mapping(page); + if (mapping != next) { + if (mapping) + spin_unlock_irq(&mapping->tree_lock); + mapping = next; + spin_lock_irq(&mapping->tree_lock); + } + + if (!page_freeze_refs(page, 2)) + goto cannot_free; + if (unlikely(PageDirty(page))) { + page_unfreeze_refs(page, 2); + goto cannot_free; + } + + if (PageSwapCache(page)) { + __delete_from_swap_cache(page); + /* swapcache_free need to be called without tree_lock */ + list_add(&page->lru, &swap_pages); + } else { + void (*freepage)(struct page *); + + freepage = mapping->a_ops->freepage; + + __delete_from_page_cache(page); + mem_cgroup_uncharge_cache_page(page); + + if (freepage != NULL) + freepage(page); + + unlock_page(page); + nr_reclaimed++; + list_add(&page->lru, free_pages); + } + continue; +cannot_free: + unlock_page(page); + list_add(&page->lru, ret_pages); + VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); + + } + + if (mapping) + spin_unlock_irq(&mapping->tree_lock); + + while (!list_empty(&swap_pages)) { + page = lru_to_page(&swap_pages); + list_del(&page->lru); + + swap.val = page_private(page); + swapcache_free(swap, page); + + unlock_page(page); + nr_reclaimed++; + list_add(&page->lru, free_pages); + } + + return nr_reclaimed; +} /* * Attempt to detach a locked page from its ->mapping. If it is dirty or if * someone else has a ref on the page, abort and return 0. If it was @@ -771,6 +850,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); + LIST_HEAD(unmap_pages); int pgactivate = 0; unsigned long nr_dirty = 0; unsigned long nr_congested = 0; @@ -969,17 +1049,13 @@ static unsigned long shrink_page_list(struct list_head *page_list, } } - if (!mapping || !__remove_mapping(mapping, page)) + if (!mapping) goto keep_locked; - /* -* At this point, we have no other references and there is -* no way to pick any more up (removed from LRU, removed -* from pagecache). Can use non-atomic bitops now (and -* we obviously don't have to worry about waking up a process -* waiting on the page lock, because there are no references. -*/ - __clear_page_locked(page); + /* remove pages from mapping in batch at end of loop */ + list_add(&page->lru, &unmap_pages); + continue; + free_it: nr_reclaimed++; @@ -1014,6 +1090,9 @@ keep_lumpy: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } + nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages, +
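The effect of __remove_mapping_batch above can be pictured with the sketch below: pages that were deferred onto the local list and share a mapping are detached under a single acquisition of that mapping's tree lock, instead of one lock/unlock round trip per page. (Illustrative types and helpers only, not the kernel code.)

    #include <pthread.h>
    #include <stddef.h>

    struct mapping_s { pthread_mutex_t tree_lock; };

    /* Caller holds m->tree_lock; detach one page from m's radix tree. */
    static void __remove_one(struct mapping_s *m, void *page)
    {
            /* ... */
    }

    static unsigned long remove_mapping_batch(struct mapping_s *m,
                                              void **pages, size_t nr)
    {
            size_t i;

            pthread_mutex_lock(&m->tree_lock);      /* one acquisition ... */
            for (i = 0; i < nr; i++)
                    __remove_one(m, pages[i]);      /* ... amortized over nr pages */
            pthread_mutex_unlock(&m->tree_lock);

            return nr;
    }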
Re: [PATCH 00/33] AutoNUMA27
On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote: > Andrew Morton writes: > > > On Thu, 4 Oct 2012 01:50:42 +0200 > > Andrea Arcangeli wrote: > > > >> This is a new AutoNUMA27 release for Linux v3.6. > > > > Peter's numa/sched patches have been in -next for a week. > > Did they pass review? I have some doubts. > > The last time I looked it also broke numactl. > > > Guys, what's the plan here? > > Since they are both performance features their ultimate benefit > is how much faster they make things (and how seldom they make things > slower) > > IMHO needs a performance shot-out. Run both on the same 10 workloads > and see who wins. Just a lot of of work. Any volunteers? > > For a change like this I think less regression is actually more > important than the highest peak numbers. > > -Andi > I remembered that 3 months ago when Alex tested the numa/sched patches there were 20% regression on SpecJbb2005 due to the numa balancer. Those issues may have been fixed but we probably need to run this benchmark against the latest. For most of the other kernel performance workloads we ran we didn't see much changes. Maurico has a different config for this benchmark and it will be nice if he can also check to see if there are any performance changes on his side. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/4] Patchset to use PCLMULQDQ to accelerate CRC-T10DIF checksum computation
Herbert, Currently the CRC-T10DIF checksum is computed using a generic table lookup algorithm. By switching the checksum to PCLMULQDQ based computation, we can speed up the computation by 8x for checksumming 512 bytes and even more for larger buffer sizes. This will improve performance of SCSI drivers turning on the CRC-T10DIF checksum. In our SSD based experiments, we have seen disk throughput increase by 3.5x with T10DIF. This patchset provides the x86_64 routine using the PCLMULQDQ instruction and switches the crc_t10dif library function to use the faster PCLMULQDQ based routine when available. I would appreciate it if you could consider merging this for the 3.10 kernel. Tim Tim Chen (4): Wrap crc_t10dif function call to use crypto transform framework Accelerated CRC T10 DIF computation with PCLMULQDQ instruction Glue code to cast accelerated CRCT10DIF assembly as a crypto transform Simple correctness and speed test for CRCT10DIF hash arch/x86/crypto/Makefile| 2 + arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 crypto/Kconfig | 21 + crypto/tcrypt.c | 8 + crypto/testmgr.c| 10 + crypto/testmgr.h| 24 ++ include/linux/crc-t10dif.h | 10 + lib/crc-t10dif.c| 96 + 9 files changed, 983 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash
These are simple tests to do sanity check of CRC T10 DIF hash. The correctness of the transform can be checked with the command modprobe tcrypt mode=47 The speed of the transform can be evaluated with the command modprobe tcrypt mode=320 Set the cpu frequency to constant and turn turbo off when running the speed test so the frequency governor will not tweak the frequency and affects the measurements. Signed-off-by: Tim Chen Tested-by: Keith Busch --- crypto/tcrypt.c | 8 crypto/testmgr.c | 10 ++ crypto/testmgr.h | 24 3 files changed, 42 insertions(+) diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c index 24ea7df..5e95722 100644 --- a/crypto/tcrypt.c +++ b/crypto/tcrypt.c @@ -1174,6 +1174,10 @@ static int do_test(int m) ret += tcrypt_test("ghash"); break; + case 47: + ret += tcrypt_test("crct10dif"); + break; + case 100: ret += tcrypt_test("hmac(md5)"); break; @@ -1498,6 +1502,10 @@ static int do_test(int m) test_hash_speed("crc32c", sec, generic_hash_speed_template); if (mode > 300 && mode < 400) break; + case 320: + test_hash_speed("crct10dif", sec, generic_hash_speed_template); + if (mode > 300 && mode < 400) break; + case 399: break; diff --git a/crypto/testmgr.c b/crypto/testmgr.c index 3807084..b165316 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -1944,6 +1944,16 @@ static const struct alg_test_desc alg_test_descs[] = { } } }, { + .alg = "crct10dif", + .test = alg_test_hash, + .fips_allowed = 1, + .suite = { + .hash = { + .vecs = crct10dif_tv_template, + .count = CRCT10DIF_TEST_VECTORS + } + } + }, { .alg = "cryptd(__driver-cbc-aes-aesni)", .test = alg_test_null, .fips_allowed = 1, diff --git a/crypto/testmgr.h b/crypto/testmgr.h index d503660..6262b74 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -450,6 +450,30 @@ static struct hash_testvec rmd320_tv_template[] = { } }; +#define CRCT10DIF_TEST_VECTORS 2 +static struct hash_testvec crct10dif_tv_template[] = { + { + .plaintext = "abc", + .psize = 3, +#ifdef __LITTLE_ENDIAN + .digest = "\x3b\x44", +#else + .digest = "\x44\x3b", +#endif + }, { + .plaintext = + "abcd", + .psize = 56, +#ifdef __LITTLE_ENDIAN + .digest = "\xe3\x9c", +#else + .digest = "\x9c\xe3", +#endif + .np = 2, + .tap= { 28, 28 } + } +}; + /* * SHA1 test vectors from from FIPS PUB 180-1 * Long vector from CAVS 5.0 -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] Glue code to cast accelerated CRCT10DIF assembly as a crypto transform
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the crypto framework. The config CRYPTO_CRCT10DIF_PCLMUL should be turned on to enable the feature. The crc_t10dif crypto library function will use this faster algorithm when crct10dif_pclmul module is loaded. Signed-off-by: Tim Chen Tested-by: Keith Busch --- arch/x86/crypto/Makefile| 2 + arch/x86/crypto/crct10dif-pclmul_glue.c | 153 crypto/Kconfig | 21 + 3 files changed, 176 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 03cd731..d544a66 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -27,6 +27,7 @@ obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o +obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o # These modules require assembler to support AVX. ifeq ($(avx_supported),yes) @@ -70,3 +71,4 @@ crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o +crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c new file mode 100644 index 000..e87f8d8 --- /dev/null +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c @@ -0,0 +1,153 @@ +/* + * Cryptographic API. + * + * T10 Data Integrity Field CRC16 Crypto Xform using PCLMULQDQ Instructions + * + * Copyright (C) 2013 Intel Corporation + * Author: Tim Chen + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +asmlinkage __u16 crc_t10dif_pcl(__u16 crc, const unsigned char *buf, + size_t len); + +struct chksum_desc_ctx { + __u16 crc; +}; + +/* + * Steps through buffer one byte at at time, calculates reflected + * crc using table. 
+ */ + +static int chksum_init(struct shash_desc *desc) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + ctx->crc = 0; + + return 0; +} + +static int chksum_update(struct shash_desc *desc, const u8 *data, +unsigned int length) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + if (irq_fpu_usable()) { + kernel_fpu_begin(); + ctx->crc = crc_t10dif_pcl(ctx->crc, data, length); + kernel_fpu_end(); + } else + ctx->crc = crc_t10dif_generic(ctx->crc, data, length); + return 0; +} + +static int chksum_final(struct shash_desc *desc, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + *(__u16 *)out = ctx->crc; + return 0; +} + +static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len, + u8 *out) +{ + if (irq_fpu_usable()) { + kernel_fpu_begin(); + *(__u16 *)out = crc_t10dif_pcl(*crcp, data, len); + kernel_fpu_end(); + } else + *(__u16 *)out = crc_t10dif_generic(*crcp, data, len); + return 0; +} + +static int chksum_finup(struct shash_desc *desc, const u8 *data, + unsigned int len, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + return __chksum_finup(&ctx->crc, data, len, out); +} + +static int chksum_digest(struct shash_desc *desc, const u8 *data, +unsigned int length, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + return __chksum_finup(&ctx->crc, data, length, out); +} + +static struct shash_alg alg = { + .digestsize = CRC_T10DIF_DIGEST_SIZE, + .init = chksum_init, + .update = chksum_update, + .final
[PATCH 1/4] Wrap crc_t10dif function call to use crypto transform framework
When CRC T10 DIF is calculated using the crypto transform framework, we wrap the crc_t10dif function call to utilize it. This allows us to take advantage of any accelerated CRC T10 DIF transform that is plugged into the crypto framework. Signed-off-by: Tim Chen Tested-by: Keith Busch --- include/linux/crc-t10dif.h | 10 + lib/crc-t10dif.c | 96 ++ 2 files changed, 106 insertions(+) diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h index a9c96d8..f0eb4d5 100644 --- a/include/linux/crc-t10dif.h +++ b/include/linux/crc-t10dif.h @@ -3,6 +3,16 @@ #include +#ifdef CONFIG_CRYPTO_CRCT10DIF + +#define CRC_T10DIF_DIGEST_SIZE 2 +#define CRC_T10DIF_BLOCK_SIZE 1 + +__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len); +void crc_t10dif_update_lib(void); + +#endif /* CONFIG_CRYPTO_CRCT10DIF */ + __u16 crc_t10dif(unsigned char const *, size_t); #endif diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c index fbbd66e..41f469a 100644 --- a/lib/crc-t10dif.c +++ b/lib/crc-t10dif.c @@ -11,6 +11,9 @@ #include #include #include +#include +#include +#include /* Table generated using the following polynomium: * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1 @@ -51,6 +54,98 @@ static const __u16 t10_dif_crc_table[256] = { 0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3 }; +#ifdef CONFIG_CRYPTO_CRCT10DIF + +static struct crypto_shash *crct10dif_tfm; + +/* flag to indicate update to new algorithm in progress*/ +static bool crc_t10dif_newalg; + +__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len) +{ + unsigned int i; + + for (i = 0 ; i < len ; i++) + crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff]; + + return crc; +} +EXPORT_SYMBOL(crc_t10dif_generic); + +/* + * If we have defined crypto transform for CRC-T10DIF, use that instead. + * This allows us to plug in fast version of CRC-T10DIF when available. + */ + +void crc_t10dif_update_lib() +{ + struct crypto_shash *old_tfm, *new_tfm; + + old_tfm = crct10dif_tfm; + crc_t10dif_newalg = true; + /* make sure new alg flag is turned on before starting to switch tfm */ + mb(); + + new_tfm = crypto_alloc_shash("crct10dif", 0, 0); + if (IS_ERR(new_tfm)) + goto done; + + if (old_tfm) + if (crypto_tfm_alg_priority(&old_tfm->base) >= + crypto_tfm_alg_priority(&new_tfm->base)) { + crypto_free_shash(new_tfm); + goto done; + } + crct10dif_tfm = new_tfm; + /* make sure update to tfm pointer is completed */ + mb(); + crypto_free_shash(old_tfm); + +done: + crc_t10dif_newalg = false; +} +EXPORT_SYMBOL(crc_t10dif_update_lib); + +__u16 crc_t10dif(const unsigned char *buffer, size_t len) +{ + struct { + struct shash_desc shash; + char ctx[2]; + } desc; + int err; + + /* plugging in new alg or not using a tfm? 
*/ + if (unlikely(crc_t10dif_newalg) || (!crct10dif_tfm)) + return crc_t10dif_generic(0, buffer, len); + + desc.shash.tfm = crct10dif_tfm; + desc.shash.flags = 0; + *(__u16 *)desc.ctx = 0; + + err = crypto_shash_update(&desc.shash, buffer, len); + BUG_ON(err); + + return *(__u16 *)desc.ctx; +} +EXPORT_SYMBOL(crc_t10dif); + +static int __init crc_t10dif_mod_init(void) +{ + crct10dif_tfm = NULL; + crc_t10dif_newalg = false; + return 0; +} + +static void __exit crc_t10dif_mod_fini(void) +{ + if (crct10dif_tfm) + crypto_free_shash(crct10dif_tfm); +} + +module_init(crc_t10dif_mod_init); +module_exit(crc_t10dif_mod_fini); + +#else __u16 crc_t10dif(const unsigned char *buffer, size_t len) { __u16 crc = 0; @@ -62,6 +157,7 @@ __u16 crc_t10dif(const unsigned char *buffer, size_t len) return crc; } EXPORT_SYMBOL(crc_t10dif); +#endif MODULE_DESCRIPTION("T10 DIF CRC calculation"); MODULE_LICENSE("GPL"); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
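For readers who want to sanity-check the table-driven crc_t10dif_generic() (and, indirectly, the PCLMULQDQ path), the table in this patch encodes the T10 DIF polynomial 0x18BB7 listed in the comment. A bit-at-a-time equivalent, written here as a standalone userspace sketch rather than kernel code, is:

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t crc_t10dif_bitwise(uint16_t crc, const unsigned char *buf,
                                       size_t len)
    {
            size_t i;
            int bit;

            for (i = 0; i < len; i++) {
                    /* MSB-first, no reflection, just like the table version. */
                    crc ^= (uint16_t)buf[i] << 8;
                    for (bit = 0; bit < 8; bit++)
                            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8bb7)
                                                 : (uint16_t)(crc << 1);
            }
            return crc;
    }

With a zero seed, crc_t10dif_bitwise(0, "abc", 3) should reproduce the 0x443b value used as the "abc" test vector in the tcrypt patch of this series.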
[PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ instructions. Details discussing the implementation can be found in the paper: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" URL: http://download.intel.com/design/intarch/papers/323102.pdf Signed-off-by: Tim Chen Tested-by: Keith Busch --- arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 + 1 file changed, 659 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S diff --git a/arch/x86/crypto/crct10dif-pcl-asm_64.S b/arch/x86/crypto/crct10dif-pcl-asm_64.S new file mode 100644 index 000..c497d0e --- /dev/null +++ b/arch/x86/crypto/crct10dif-pcl-asm_64.S @@ -0,0 +1,659 @@ + +# Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions +# +# Copyright (c) 2013, Intel Corporation +# +# Authors: +# Erdinc Ozturk +# Vinodh Gopal +# James Guilford +# Tim Chen +# +# This software is available to you under a choice of one of two +# licenses. You may choose to be licensed under the terms of the GNU +# General Public License (GPL) Version 2, available from the file +# COPYING in the main directory of this source tree, or the +# OpenIB.org BSD license below: +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are +# met: +# +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the +# distribution. +# +# * Neither the name of the Intel Corporation nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# +# THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES# LOSS OF USE, DATA, OR +# PROFITS# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +# Function API: +# UINT16 crc_t10dif_pcl( +# UINT16 init_crc, //initial CRC value, 16 bits +# const unsigned char *buf, //buffer pointer to calculate CRC on +# UINT64 len //buffer length in bytes (64-bit data) +# )# +# +# Reference paper titled "Fast CRC Computation for Generic +# Polynomials Using PCLMULQDQ Instruction" +# URL: http://download.intel.com/design/intarch/papers/323102.pdf +# +# + +#include + +.text + +#definearg1 %rdi +#definearg2 %rsi +#definearg3 %rdx + +#definearg1_low32 %edi + +ENTRY(crc_t10dif_pcl) +.align 16 + + # adjust the 16-bit initial_crc value, scale it to 32 bits + shl $16, arg1_low32 + + # Allocate Stack Space + mov %rsp, %rcx + sub $16*10, %rsp + and $~(0x20 - 1), %rsp + + # push the xmm registers into the stack to maintain + movdqa %xmm10, 16*2(%rsp) + movdqa %xmm11, 16*3(%rsp) + movdqa %xmm8 , 16*4(%rsp) + movdqa %xmm12, 16*5(%rsp) + movdqa %xmm13, 16*6(%rsp) + movdqa %xmm6, 16*7(%rsp) + movdqa %xmm7, 16*8(%rsp) + movdqa %xmm9, 16*9(%rsp) + + + # check if smaller than 256 + cmp $256, arg3 + + # for sizes less than 128, we can't fold 64B at a time... + jl _less_than_128 + + + # load the initial crc value + movdarg1_low32, %xmm10 # initial crc + + # crc value does not need to be byte-reflected, but it needs + # to be moved to the high part of the register. + # because data will be byte-reflected and will align with + # initial crc at correct place. + pslldq $12, %xmm10 + + movdqa SHUF_MASK(%rip), %xmm11 + # receive the initial 64B data, xor the initial crc value + movdqu 16*0(arg2), %xmm0 + movdqu 16*1(arg2), %xmm1 + movdqu 16*2(arg2), %xmm2 + movdqu 16*3(arg2), %xmm3 + movdqu 16*4(arg2), %xmm4 + movdqu 16*5(arg2), %xmm5 + m
Re: [PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash
On Wed, 2013-04-17 at 20:58 +0300, Jussi Kivilinna wrote: > On 16.04.2013 19:20, Tim Chen wrote: > > These are simple tests to do sanity check of CRC T10 DIF hash. The > > correctness of the transform can be checked with the command > > modprobe tcrypt mode=47 > > The speed of the transform can be evaluated with the command > > modprobe tcrypt mode=320 > > > > Set the cpu frequency to constant and turn turbo off when running the > > speed test so the frequency governor will not tweak the frequency and > > affects the measurements. > > > > Signed-off-by: Tim Chen > > Tested-by: Keith Busch > > > > > +#define CRCT10DIF_TEST_VECTORS 2 > > +static struct hash_testvec crct10dif_tv_template[] = { > > + { > > + .plaintext = "abc", > > + .psize = 3, > > +#ifdef __LITTLE_ENDIAN > > + .digest = "\x3b\x44", > > +#else > > + .digest = "\x44\x3b", > > +#endif > > + }, { > > + .plaintext = > > + "abcd", > > + .psize = 56, > > +#ifdef __LITTLE_ENDIAN > > + .digest = "\xe3\x9c", > > +#else > > + .digest = "\x9c\xe3", > > +#endif > > + .np = 2, > > + .tap= { 28, 28 } > > + } > > +}; > > + > > Are these large enough to test all code paths in the PCLMULQDQ implementation? We have tested the implementation exhaustively for large number of code paths. The tcrypt tests here are only meant for sanity check, and not for exhaustive tests. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
On Wed, 2013-04-17 at 20:58 +0300, Jussi Kivilinna wrote: > On 16.04.2013 19:20, Tim Chen wrote: > > This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ > > instructions. Details discussing the implementation can be found in the > > paper: > > > > "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" > > URL: http://download.intel.com/design/intarch/papers/323102.pdf > > URL does not work. Thanks for catching this. Will update. > > > > > Signed-off-by: Tim Chen > > Tested-by: Keith Busch > > --- > > arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 > > + > > 1 file changed, 659 insertions(+) > > create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S > > > + > > + # Allocate Stack Space > > + mov %rsp, %rcx > > + sub $16*10, %rsp > > + and $~(0x20 - 1), %rsp > > + > > + # push the xmm registers into the stack to maintain > > + movdqa %xmm10, 16*2(%rsp) > > + movdqa %xmm11, 16*3(%rsp) > > + movdqa %xmm8 , 16*4(%rsp) > > + movdqa %xmm12, 16*5(%rsp) > > + movdqa %xmm13, 16*6(%rsp) > > + movdqa %xmm6, 16*7(%rsp) > > + movdqa %xmm7, 16*8(%rsp) > > + movdqa %xmm9, 16*9(%rsp) > > You don't need to store (and restore) these, as 'crc_t10dif_pcl' is called > between kernel_fpu_begin/_end. That's true. Will skip the xmm save/restore in update to the patch. > > > + > > + > > + # check if smaller than 256 > > + cmp $256, arg3 > > + > > > +_cleanup: > > + # scale the result back to 16 bits > > + shr $16, %eax > > + movdqa 16*2(%rsp), %xmm10 > > + movdqa 16*3(%rsp), %xmm11 > > + movdqa 16*4(%rsp), %xmm8 > > + movdqa 16*5(%rsp), %xmm12 > > + movdqa 16*6(%rsp), %xmm13 > > + movdqa 16*7(%rsp), %xmm6 > > + movdqa 16*8(%rsp), %xmm7 > > + movdqa 16*9(%rsp), %xmm9 > > Registers are overwritten by kernel_fpu_end. > > > + mov %rcx, %rsp > > + ret > > +ENDPROC(crc_t10dif_pcl) > > + > > You should move ENDPROC at end of the full function. > > > + > > + > > +.align 16 > > +_less_than_128: > > + > > + # check if there is enough buffer to be able to fold 16B at a time > > + cmp $32, arg3 > > > + movdqa (%rsp), %xmm7 > > + pshufb %xmm11, %xmm7 > > + pxor%xmm0 , %xmm7 # xor the initial crc value > > + > > + psrldq $7, %xmm7 > > + > > + jmp _barrett > > Move ENDPROC here. > Will do. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/4] Wrap crc_t10dif function call to use crypto transform framework
When CRC T10 DIF is calculated using the crypto transform framework, we wrap the crc_t10dif function call to utilize it. This allows us to take advantage of any accelerated CRC T10 DIF transform that is plugged into the crypto framework. Signed-off-by: Tim Chen --- include/linux/crc-t10dif.h | 10 + lib/crc-t10dif.c | 96 ++ 2 files changed, 106 insertions(+) diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h index a9c96d8..f0eb4d5 100644 --- a/include/linux/crc-t10dif.h +++ b/include/linux/crc-t10dif.h @@ -3,6 +3,16 @@ #include +#ifdef CONFIG_CRYPTO_CRCT10DIF + +#define CRC_T10DIF_DIGEST_SIZE 2 +#define CRC_T10DIF_BLOCK_SIZE 1 + +__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len); +void crc_t10dif_update_lib(void); + +#endif /* CONFIG_CRYPTO_CRCT10DIF */ + __u16 crc_t10dif(unsigned char const *, size_t); #endif diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c index fbbd66e..41f469a 100644 --- a/lib/crc-t10dif.c +++ b/lib/crc-t10dif.c @@ -11,6 +11,9 @@ #include #include #include +#include +#include +#include /* Table generated using the following polynomium: * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1 @@ -51,6 +54,98 @@ static const __u16 t10_dif_crc_table[256] = { 0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3 }; +#ifdef CONFIG_CRYPTO_CRCT10DIF + +static struct crypto_shash *crct10dif_tfm; + +/* flag to indicate update to new algorithm in progress*/ +static bool crc_t10dif_newalg; + +__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len) +{ + unsigned int i; + + for (i = 0 ; i < len ; i++) + crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff]; + + return crc; +} +EXPORT_SYMBOL(crc_t10dif_generic); + +/* + * If we have defined crypto transform for CRC-T10DIF, use that instead. + * This allows us to plug in fast version of CRC-T10DIF when available. + */ + +void crc_t10dif_update_lib() +{ + struct crypto_shash *old_tfm, *new_tfm; + + old_tfm = crct10dif_tfm; + crc_t10dif_newalg = true; + /* make sure new alg flag is turned on before starting to switch tfm */ + mb(); + + new_tfm = crypto_alloc_shash("crct10dif", 0, 0); + if (IS_ERR(new_tfm)) + goto done; + + if (old_tfm) + if (crypto_tfm_alg_priority(&old_tfm->base) >= + crypto_tfm_alg_priority(&new_tfm->base)) { + crypto_free_shash(new_tfm); + goto done; + } + crct10dif_tfm = new_tfm; + /* make sure update to tfm pointer is completed */ + mb(); + crypto_free_shash(old_tfm); + +done: + crc_t10dif_newalg = false; +} +EXPORT_SYMBOL(crc_t10dif_update_lib); + +__u16 crc_t10dif(const unsigned char *buffer, size_t len) +{ + struct { + struct shash_desc shash; + char ctx[2]; + } desc; + int err; + + /* plugging in new alg or not using a tfm? 
*/ + if (unlikely(crc_t10dif_newalg) || (!crct10dif_tfm)) + return crc_t10dif_generic(0, buffer, len); + + desc.shash.tfm = crct10dif_tfm; + desc.shash.flags = 0; + *(__u16 *)desc.ctx = 0; + + err = crypto_shash_update(&desc.shash, buffer, len); + BUG_ON(err); + + return *(__u16 *)desc.ctx; +} +EXPORT_SYMBOL(crc_t10dif); + +static int __init crc_t10dif_mod_init(void) +{ + crct10dif_tfm = NULL; + crc_t10dif_newalg = false; + return 0; +} + +static void __exit crc_t10dif_mod_fini(void) +{ + if (crct10dif_tfm) + crypto_free_shash(crct10dif_tfm); +} + +module_init(crc_t10dif_mod_init); +module_exit(crc_t10dif_mod_fini); + +#else __u16 crc_t10dif(const unsigned char *buffer, size_t len) { __u16 crc = 0; @@ -62,6 +157,7 @@ __u16 crc_t10dif(const unsigned char *buffer, size_t len) return crc; } EXPORT_SYMBOL(crc_t10dif); +#endif MODULE_DESCRIPTION("T10 DIF CRC calculation"); MODULE_LICENSE("GPL"); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 0/4] Patchset to use PCLMULQDQ to accelerate CRC-T10DIF checksum computation
Currently the CRC-T10DIF checksum is computed using a generic table lookup algorithm. By switching the checksum to PCLMULQDQ based computation, we can speed up the computation by 8x for checksumming 512 bytes and even more for larger buffer sizes. This will improve performance of SCSI drivers turning on the CRC-T10DIF checksum. In our SSD based experiments, we have seen disk throughput increase by 3.5x with T10DIF for 512 byte block size. This patch set provides the x86_64 routine using the PCLMULQDQ instruction and switches the crc_t10dif library function to use the faster PCLMULQDQ based routine when available. Tim v1->v2 1. Get rid of unnecessary xmm register saves and restores and fix ENDPROC position in PCLMULQDQ version of crc t10dif computation. 2. Fix URL to paper reference of CRC computation with PCLMULQDQ. 3. Add one additional tcrypt test case to exercise more code paths through crc t10dif computation. 4. Fix config dependencies of CRYPTO_CRCT10DIF. Thanks to Matthew and Jussi who reviewed the patches and Keith for testing version 1 of the patch set. Tim Chen (4): Wrap crc_t10dif function call to use crypto transform framework Accelerated CRC T10 DIF computation with PCLMULQDQ instruction Glue code to cast accelerated CRCT10DIF assembly as a crypto transform Simple correctness and speed test for CRCT10DIF hash arch/x86/crypto/Makefile| 2 + arch/x86/crypto/crct10dif-pcl-asm_64.S | 643 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 crypto/Kconfig | 21 ++ crypto/tcrypt.c | 8 + crypto/testmgr.c| 10 + crypto/testmgr.h| 33 ++ include/linux/crc-t10dif.h | 10 + lib/crc-t10dif.c| 96 + 9 files changed, 976 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 3/4] Glue code to cast accelerated CRCT10DIF assembly as a crypto transform
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the crypto framework. The config CRYPTO_CRCT10DIF_PCLMUL should be turned on to enable the feature. The crc_t10dif crypto library function will use this faster algorithm when crct10dif_pclmul module is loaded. Signed-off-by: Tim Chen --- arch/x86/crypto/Makefile| 2 + arch/x86/crypto/crct10dif-pclmul_glue.c | 153 crypto/Kconfig | 21 + 3 files changed, 176 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 03cd731..d544a66 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -27,6 +27,7 @@ obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o +obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o # These modules require assembler to support AVX. ifeq ($(avx_supported),yes) @@ -70,3 +71,4 @@ crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o +crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c new file mode 100644 index 000..e87f8d8 --- /dev/null +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c @@ -0,0 +1,153 @@ +/* + * Cryptographic API. + * + * T10 Data Integrity Field CRC16 Crypto Xform using PCLMULQDQ Instructions + * + * Copyright (C) 2013 Intel Corporation + * Author: Tim Chen + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +asmlinkage __u16 crc_t10dif_pcl(__u16 crc, const unsigned char *buf, + size_t len); + +struct chksum_desc_ctx { + __u16 crc; +}; + +/* + * Steps through buffer one byte at at time, calculates reflected + * crc using table. 
+ */ + +static int chksum_init(struct shash_desc *desc) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + ctx->crc = 0; + + return 0; +} + +static int chksum_update(struct shash_desc *desc, const u8 *data, +unsigned int length) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + if (irq_fpu_usable()) { + kernel_fpu_begin(); + ctx->crc = crc_t10dif_pcl(ctx->crc, data, length); + kernel_fpu_end(); + } else + ctx->crc = crc_t10dif_generic(ctx->crc, data, length); + return 0; +} + +static int chksum_final(struct shash_desc *desc, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + *(__u16 *)out = ctx->crc; + return 0; +} + +static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len, + u8 *out) +{ + if (irq_fpu_usable()) { + kernel_fpu_begin(); + *(__u16 *)out = crc_t10dif_pcl(*crcp, data, len); + kernel_fpu_end(); + } else + *(__u16 *)out = crc_t10dif_generic(*crcp, data, len); + return 0; +} + +static int chksum_finup(struct shash_desc *desc, const u8 *data, + unsigned int len, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + return __chksum_finup(&ctx->crc, data, len, out); +} + +static int chksum_digest(struct shash_desc *desc, const u8 *data, +unsigned int length, u8 *out) +{ + struct chksum_desc_ctx *ctx = shash_desc_ctx(desc); + + return __chksum_finup(&ctx->crc, data, length, out); +} + +static struct shash_alg alg = { + .digestsize = CRC_T10DIF_DIGEST_SIZE, + .init = chksum_init, + .update = chksum_update, + .final = chksum_final, +
[PATCH v2 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ instructions. Details discussing the implementation can be found in the paper: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf Signed-off-by: Tim Chen --- arch/x86/crypto/crct10dif-pcl-asm_64.S | 643 + 1 file changed, 643 insertions(+) create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S diff --git a/arch/x86/crypto/crct10dif-pcl-asm_64.S b/arch/x86/crypto/crct10dif-pcl-asm_64.S new file mode 100644 index 000..35e9756 --- /dev/null +++ b/arch/x86/crypto/crct10dif-pcl-asm_64.S @@ -0,0 +1,643 @@ + +# Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions +# +# Copyright (c) 2013, Intel Corporation +# +# Authors: +# Erdinc Ozturk +# Vinodh Gopal +# James Guilford +# Tim Chen +# +# This software is available to you under a choice of one of two +# licenses. You may choose to be licensed under the terms of the GNU +# General Public License (GPL) Version 2, available from the file +# COPYING in the main directory of this source tree, or the +# OpenIB.org BSD license below: +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are +# met: +# +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the +# distribution. +# +# * Neither the name of the Intel Corporation nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# +# THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +# Function API: +# UINT16 crc_t10dif_pcl( +# UINT16 init_crc, //initial CRC value, 16 bits +# const unsigned char *buf, //buffer pointer to calculate CRC on +# UINT64 len //buffer length in bytes (64-bit data) +# ); +# +# Reference paper titled "Fast CRC Computation for Generic +# Polynomials Using PCLMULQDQ Instruction" +# URL: http://www.intel.com/content/dam/www/public/us/en/documents +# /white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf +# +# + +#include + +.text + +#definearg1 %rdi +#definearg2 %rsi +#definearg3 %rdx + +#definearg1_low32 %edi + +ENTRY(crc_t10dif_pcl) +.align 16 + + # adjust the 16-bit initial_crc value, scale it to 32 bits + shl $16, arg1_low32 + + # Allocate Stack Space + mov %rsp, %rcx + sub $16*2, %rsp + # align stack to 16 byte boundary + and $~(0x10 - 1), %rsp + + # check if smaller than 256 + cmp $256, arg3 + + # for sizes less than 128, we can't fold 64B at a time... + jl _less_than_128 + + + # load the initial crc value + movdarg1_low32, %xmm10 # initial crc + + # crc value does not need to be byte-reflected, but it needs + # to be moved to the high part of the register. + # because data will be byte-reflected and will align with + # initial crc at correct place. + pslldq $12, %xmm10 + + movdqa SHUF_MASK(%rip), %xmm11 + # receive the initial 64B data, xor the initial crc value + movdqu 16*0(arg2), %xmm0 + movdqu 16*1(arg2), %xmm1 + movdqu 16*2(arg2), %xmm2 + movdqu 16*3(arg2), %xmm3 + movdqu 16*4(arg2), %xmm4 + movdqu 16*5(arg2), %xmm5 + movdqu 16*6(arg2), %xmm6 + movdqu 16*7(arg2), %xmm7 + + pshufb %xmm11, %xmm0 + # XOR the initial_crc value + pxor%xmm10, %xmm0 + pshufb
[PATCH v2 4/4] Simple correctness and speed test for CRCT10DIF hash
These are simple tests to do sanity check of CRC T10 DIF hash. The correctness of the transform can be checked with the command modprobe tcrypt mode=47 The speed of the transform can be evaluated with the command modprobe tcrypt mode=320 Set the cpu frequency to constant and turn turbo off when running the speed test so the frequency governor will not tweak the frequency and affects the measurements. Signed-off-by: Tim Chen --- crypto/tcrypt.c | 8 crypto/testmgr.c | 10 ++ crypto/testmgr.h | 33 + 3 files changed, 51 insertions(+) diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c index 24ea7df..5e95722 100644 --- a/crypto/tcrypt.c +++ b/crypto/tcrypt.c @@ -1174,6 +1174,10 @@ static int do_test(int m) ret += tcrypt_test("ghash"); break; + case 47: + ret += tcrypt_test("crct10dif"); + break; + case 100: ret += tcrypt_test("hmac(md5)"); break; @@ -1498,6 +1502,10 @@ static int do_test(int m) test_hash_speed("crc32c", sec, generic_hash_speed_template); if (mode > 300 && mode < 400) break; + case 320: + test_hash_speed("crct10dif", sec, generic_hash_speed_template); + if (mode > 300 && mode < 400) break; + case 399: break; diff --git a/crypto/testmgr.c b/crypto/testmgr.c index 3807084..b165316 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -1944,6 +1944,16 @@ static const struct alg_test_desc alg_test_descs[] = { } } }, { + .alg = "crct10dif", + .test = alg_test_hash, + .fips_allowed = 1, + .suite = { + .hash = { + .vecs = crct10dif_tv_template, + .count = CRCT10DIF_TEST_VECTORS + } + } + }, { .alg = "cryptd(__driver-cbc-aes-aesni)", .test = alg_test_null, .fips_allowed = 1, diff --git a/crypto/testmgr.h b/crypto/testmgr.h index d503660..56916d0 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -450,6 +450,39 @@ static struct hash_testvec rmd320_tv_template[] = { } }; +#define CRCT10DIF_TEST_VECTORS 3 +static struct hash_testvec crct10dif_tv_template[] = { + { + .plaintext = "abc", + .psize = 3, +#ifdef __LITTLE_ENDIAN + .digest = "\x3b\x44", +#else + .digest = "\x44\x3b", +#endif + }, { + .plaintext = "1234567890123456789012345678901234567890" +"123456789012345678901234567890123456789", + .psize = 79, +#ifdef __LITTLE_ENDIAN + .digest = "\x70\x4b", +#else + .digest = "\x4b\x70", +#endif + }, { + .plaintext = + "abcd", + .psize = 56, +#ifdef __LITTLE_ENDIAN + .digest = "\xe3\x9c", +#else + .digest = "\x9c\xe3", +#endif + .np = 2, + .tap= { 28, 28 } + } +}; + /* * SHA1 test vectors from from FIPS PUB 180-1 * Long vector from CAVS 5.0 -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix prototype definitions of sha256_transform_asm, sha512_transform_asm
Herbert, This is a follow on patch to the optimized sha256 and sha512 patch series that's just merged into the crypto-dev. Let me know if you prefer me to respin the patch series. This patch corrects the prototype of sha256_transform_asm and sha512_transform_asm function pointer declaration to static. It also fixes a typo in sha512_ssse3_final function that affects the computation of upper 64 bits of the buffer size. Thanks. Tim Signed-off-by: Tim Chen --- arch/x86/crypto/sha256_ssse3_glue.c | 2 +- arch/x86/crypto/sha512_ssse3_glue.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c index fa65453..8a0b711 100644 --- a/arch/x86/crypto/sha256_ssse3_glue.c +++ b/arch/x86/crypto/sha256_ssse3_glue.c @@ -53,7 +53,7 @@ asmlinkage void sha256_transform_rorx(const char *data, u32 *digest, u64 rounds); #endif -asmlinkage void (*sha256_transform_asm)(const char *, u32 *, u64); +static void (*sha256_transform_asm)(const char *, u32 *, u64); static int sha256_ssse3_init(struct shash_desc *desc) diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c index 295f790..3c844f2 100644 --- a/arch/x86/crypto/sha512_ssse3_glue.c +++ b/arch/x86/crypto/sha512_ssse3_glue.c @@ -52,7 +52,7 @@ asmlinkage void sha512_transform_rorx(const char *data, u64 *digest, u64 rounds); #endif -asmlinkage void (*sha512_transform_asm)(const char *, u64 *, u64); +static void (*sha512_transform_asm)(const char *, u64 *, u64); static int sha512_ssse3_init(struct shash_desc *desc) @@ -141,7 +141,7 @@ static int sha512_ssse3_final(struct shash_desc *desc, u8 *out) /* save number of bits */ bits[1] = cpu_to_be64(sctx->count[0] << 3); - bits[0] = cpu_to_be64(sctx->count[1] << 3) | sctx->count[0] >> 61; + bits[0] = cpu_to_be64(sctx->count[1] << 3 | sctx->count[0] >> 61); /* Pad out to 112 mod 128 and append length */ index = sctx->count[0] & 0x7f; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
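The second hunk above fixes an operator-precedence bug: the upper 64 bits of the 128-bit message length are (count[1] << 3) | (count[0] >> 61), and the byte swap has to cover that whole expression, not just the first term. A standalone check of the two forms (assuming a little-endian machine where cpu_to_be64 amounts to a byte swap; this is an illustration, not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* A count large enough that bits spill from count[0] into bits[0]. */
            uint64_t count[2] = { 0x2000000000000001ULL, 0x1ULL };

            uint64_t buggy = __builtin_bswap64(count[1] << 3) | (count[0] >> 61);
            uint64_t fixed = __builtin_bswap64((count[1] << 3) | (count[0] >> 61));

            /* The buggy form ORs the carried bits into the wrong, un-swapped end. */
            printf("buggy=%016llx fixed=%016llx\n",
                   (unsigned long long)buggy, (unsigned long long)fixed);
            return 0;
    }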
Performance regression from switching lock to rw-sem for anon-vma tree
Ingo,

At the time of switching the anon-vma tree's lock from mutex to rw-sem
(commit 5a505085), we encountered regressions for a fork-heavy workload.
A lot of optimizations to rw-sem (e.g. lock stealing) helped to mitigate
the problem. I tried an experiment on the 3.10-rc4 kernel to compare the
performance of rw-sem to one that uses mutex. I saw an 8% regression in
throughput for rw-sem vs a mutex implementation in 3.10-rc4.

For the experiments, I used the exim mail server workload in the MOSBENCH
test suite on a 4 socket (Westmere) and a 4 socket (Ivy Bridge) machine,
with the number of clients sending mail equal to the number of cores.
The mail server forks off a process to handle each incoming mail and puts
it into the mail spool. The lock protecting the anon-vma tree is stressed
due to heavy forking. On both machines, I saw that the mutex
implementation has 8% more throughput. I've pinned the cpu frequency to
maximum in the experiments.

I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested
each tweak individually.

1) Add an owner field when a writer holds the lock and introduce
optimistic spinning when an active writer is holding the semaphore.
It reduced the context switching by 30% to a level very close to the
mutex implementation. However, I did not see any throughput improvement
of exim.

2) When the sem->count's active field is non-zero (i.e. someone is
holding the lock), we can skip directly to the down_write_failed path,
without adding RWSEM_ACTIVE_WRITE_BIAS and then taking it off again from
sem->count, saving us two atomic operations. Since we will try the lock
stealing again later, this should be okay. Unfortunately it did not
improve the exim workload either.

Any suggestions on the difference between rwsem and mutex performance
and possible improvements to recover this regression?

Thanks.

Tim

vmstat for mutex implementation:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
38  0      0 130957920 47860 199956    0    0     0    56 236342 476975 14 72 14  0  0
41  0      0 130938560 47860 219900    0    0     0     0 236816 479676 14 72 14  0  0

vmstat for rw-sem implementation (3.10-rc4):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
40  0      0 130933984 43232 202584    0    0     0     0 321817 690741 13 71 16  0  0
39  0      0 130913904 43232 224812    0    0     0     0 322193 692949 13 71 16  0  0

Profile for mutex implementation:
 5.02%  exim  [kernel.kallsyms]  [k] page_fault
 3.67%  exim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
 2.66%  exim  [kernel.kallsyms]  [k] unmap_single_vma
 2.15%  exim  [kernel.kallsyms]  [k] do_raw_spin_lock
 2.14%  exim  [kernel.kallsyms]  [k] page_cache_get_speculative
 2.04%  exim  [kernel.kallsyms]  [k] copy_page_rep
 1.58%  exim  [kernel.kallsyms]  [k] clear_page_c
 1.55%  exim  [kernel.kallsyms]  [k] cpu_relax
 1.55%  exim  [kernel.kallsyms]  [k] mutex_unlock
 1.42%  exim  [kernel.kallsyms]  [k] __slab_free
 1.16%  exim  [kernel.kallsyms]  [k] mutex_lock
 1.12%  exim  libc-2.13.so       [.] vfprintf
 0.99%  exim  [kernel.kallsyms]  [k] find_vma
 0.95%  exim  [kernel.kallsyms]  [k] __list_del_entry

Profile for rw-sem implementation:
 4.88%  exim     [kernel.kallsyms]  [k] page_fault
 3.43%  exim     [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
 2.65%  exim     [kernel.kallsyms]  [k] unmap_single_vma
 2.46%  exim     [kernel.kallsyms]  [k] do_raw_spin_lock
 2.25%  exim     [kernel.kallsyms]  [k] copy_page_rep
 2.01%  exim     [kernel.kallsyms]  [k] page_cache_get_speculative
 1.81%  exim     [kernel.kallsyms]  [k] clear_page_c
 1.51%  exim     [kernel.kallsyms]  [k] __slab_free
 1.12%  exim     libc-2.13.so       [.] vfprintf
 1.06%  exim     [kernel.kallsyms]  [k] __list_del_entry
 1.02%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 1.00%  exim     [kernel.kallsyms]  [k] find_vma
 0.93%  exim     [kernel.kallsyms]  [k] mutex_unlock

turbostat for mutex implementation:
pk cor CPU   %c0   GHz   TSC    %c1   %c3   %c6  CTMP  %pc3  %pc6
            82.91  2.39  2.39  11.65  2.76  2.68    51  0.00  0.00

turbostat for rw-sem implementation (3.10-rc4):
pk cor CPU   %c0   GHz   TSC    %c1   %c3   %c6  CTMP  %pc3  %pc6
            80.10  2.39  2.39  14.96  2.80  2.13    52  0.00  0.00

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
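To make tweak 2 concrete, here is a minimal sketch of the idea, assuming
the stock __down_write() fast path and RWSEM_ACTIVE_MASK definitions;
rwsem_down_write_failed_nobias() is a hypothetical name for a slow path
that does not expect the write bias to have been applied. This is an
illustration of the idea only, not the actual experimental patch (which
appears later in the thread):

/*
 * Sketch only: if the active count already shows the semaphore is held,
 * skip the usual "add RWSEM_ACTIVE_WRITE_BIAS, notice failure, subtract
 * it again" sequence and queue up directly, saving two atomic ops on the
 * contended cacheline.  Lock stealing can still be attempted later from
 * the slow path.
 */
static inline void down_write_check_active(struct rw_semaphore *sem)
{
        if (sem->count & RWSEM_ACTIVE_MASK) {
                /* already held: go straight to a bias-free slow path */
                rwsem_down_write_failed_nobias(sem);
                return;
        }
        __down_write(sem);      /* uncontended fast path */
}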
Re: Performance regression from switching lock to rw-sem for anon-vma tree
Added copy to mailing list which I forgot in my previous reply: On Thu, 2013-06-13 at 16:43 -0700, Davidlohr Bueso wrote: > On Thu, 2013-06-13 at 16:15 -0700, Tim Chen wrote: > > Ingo, > > > > At the time of switching the anon-vma tree's lock from mutex to > > rw-sem (commit 5a505085), we encountered regressions for fork heavy > > workload. > > A lot of optimizations to rw-sem (e.g. lock stealing) helped to > > mitigate the problem. I tried an experiment on the 3.10-rc4 kernel > > to compare the performance of rw-sem to one that uses mutex. I saw > > a 8% regression in throughput for rw-sem vs a mutex implementation in > > 3.10-rc4. > > Funny, just yesterday I was discussing this issue with Michel. While I > didn't measure the anon-vma mutex->rwem conversion, I did convert the > i_mmap_mutex to a rwsem and noticed a performance regression on a few > aim7 workloads on a 8 socket, 80 core box, when keeping all writers, > which should perform very similarly to a mutex. While some of these > workloads recovered when I shared the lock among readers (similar to > anon-vma), it left me wondering. > > > For the experiments, I used the exim mail server workload in > > the MOSBENCH test suite on 4 socket (westmere) and a 4 socket > > (ivy bridge) with the number of clients sending mail equal > > to number of cores. The mail server will > > fork off a process to handle an incoming mail and put it into mail > > spool. The lock protecting the anon-vma tree is stressed due to > > heavy forking. On both machines, I saw that the mutex implementation > > has 8% more throughput. I've pinned the cpu frequency to maximum > > in the experiments. > > I got some similar -8% throughput on high_systime and shared. > That's interesting. Another perspective on rwsem vs mutex. > > > > I've tried two separate tweaks to the rw-sem on 3.10-rc4. I've tested > > each tweak individually. > > > > 1) Add an owner field when a writer holds the lock and introduce > > optimistic spinning when an active writer is holding the semaphore. > > It reduced the context switching by 30% to a level very close to the > > mutex implementation. However, I did not see any throughput improvement > > of exim. > > I was hoping that the lack of spin on owner was the main difference with > rwsems and am/was in the middle of implementing it. Could you send your > patch so I can give it a try on my workloads? > > Note that there have been a few recent (3.10) changes to mutexes that > give a nice performance boost, specially on large systems, most > noticeably: > > commit 2bd2c92c (mutex: Make more scalable by doing less atomic > operations) > > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce > cacheline contention) > > It might be worth looking into doing something similar to commit > 0dc8c730, in addition to the optimistic spinning. Okay. Here's my ugly experimental hack with some code lifted from optimistic spin within mutex. I've thought about doing the MCS lock thing but decided to keep the first try on the optimistic spinning simple. Matthew and I have also discussed possibly introducing some limited spinning for writer when semaphore is held by read. His idea was to have readers as well as writers set ->owner. Writers, as now, unconditionally clear owner. Readers clear owner if sem->owner == current. Writers spin on ->owner if ->owner is non-NULL and still active. That gives us a reasonable chance to spin since we'll be spinning on the most recent acquirer of the lock. 
Tim

Signed-off-by: Tim Chen
---
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..331f5f0 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,7 @@ struct rw_semaphore {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map      dep_map;
 #endif
+       struct task_struct      *owner;
 };
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..916747f 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 #include
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+       sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+       sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
        rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
        LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+       rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
 
@@ -59,8 +70,10 @@ int down_write_trylock(struct r
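The patch above only covers the writer side of ->owner. As a rough
illustration of the reader-side protocol described in prose above
(readers also record themselves, and a writer only spins while the
recorded owner is still running), a hedged sketch might look like the
following; it is not part of the posted patch:

/* Illustrative only: the reader/writer ->owner idea discussed above. */
static inline void rwsem_reader_set_owner(struct rw_semaphore *sem)
{
        sem->owner = current;                   /* most recent acquirer */
}

static inline void rwsem_reader_clear_owner(struct rw_semaphore *sem)
{
        if (sem->owner == current)              /* only if still the latest */
                sem->owner = NULL;
}

static inline bool rwsem_owner_running(struct rw_semaphore *sem)
{
        struct task_struct *owner = ACCESS_ONCE(sem->owner);

        /* Spinning only pays off while the recorded owner is on a cpu. */
        return owner && owner->on_cpu;
}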
Re: Performance regression from switching lock to rw-sem for anon-vma tree
> Unfortunately this patch didn't make any difference, in fact it hurt
> several of the workloads even more. I also tried disabling preemption
> when spinning on owner to actually resemble spinlocks, which was my
> original plan, yet not much difference.
>

That's also similar to the performance I got. There are things about
optimistic spinning that I missed that result in the better mutex
performance. So I'm scratching my head.

> A few ideas that come to mind are avoiding taking the ->wait_lock and
> avoid dealing with waiters when doing the optimistic spinning (just like
> mutexes do).
>

For my patch, we actually spin without the wait_lock.

> I agree that we should first deal with the optimistic spinning before
> adding the MCS complexity.
>
> > Matthew and I have also discussed possibly introducing some
> > limited spinning for writer when semaphore is held by read.
> > His idea was to have readers as well as writers set ->owner.
> > Writers, as now, unconditionally clear owner. Readers clear
> > owner if sem->owner == current. Writers spin on ->owner if ->owner
> > is non-NULL and still active. That gives us a reasonable chance
> > to spin since we'll be spinning on
> > the most recent acquirer of the lock.
>
> I also tried implementing this concept on top of your patch, didn't make
> much of a difference with or without it.
>

It also didn't make a difference for me.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Mon, 2013-06-17 at 09:22 -0700, Davidlohr Bueso wrote:
> On Sun, 2013-06-16 at 17:50 +0800, Alex Shi wrote:
> > On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> > > I was hoping that the lack of spin on owner was the main difference with
> > > rwsems and am/was in the middle of implementing it. Could you send your
> > > patch so I can give it a try on my workloads?
> > >
> > > Note that there have been a few recent (3.10) changes to mutexes that
> > > give a nice performance boost, specially on large systems, most
> > > noticeably:
> > >
> > > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > > operations)
> > >
> > > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > > cacheline contention)
> > >
> > > It might be worth looking into doing something similar to commit
> > > 0dc8c730, in addition to the optimistic spinning.
> >
> > It is a good tunning for large machine. I just following what the commit
> > 0dc8c730 done, give a RFC patch here. I tried it on my NHM EP machine.
> > seems no clear help on aim7. but maybe it is helpful on large machine. :)
>
> After a lot of benchmarking, I finally got the ideal results for aim7,
> so far: this patch + optimistic spinning with preemption disabled. Just
> like optimistic spinning, this patch by itself makes little to no
> difference, yet combined is where we actually outperform 3.10-rc5. In
> addition, I noticed extra throughput when disabling preemption in
> try_optimistic_spin().
>
> With i_mmap as a rwsem and these changes I could see performance
> benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users, for fewer users, it made little to no difference.
>

Thanks. Those are encouraging numbers. On my exim workload I didn't
get a boost when I added in the preempt disable in optimistic spin and
put Alex's changes in. Can you send me your combined patch to see if
there may be something you did that I've missed? I have a tweak to
Alex's patch below to simplify things a bit.

Tim

> Thanks,
> Davidlohr
>
> > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> > index bb1e2cd..240729a 100644
> > --- a/include/asm-generic/rwsem.h
> > +++ b/include/asm-generic/rwsem.h
> > @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
> >
> >  static inline int __down_write_trylock(struct rw_semaphore *sem)
> >  {
> > -       long tmp;
> > +       if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> > +               return 0;
> >
> > -       tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > -                     RWSEM_ACTIVE_WRITE_BIAS);
> > -       return tmp == RWSEM_UNLOCKED_VALUE;
> > +       return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > +               RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> >  }
> >
> >  /*
> > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > index 19c5fa9..9e54e20 100644
> > --- a/lib/rwsem.c
> > +++ b/lib/rwsem.c
> > @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >         struct rwsem_waiter *waiter;
> >         struct task_struct *tsk;
> >         struct list_head *next;
> > -       long oldcount, woken, loop, adjustment;
> > +       long woken, loop, adjustment;
> >
> >         waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> >         if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> > @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >          * will block as they will notice the queued writer.
> >          */
> >         wake_up_process(waiter->task);
> > -       goto out;
> > +       return sem;
> >         }
> >
> >         /* Writers might steal the lock before we grant it to the next reader.
> > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >         adjustment = 0;
> >         if (wake_type != RWSEM_WAKE_READ_OWNED) {
> >                 adjustment = RWSEM_ACTIVE_READ_BIAS;
> > - try_reader_grant:
> > -       oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > -       if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > -               /* A writer stole the lock. Undo our reader grant. */
> > +       while (1) {
> > +               long oldcount;
> > +
> > +               /* A writer stole the lock. */
> > +               if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > +                       return sem;
> > +
> > +               if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > +                       cpu_relax();
> > +                       continue;
> > +               }

The above two if statements could be cleaned up as a single check:

        if (unlikely(sem->count < RWSEM_WAITING_BIAS))
                return sem;

This one statement is sufficient to check that we don't have a writer
stolen the lock before we attempt to acquire the read lock by modifying
sem->count.
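For reference, a hedged sketch of why the single comparison works, using
the bias conventions of the 3.10-era rwsem code (RWSEM_ACTIVE_BIAS
positive, RWSEM_WAITING_BIAS and RWSEM_ACTIVE_WRITE_BIAS negative). This
helper is illustrative only and not part of either patch:

/*
 * Illustrative helper, not from the patches above.  With waiters queued,
 * count starts at RWSEM_WAITING_BIAS (negative).  Active readers only
 * add positive RWSEM_ACTIVE_BIAS units, so count stays at or above
 * RWSEM_WAITING_BIAS.  An active writer adds RWSEM_ACTIVE_WRITE_BIAS
 * (== RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS, i.e. negative), pushing
 * count below RWSEM_WAITING_BIAS, which is exactly what the single
 * comparison catches.
 */
static inline bool rwsem_writer_stole_lock(struct rw_semaphore *sem)
{
        return sem->count < RWSEM_WAITING_BIAS;
}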
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Fri, 2013-06-14 at 15:47 -0700, Michel Lespinasse wrote: > On Fri, Jun 14, 2013 at 3:31 PM, Davidlohr Bueso > wrote: > > A few ideas that come to mind are avoiding taking the ->wait_lock and > > avoid dealing with waiters when doing the optimistic spinning (just like > > mutexes do). > > > > I agree that we should first deal with the optimistic spinning before > > adding the MCS complexity. > > Maybe it would be worth disabling the MCS patch in mutex and comparing > that to the rwsem patches ? Just to make sure the rwsem performance > delta isn't related to that. > I've tried to back out the MCS patch. In fact, for exim, it is about 1% faster without MCS. So the better performance of mutex I saw was not due to MCS. Thanks for the suggestion. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
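For readers unfamiliar with the MCS patch being discussed (commit
0dc8c730), a minimal sketch of the queueing idea follows. The names are
loosely modeled on the mutex spinner queue of that era; this is an
illustration, not the actual kernel implementation:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/mutex.h>        /* arch_mutex_cpu_relax() */

/*
 * Each spinner spins on its own node's flag instead of on the shared
 * lock word, so the contended cacheline does not bounce between all
 * spinners.
 */
struct mcs_node {
        struct mcs_node *next;
        int locked;             /* set to 1 when it is this node's turn */
};

static void mcs_lock(struct mcs_node **tail, struct mcs_node *node)
{
        struct mcs_node *prev;

        node->next = NULL;
        node->locked = 0;
        prev = xchg(tail, node);        /* atomically join the queue */
        if (!prev)
                return;                 /* queue was empty: lock acquired */
        ACCESS_ONCE(prev->next) = node;
        while (!ACCESS_ONCE(node->locked))
                arch_mutex_cpu_relax(); /* spin on our own cacheline */
}

static void mcs_unlock(struct mcs_node **tail, struct mcs_node *node)
{
        struct mcs_node *next = ACCESS_ONCE(node->next);

        if (!next) {
                /* no successor visible: try to empty the queue */
                if (cmpxchg(tail, node, NULL) == node)
                        return;
                /* a successor is enqueueing; wait for it to appear */
                while (!(next = ACCESS_ONCE(node->next)))
                        arch_mutex_cpu_relax();
        }
        ACCESS_ONCE(next->locked) = 1;  /* pass the lock on */
}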
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Mon, 2013-06-17 at 12:05 -0700, Davidlohr Bueso wrote:
> >
> > Thanks. Those are encouraging numbers. On my exim workload I didn't
> > get a boost when I added in the preempt disable in optimistic spin and
> > put Alex's changes in. Can you send me your combined patch to see if
> > there may be something you did that I've missed. I have a tweak to
> > Alex's patch below to simplify things a bit.
> >
>
> I'm using:
>
> int rwsem_optimistic_spin(struct rw_semaphore *sem)
> {
>         struct task_struct *owner;
>
>         /* sem->wait_lock should not be held when attempting optimistic
>            spinning */
>         if (!rwsem_can_spin_on_owner(sem))
>                 return 0;
>
>         preempt_disable();
>         for (;;) {
>                 owner = ACCESS_ONCE(sem->owner);
>                 if (owner && !rwsem_spin_on_owner(sem, owner))
>                         break;
>
>                 /* wait_lock will be acquired if write_lock is obtained */
>                 if (rwsem_try_write_lock(sem->count, true, sem)) {
>                         preempt_enable();
>                         return 1;
>                 }
>
>                 /*
>                  * When there's no owner, we might have preempted between the
>                  * owner acquiring the lock and setting the owner field. If
>                  * we're an RT task that will live-lock because we won't let
>                  * the owner complete.
>                  */
>                 if (!owner && (need_resched() || rt_task(current)))
>                         break;
>
>                 /*
>                  * The cpu_relax() call is a compiler barrier which forces
>                  * everything in this loop to be re-loaded. We don't need
>                  * memory barriers as we'll eventually observe the right
>                  * values at the cost of a few extra spins.
>                  */
>                 arch_mutex_cpu_relax();
>         }
>         preempt_enable();
>         return 0;
> }

This is identical to the changes that I've tested. Thanks for sharing.

Tim

> > > > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> > > >         adjustment = 0;
> > > >         if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > > >                 adjustment = RWSEM_ACTIVE_READ_BIAS;
> > > > - try_reader_grant:
> > > > -       oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > > > -       if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > > -               /* A writer stole the lock. Undo our reader grant. */
> > > > +       while (1) {
> > > > +               long oldcount;
> > > > +
> > > > +               /* A writer stole the lock. */
> > > > +               if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > > > +                       return sem;
> > > > +
> > > > +               if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > > > +                       cpu_relax();
> > > > +                       continue;
> > > > +               }
> >
> > The above two if statements could be cleaned up as a single check:
> >
> >         if (unlikely(sem->count < RWSEM_WAITING_BIAS))
> >                 return sem;
> >
> > This one statement is sufficient to check that we don't have a writer
> > stolen the lock before we attempt to acquire the read lock by modifying
> > sem->count.
>
> We probably still want to keep the cpu relaxation if the statement
> doesn't comply.
>
> Thanks,
> Davidlohr
>
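rwsem_can_spin_on_owner() and rwsem_spin_on_owner() are not shown above.
A hedged sketch of what the owner-spinning helper might look like,
loosely modeled on mutex_spin_on_owner() of that era, follows; it is not
the actual code from the experimental patches:

/*
 * Illustrative sketch only: spin while the recorded owner is still the
 * same task and still running on a cpu.  Returns true if the owner
 * changed or released the semaphore (worth retrying the lock), false if
 * we should stop spinning and block.
 */
static bool rwsem_spin_on_owner(struct rw_semaphore *sem,
                                struct task_struct *owner)
{
        while (ACCESS_ONCE(sem->owner) == owner) {
                if (!ACCESS_ONCE(owner->on_cpu) || need_resched())
                        return false;   /* owner scheduled out, or we must */
                arch_mutex_cpu_relax();
        }
        return true;    /* owner changed or released the semaphore */
}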
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > so far: this patch + optimistic spinning with preemption disabled. Just
> > > like optimistic spinning, this patch by itself makes little to no
> > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > addition, I noticed extra throughput when disabling preemption in
> > > try_optimistic_spin().
> > >
> > > With i_mmap as a rwsem and these changes I could see performance
> > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > users, for fewer users, it made little to no difference.
> >
> > A pretty good number. what's the cpu number in your machine? :)
>
> 8-socket, 80 cores (ht off)
>

David,

I wonder if you are interested to try the experimental patch below.
It tries to avoid unnecessary writes to the sem->count when we are
going to fail the down_write by executing rwsem_down_write_failed_s
instead of rwsem_down_write_failed. It should further reduce the cache
line bouncing. It didn't make a difference for my workload. Wonder if
it may help yours more in addition to the other two patches. Right now
the patch is an ugly hack. I'll merge rwsem_down_write_failed_s and
rwsem_down_write_failed into one function if this approach actually
helps things.

I'll clean these three patches after we have some idea of their
effectiveness.

Thanks.

Tim

Signed-off-by: Tim Chen
---

commit 04c8ad3f21861746d5b7fff55a6ef186a4dd0765
Author: Tim Chen
Date:   Mon Jun 10 04:50:04 2013 -0700

    Try skip write to rwsem->count when we have active lockers

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..83f9184 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -33,6 +33,7 @@ struct rw_semaphore {
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
 extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed_s(struct rw_semaphore *sem);
 extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
 extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
 
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..188f6ea 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -42,12 +42,22 @@ EXPORT_SYMBOL(down_read_trylock);
 /*
  * lock for writing
  */
+
+static void ___down_write(struct rw_semaphore *sem)
+{
+       if (sem->count & RWSEM_ACTIVE_MASK) {
+               rwsem_down_write_failed_s(sem);
+               return;
+       }
+       __down_write(sem);
+}
+
 void __sched down_write(struct rw_semaphore *sem)
 {
        might_sleep();
        rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-       LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+       LOCK_CONTENDED(sem, __down_write_trylock, ___down_write);
 }
 
 EXPORT_SYMBOL(down_write);
 
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..25143b5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -248,6 +248,63 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
        return sem;
 }
 
+struct rw_semaphore __sched *rwsem_down_write_failed_s(struct rw_semaphore *sem)
+{
+       long count, adjustment = 0;
+       struct rwsem_waiter waiter;
+       struct task_struct *tsk = current;
+
+       /* set up my own style of waitqueue */
+       waiter.task = tsk;
+       waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+       raw_spin_lock_irq(&sem->wait_lock);
+       if (list_empty(&sem->wait_list))
+               adjustment += RWSEM_WAITING_BIAS;
+       list_add_tail(&waiter.list, &sem->wait_list);
+
+       /* If there were already threads queued before us and there are no
+        * active writers, the lock must be read owned; so we try to wake
+        * any read locks that were queued ahead of us. */
+       if (adjustment == 0) {
+               if (sem->count > RWSEM_WAITING_BIAS)
+                       sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
+       } else
+               count = rwsem_atomic_update(adjustment, sem);
+
+       /* wait until we successfully acquire the lock */
+       set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+       while (true) {
+               if (!(sem->count & RWSEM_ACTIVE_MASK)) {
+                       /* Try acquiring the write lock. */
+                       count = RWSEM_ACTIVE_WRITE_BIAS;
+                       if (!list_is_singular(&sem->wait_list))
+                               count += RWSEM_WAITING_BIAS;
+
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:
> > vmstat for mutex implementation:
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> >  r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
> > 38  0      0 130957920 47860 199956    0    0     0    56 236342 476975 14 72 14  0  0
> > 41  0      0 130938560 47860 219900    0    0     0     0 236816 479676 14 72 14  0  0
> >
> > vmstat for rw-sem implementation (3.10-rc4):
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> >  r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
> > 40  0      0 130933984 43232 202584    0    0     0     0 321817 690741 13 71 16  0  0
> > 39  0      0 130913904 43232 224812    0    0     0     0 322193 692949 13 71 16  0  0
>
> It appears the main difference is that the rwsem variant context-switches
> about 36% more than the mutex version, right?
>
> I'm wondering how that's possible - the lock is mostly write-locked,
> correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse
> ought to have brought roughly the same lock-stealing behavior as mutexes
> do, right?
>
> So the next analytical step would be to figure out why rwsem lock-stealing
> is not behaving in an equivalent fashion on this workload. Do readers come
> in frequently enough to disrupt write-lock-stealing perhaps?
>
> Context-switch call-graph profiling might shed some light on where the
> extra context switches come from...
>
> Something like:
>
>   perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1
>
> or a variant thereof might do the trick.
>

Ingo,

It appears that we are getting many more down_write failures that cause
a process to block, versus going to the slow path in the mutex case.
Here's the profile data from

  perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1

3.10-rc4 (mutex implementation context switch profile):

-  59.51%     exim  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
     - __schedule
       - 99.98% schedule
          + 33.07% schedule_timeout
          + 23.84% pipe_wait
          + 20.24% do_wait
          + 12.37% do_exit
          + 5.34% sigsuspend
          - 3.40% schedule_preempt_disabled
               __mutex_lock_common.clone.8
               __mutex_lock_slowpath
             - mutex_lock                 <-- low rate of mutex blocking
                + 65.71% lock_anon_vma_root.clone.24
                + 19.03% anon_vma_lock.clone.21
                + 7.14% dup_mm
                + 5.36% unlink_file_vma
                + 1.71% ima_file_check
                + 0.64% generic_file_aio_write
          - 1.07% rwsem_down_write_failed
               call_rwsem_down_write_failed
               exit_shm
               do_exit
               do_group_exit
               SyS_exit_group
               system_call_fastpath
-  27.61%   smtpbm  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
     - __schedule
       - schedule
         - schedule_timeout
            + 100.00% sk_wait_data
+   0.46%  swapper  [kernel.kallsyms]  [k] perf_trace_sched_switch

3.10-rc4 (rw-sem implementation context switch profile):

   81.91%     exim  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
     - __schedule
       - 99.99% schedule
          - 65.26% rwsem_down_write_failed   <-- high rate of write lock blocking
             - call_rwsem_down_write_failed
                - 79.36% lock_anon_vma_root.clone.27
                   + 52.64% unlink_anon_vmas
                   + 47.36% anon_vma_clone
                + 12.16% anon_vma_fork
                + 8.00% anon_vma_free
          + 11.96% schedule_timeout
          + 7.66% do_exit
          + 7.61% do_wait
          + 5.49% pipe_wait
          + 1.82% sigsuspend
   13.55%   smtpbm  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
     - __schedule
       - schedule
         - schedule_timeout
    0.11%  rcu_sched  [kernel.kallsyms]  [k] perf_trace_sched_switch

Thanks.
Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Performance regression from switching lock to rw-sem for anon-vma tree
On Wed, 2013-06-19 at 16:11 -0700, Davidlohr Bueso wrote: > On Mon, 2013-06-17 at 17:08 -0700, Tim Chen wrote: > > On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote: > > > On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote: > > > > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote: > > > > > After a lot of benchmarking, I finally got the ideal results for aim7, > > > > > so far: this patch + optimistic spinning with preemption disabled. > > > > > Just > > > > > like optimistic spinning, this patch by itself makes little to no > > > > > difference, yet combined is where we actually outperform 3.10-rc5. In > > > > > addition, I noticed extra throughput when disabling preemption in > > > > > try_optimistic_spin(). > > > > > > > > > > With i_mmap as a rwsem and these changes I could see performance > > > > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), > > > > > high_systime > > > > > (+5%), shared (+15%) and short (+4%), most of them after around 500 > > > > > users, for fewer users, it made little to no difference. > > > > > > > > A pretty good number. what's the cpu number in your machine? :) > > > > > > 8-socket, 80 cores (ht off) > > > > > > > > > > David, > > > > I wonder if you are interested to try the experimental patch below. > > It tries to avoid unnecessary writes to the sem->count when we are > > going to fail the down_write by executing rwsem_down_write_failed_s > > instead of rwsem_down_write_failed. It should further reduce the > > cache line bouncing. It didn't make a difference for my > > workload. Wonder if it may help yours more in addition to the > > other two patches. Right now the patch is an ugly hack. I'll merge > > rwsem_down_write_failed_s and rwsem_down_write_failed into one > > function if this approach actually helps things. > > > > I tried this on top of the patches we've already been dealing with. It > actually did more harm than good. Only got a slight increase in the > five_sec workload, for the rest either no effect, or negative. So far > the best results are still with spin on owner + preempt disable + Alex's > patches. > Thanks for trying it out. A little disappointed as I was expecting no change in performance for the worst case. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/2] Make the batch size of the percpu_counter configurable
On Wed, 2013-05-29 at 12:26 -0700, Andrew Morton wrote: > On Wed, 22 May 2013 16:37:18 -0700 Tim Chen > wrote: > > > Currently the per cpu counter's batch size for memory accounting is > > configured as twice the number of cpus in the system. However, > > for system with very large memory, it is more appropriate to make it > > proportional to the memory size per cpu in the system. > > > > For example, for a x86_64 system with 64 cpus and 128 GB of memory, > > the batch size is only 2*64 pages (0.5 MB). So any memory accounting > > changes of more than 0.5MB will overflow the per cpu counter into > > the global counter. Instead, for the new scheme, the batch size > > is configured to be 0.4% of the memory/cpu = 8MB (128 GB/64 /256), > > which is more inline with the memory size. > > I renamed the patch to "mm: tune vm_committed_as percpu_counter > batching size". > > Do we have any performance testing results? They're pretty important > for a performance-improvement patch ;) > I've done a repeated brk test of 800KB (from will-it-scale test suite) with 80 concurrent processes on a 4 socket Westmere machine with a total of 40 cores. Without the patch, about 80% of cpu is spent on spin-lock contention within the vm_committed_as counter. With the patch, there's a 73x speedup on the benchmark and the lock contention drops off almost entirely. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
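As a rough illustration of the sizing rule described above (about 0.4% of
the memory per cpu, with the old 2 * num_cpus value as a floor), here is
a hedged sketch of the batch computation; the function name and exact
bounds are assumptions, not necessarily the code in the patch:

#include <linux/kernel.h>
#include <linux/cpumask.h>
#include <linux/mm.h>

/*
 * Illustrative sketch of the batch sizing described above.  For a
 * 128 GB, 64-cpu machine with 4 KB pages: 128 GB / 64 cpus = 2 GB per
 * cpu = 524288 pages; 524288 / 256 = 2048 pages = 8 MB, versus the old
 * 2 * 64 = 128 pages (0.5 MB).
 */
static s32 mm_compute_committed_batch(void)
{
        int nr = num_present_cpus();
        s32 old_batch = max_t(s32, nr * 2, 32);         /* previous heuristic */
        u64 memsized_batch;

        /* roughly 0.4% of the pages per cpu, capped to a positive s32 */
        memsized_batch = min_t(u64, (totalram_pages / nr) / 256, INT_MAX);

        return max_t(s32, memsized_batch, old_batch);
}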