Re: [PATCH 2/3 v2] Optimize CRC32C calculation with PCLMULQDQ instruction

2013-02-26 Thread Tim Chen
On Tue, 2013-02-26 at 17:54 +0800, Herbert Xu wrote:
> On Thu, Sep 27, 2012 at 03:44:22PM -0700, Tim Chen wrote:
> > This patch adds the crc_pcl function, which calculates the CRC32C checksum
> > using the PCLMULQDQ instruction on processors that support this feature.
> > This provides a speedup over using the CRC32 instruction alone.
> > The use of PCLMULQDQ necessitates invoking kernel_fpu_begin and
> > kernel_fpu_end, which incurs some overhead, so the new crc_pcl function is
> > only invoked for buffer sizes of 512 bytes or more.  Larger buffers can be
> > expected to see a greater speedup.  This feature is best used coupled with
> > eager_fpu, which reduces the kernel_fpu_begin/end overhead.  For a buffer
> > size of 1K the speedup is around 1.6x, and for buffer sizes greater than 4K
> > the speedup is around 3x compared to the original implementation in the
> > crc32c-intel module.  The test was performed on a Sandy Bridge based
> > platform with the cpu frequency held constant.
> > 
> > A white paper detailing the algorithm can be found here:
> > http://download.intel.com/design/intarch/papers/323405.pdf
> > 
> > Signed-off-by: Tim Chen 

Acked.  

BTW, Herbert, I also sent you a patch a few days ago to update the
link to the whitepaper on the CRC32C algorithm in the code.  I wonder if
you have received it.  Thanks.

Tim



[PATCH] Update the links to the white papers on CRC32C calculations with PCLMULQDQ instructions.

2013-02-21 Thread Tim Chen
Herbert,

The following patch updates the stale link to the CRC32C white paper
that was referenced in the code.

Tim


Signed-off-by: Tim Chen 
---
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index 93c6d39..f3f1e42 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -1,9 +1,10 @@
 /*
  * Implement fast CRC32C with PCLMULQDQ instructions. (x86_64)
  *
- * The white paper on CRC32C calculations with PCLMULQDQ instruction can be
+ * The white papers on CRC32C calculations with PCLMULQDQ instruction can be
  * downloaded from:
- * http://download.intel.com/design/intarch/papers/323405.pdf
+ * http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
+ * http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-paper.pdf
  *
  * Copyright (C) 2012 Intel Corporation.
  *
-- 
1.7.6.5




Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex

2012-09-04 Thread Tim Chen
On Tue, 2012-08-21 at 17:48 -0700, Tim Chen wrote:

> 
> Thanks to Matthew's suggestions on improving the patch. Here's the
> updated version.  It seems to be sane when I booted my machine up.  I
> will put it through more testing when I get a chance.
> 
> Tim
> 

Matthew,

The new patch seems to be causing some of the workloads with mmapped file
reads to segfault.  I will need to dig further to find out why.

Tim


> ---
> Signed-off-by: Tim Chen 
> ---
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index fd07c45..f1320b1 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -156,8 +156,11 @@ static inline void page_dup_rmap(struct page *page)
>  /*
>   * Called from mm/vmscan.c to handle paging out
>   */
> -int page_referenced(struct page *, int is_locked,
> - struct mem_cgroup *memcg, unsigned long *vm_flags);
> +int needs_page_mmap_mutex(struct page *page);
> +int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg,
> + unsigned long *vm_flags);
> +int __page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg,
> + unsigned long *vm_flags);
>  int page_referenced_one(struct page *, struct vm_area_struct *,
>   unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
>  
> @@ -176,6 +179,7 @@ enum ttu_flags {
>  bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  
>  int try_to_unmap(struct page *, enum ttu_flags flags);
> +int __try_to_unmap(struct page *, enum ttu_flags flags);
>  int try_to_unmap_one(struct page *, struct vm_area_struct *,
>   unsigned long address, enum ttu_flags flags);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5b5ad58..ca8cd21 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -843,8 +843,7 @@ static int page_referenced_file(struct page *page,
>* so we can safely take mapping->i_mmap_mutex.
>*/
>   BUG_ON(!PageLocked(page));
> -
> - mutex_lock(&mapping->i_mmap_mutex);
> + BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex));
>  
>   /*
>* i_mmap_mutex does not stabilize mapcount at all, but mapcount
> @@ -869,21 +868,16 @@ static int page_referenced_file(struct page *page,
>   break;
>   }
>  
> - mutex_unlock(&mapping->i_mmap_mutex);
>   return referenced;
>  }
>  
> -/**
> - * page_referenced - test if the page was referenced
> - * @page: the page to test
> - * @is_locked: caller holds lock on the page
> - * @memcg: target memory cgroup
> - * @vm_flags: collect encountered vma->vm_flags who actually referenced the 
> page
> - *
> - * Quick test_and_clear_referenced for all mappings to a page,
> - * returns the number of ptes which referenced the page.
> - */
> -int page_referenced(struct page *page,
> +int needs_page_mmap_mutex(struct page *page)
> +{
> + return page->mapping && page_mapped(page) && page_rmapping(page) &&
> + !PageKsm(page) && !PageAnon(page);
> +}
> +
> +int __page_referenced(struct page *page,
>   int is_locked,
>   struct mem_cgroup *memcg,
>   unsigned long *vm_flags)
> @@ -919,6 +913,32 @@ out:
>   return referenced;
>  }
>  
> +/**
> + * page_referenced - test if the page was referenced
> + * @page: the page to test
> + * @is_locked: caller holds lock on the page
> + * @memcg: target memory cgroup
> + * @vm_flags: collect encountered vma->vm_flags who actually referenced the 
> page
> + *
> + * Quick test_and_clear_referenced for all mappings to a page,
> + * returns the number of ptes which referenced the page.
> + */
> +int page_referenced(struct page *page,
> + int is_locked,
> + struct mem_cgroup *memcg,
> + unsigned long *vm_flags)
> +{
> + int result, needs_lock;
> +
> + needs_lock = needs_page_mmap_mutex(page);
> + if (needs_lock)
> + mutex_lock(&page->mapping->i_mmap_mutex);
> + result = __page_referenced(page, is_locked, memcg, vm_flags);
> + if (needs_lock)
> + mutex_unlock(&page->mapping->i_mmap_mutex);
> + return result;
> +}
> +
>  static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>   unsigned long address)
>  {
> @@ -1560,7 +1580,7 @@ static int try_to_unmap_file(struct page *page, enum 
> ttu_flags flags)
>   unsigned long max_nl_size = 0;
>   unsigned int mapcount;
>  
> - mutex_lock(&mapping->i_mmap_mutex)

Re: [RFC PATCH 2/2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex

2012-09-04 Thread Tim Chen
On Tue, 2012-09-04 at 08:21 -0700, Tim Chen wrote:
> On Tue, 2012-08-21 at 17:48 -0700, Tim Chen wrote:
> 
> > 
> > Thanks to Matthew's suggestions on improving the patch. Here's the
> > updated version.  It seems to be sane when I booted my machine up.  I
> > will put it through more testing when I get a chance.
> > 
> > Tim
> > 
> 
> Matthew,
> 
> The new patch seems to be causing some of the workloads with mmaped file
> read to seg fault.  Will need to dig further to find out why.
> 
> Tim
> 

Okay, the problem seems to be the code below.  It is too restrictive and
misses some cases where the mutex needs to be taken in try_to_unmap_file.

> > +int needs_page_mmap_mutex(struct page *page)
> > +{
> > +   return page->mapping && page_mapped(page) && page_rmapping(page) &&
> > +   !PageKsm(page) && !PageAnon(page);
> > +}
> > +

Changing the check to the following fixes the problem:

@@ -873,8 +873,7 @@ static int page_referenced_file(struct page *page,
 
 int needs_page_mmap_mutex(struct page *page)
 {
-   return page->mapping && page_mapped(page) && page_rmapping(page) &&
-   !PageKsm(page) && !PageAnon(page);
+   return page->mapping && !PageKsm(page) && !PageAnon(page);
 }

I'll do more testing and generate a second version of the patch set with the 
fixes.

Tim





Re: [PATCH 0/3 v2] mm: Batch page reclamation under shrink_page_list

2012-09-12 Thread Tim Chen
On Wed, 2012-09-12 at 12:27 -0700, Andrew Morton wrote:

> 
> That sounds good, although more details on the performance changes
> would be appreciated - after all, that's the entire point of the
> patchset.
> 
> And we shouldn't only test for improvements - we should also test for
> degradation.  What workloads might be harmed by this change?  I'd suggest
> 
> - a single process which opens N files and reads one page from each
>   one, then repeats.  So there are no contiguous LRU pages which share
>   the same ->mapping.  Get some page reclaim happening, measure the
>   impact.
> 
> - The batching means that we now do multiple passes over pageframes
>   where we used to do things in a single pass.  Walking all those new
>   page lists will be expensive if they are lengthy enough to cause L1
>   cache evictions.

I need to address both your concerns and Mel's concerns about the
downside of prolonging the hold on the page locks for the pages to be
unmapped in patch 1 of the series.  I'll try to do some testing to see
what kind of benefit I get by only batching operations under the
i_mmap_mutex (i.e. patches 2 and 3 only) and not doing the batch unmap.
Those other changes don't have the downside of prolonged page locking
and we can incorporate them with less risk.

> 
>   What would be a test for this?  A simple, single-threaded walk
>   through a file, I guess?

Thanks for your test suggestions.  I will run tests along the lines of
your suggestions when I generate the next iteration of the patches.

I've been playing with these patches for a while and they are based on the
3.4 kernel.  I'll move them to the 3.6 kernel in my next iteration.

> 
> Mel's review comments were useful, thanks.

I very much appreciate the comments from you, Mel and Minchan.  I'll try to
incorporate them in my changes.

Tim



Re: [PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list

2012-09-13 Thread Tim Chen
On Tue, 2012-09-11 at 12:05 +0100, Mel Gorman wrote:

> 
> One *massive* change here that is not called out in the changelog is that
> the reclaim path now holds the page lock on multiple pages at the same
> time waiting for them to be batch unlocked in __remove_mapping_batch.
> This is suspicious for two reasons.
> 
> The first suspicion is that it is expected that there are filesystems
> that lock multiple pages in page->index order and page reclaim tries to
> lock pages in a random order.  You are "ok" because you trylock the pages
> but there should be a comment explaining the situation and why you're
> ok.
> 
> My *far* greater concern is that the hold time for a locked page is
> now potentially much longer. You could lock a bunch of filesystem pages
> and then call pageout() on a swapcache page that takes a long time to
> write. This potentially causes a filesystem (or flusher threads etc)
> to stall on lock_page and that could cause all sorts of latency trouble.
> It will be hard to hit this bug and diagnose it but I believe it's
> there.
> 
> That second risk *really* must be commented upon and ideally reviewed by
> the filesystem people. However, I very strongly suspect that the outcome
> of such a review will be a suggestion to unlock the pages and reacquire
> the lock in __remove_mapping_batch(). Bear in mind that if you take this
> approach that you *must* use trylock when reacquiring the page lock and
> handle being unable to lock the page.
> 

Mel,

Thanks for your detailed comments and analysis.  If I unlock the pages,
will flusher threads be the only things that touch them?  Or do I
have to worry about other things potentially being done to the pages that
would make it invalid for me to unmap the pages later and put them on the
free list?

Tim



Re: [PATCH 0/3 v2] mm: Batch page reclamation under shrink_page_list

2012-09-13 Thread Tim Chen
On Tue, 2012-09-11 at 14:36 +0900, Minchan Kim wrote:

> 
> If you send next versions, please use git-format-patch --thread style.
> Quote from man
> "   If given --thread, git-format-patch will generate In-Reply-To and 
> References
>headers to make the second and subsequent patch mails appear as 
> replies to the
>first mail; this also generates a Message-Id header to reference.
> "
> 
> It helps making your all patches in this series thread type in mailbox
> so reviewers can find all patches related to a subject easily.
> 

Thanks for your detailed review of the patch series.  I will incorporate
your suggestions in the next version.

Tim



Re: [PATCH 0/3 v2] Optimize CRC32C calculation using PCLMULQDQ in crc32c-intel module

2012-10-05 Thread Tim Chen
On Fri, 2012-09-28 at 10:57 +0800, Herbert Xu wrote:
> On 2012-9-28 at 10:54 AM, "H. Peter Anvin" wrote:
> >
> > On 09/27/2012 03:44 PM, Tim Chen wrote:
> >>
> >> Version 2
> >> This version of the patch series fixes compilation errors for
> >> 32 bit x86 targets.
> >>
> >> Version 1
> >> This patch series optimized CRC32C calculations with PCLMULQDQ
> >> instruction for crc32c-intel module.  It speeds up the original
> >> implementation by 1.6x for 1K buffer and by 3x for buffer 4k or
> >> more.  The tcrypt module was enhanced for doing speed test
> >> on crc32c calculations.
> >>
> >
> > Herbert - you are handling this one, right?
> 
> Yes thanks
> 
> 

Hi Herbert, are there any other changes I should make on my end?  Is
the patchset ready to be picked up in crypto-dev?

Thanks.

Tim



[PATCH 1/3] Rename crc32c-intel.c to crc32c-intel_glue.c

2012-09-25 Thread Tim Chen
This patch renames crc32c-intel.c to crc32c-intel_glue.c in preparation
for linking with the new crc32c-pcl-intel-asm.S file, which contains an
optimized CRC32C calculation based on the PCLMULQDQ instruction.

Tim

Signed-off-by: Tim Chen 
---
 arch/x86/crypto/Makefile   |1 +
 .../crypto/{crc32c-intel.c => crc32c-intel_glue.c} |0
 2 files changed, 1 insertions(+), 0 deletions(-)
 rename arch/x86/crypto/{crc32c-intel.c => crc32c-intel_glue.c} (100%)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index e908e5d..edd2268 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,3 +43,4 @@ serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o 
serpent_avx_glue.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
+crc32c-intel-y := crc32c-intel_glue.o
diff --git a/arch/x86/crypto/crc32c-intel.c b/arch/x86/crypto/crc32c-intel_glue.c
similarity index 100%
rename from arch/x86/crypto/crc32c-intel.c
rename to arch/x86/crypto/crc32c-intel_glue.c
-- 
1.7.7.6





[PATCH 2/3] Optimize CRC32C calculation with PCLMULQDQ instruction

2012-09-25 Thread Tim Chen
This patch adds the crc_pcl function, which calculates the CRC32C checksum
using the PCLMULQDQ instruction on processors that support this feature.
This provides a speedup over using the CRC32 instruction alone.
The use of PCLMULQDQ necessitates invoking kernel_fpu_begin and
kernel_fpu_end, which incurs some overhead, so the new crc_pcl function is
only invoked for buffer sizes of 512 bytes or more.  Larger buffers can be
expected to see a greater speedup.  This feature is best used coupled with
eager_fpu, which reduces the kernel_fpu_begin/end overhead.  For a buffer
size of 1K the speedup is around 1.6x, and for buffer sizes greater than 4K
the speedup is around 3x compared to the original implementation in the
crc32c-intel module.  The test was performed on a Sandy Bridge based
platform with the cpu frequency held constant.

A white paper detailing the algorithm can be found here:
http://download.intel.com/design/intarch/papers/323405.pdf

Tim

Signed-off-by: Tim Chen 
---
 arch/x86/crypto/Makefile   |2 +-
 arch/x86/crypto/crc32c-intel_glue.c|   75 +
 arch/x86/crypto/crc32c-pcl-intel-asm.S |  460 
 3 files changed, 536 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/crypto/crc32c-pcl-intel-asm.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index edd2268..0babdcb 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -43,4 +43,4 @@ serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o 
serpent_avx_glue.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
-crc32c-intel-y := crc32c-intel_glue.o
+crc32c-intel-y := crc32c-pcl-intel-asm.o crc32c-intel_glue.o
diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index 493f959..432af71 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -32,9 +32,31 @@
 
 #include 
 #include 
+#include 
+#include 
 
 #define CHKSUM_BLOCK_SIZE  1
 #define CHKSUM_DIGEST_SIZE 4
+/*
+ * use carryless multiply version of crc32c when buffer
+ * size is >= 512 (when eager fpu is enabled) or
+ * >= 1024 (when eager fpu is disabled) to account
+ * for fpu state save/restore overhead.
+ */
+#define CRC32C_PCL_BREAKEVEN_EAGERFPU  512
+#define CRC32C_PCL_BREAKEVEN_NOEAGERFPU 1024
+
+static int crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_EAGERFPU;
+#if defined(X86_FEATURE_EAGER_FPU)
+#define set_pcl_breakeven_point()  \
+do {   \
+   if (!use_eager_fpu())   \
+   crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU; \
+} while (0)
+#else
+#define set_pcl_breakeven_point()  \
+   (crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU)
+#endif
 
 #define SCALE_F sizeof(unsigned long)
 
@@ -44,6 +66,9 @@
 #define REX_PRE
 #endif
 
+asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
+   unsigned int crc_init);
+
 static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t 
length)
 {
while (length--) {
@@ -117,6 +142,24 @@ static int crc32c_intel_update(struct shash_desc *desc, 
const u8 *data,
return 0;
 }
 
+static int crc32c_pcl_intel_update(struct shash_desc *desc, const u8 *data,
+  unsigned int len)
+{
+   u32 *crcp = shash_desc_ctx(desc);
+
+   /*
+* use faster PCL version if datasize is large enough to
+* overcome kernel fpu state save/restore overhead
+*/
+   if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   *crcp = crc_pcl(data, len, *crcp);
+   kernel_fpu_end();
+   } else
+   *crcp = crc32c_intel_le_hw(*crcp, data, len);
+   return 0;
+}
+
 static int __crc32c_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
u8 *out)
 {
@@ -124,12 +167,31 @@ static int __crc32c_intel_finup(u32 *crcp, const u8 
*data, unsigned int len,
return 0;
 }
 
+static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int 
len,
+   u8 *out)
+{
+   if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   *(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
+   kernel_fpu_end();
+   } else
+   *(__le32 *)out =
+   ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len));
+   return 0;
+}
+
 static int crc32c_intel_finup(struct shash_desc *desc, const u8 *data,
  unsigned int len, u8 *out)
 {
return __crc32c_intel_finup(shash_de

[PATCH 3/3] Added speed test in tcrypt for crc32c

2012-09-25 Thread Tim Chen
This patch adds a test case in tcrypt to perform a speed test for
crc32c checksum calculation.

Tim

Signed-off-by: Tim Chen 
---
 crypto/tcrypt.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 581081d..6deb77f 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1437,6 +1437,10 @@ static int do_test(int m)
test_hash_speed("ghash-generic", sec, hash_speed_template_16);
if (mode > 300 && mode < 400) break;
 
+   case 319:
+   test_hash_speed("crc32c", sec, generic_hash_speed_template);
+   if (mode > 300 && mode < 400) break;
+
case 399:
break;
 
-- 
1.7.7.6




[PATCH 0/3] Optimize CRC32C calculation using PCLMULQDQ in crc32c-intel module

2012-09-25 Thread Tim Chen
This patch series optimizes CRC32C calculation with the PCLMULQDQ
instruction for the crc32c-intel module.  It speeds up the original
implementation by 1.6x for a 1K buffer and by 3x for buffers of 4K or
more.  The tcrypt module was enhanced to do speed tests
on crc32c calculations.

Tim

Signed-off-by: Tim Chen 
---
Tim Chen (3):
  Rename crc32c-intel.c to crc32c-intel_glue.c
  Optimize CRC32C calculation with PCLMULQDQ instruction
  Added speed test in tcrypt for crc32c

 arch/x86/crypto/Makefile   |1 +
 .../crypto/{crc32c-intel.c => crc32c-intel_glue.c} |   75 
 arch/x86/crypto/crc32c-pcl-intel-asm.S |  460 
 crypto/tcrypt.c|4 +
 4 files changed, 540 insertions(+), 0 deletions(-)
 rename arch/x86/crypto/{crc32c-intel.c => crc32c-intel_glue.c} (70%)
 create mode 100644 arch/x86/crypto/crc32c-pcl-intel-asm.S

-- 
1.7.7.6




[PATCH] Avoid useless inodes and dentries reclamation

2013-08-28 Thread Tim Chen
This patch detects when free inodes and dentries are really low and
skips their reclamation, so we do not have to contend on the global
sb_lock uselessly under memory pressure.  Otherwise we create a logjam
trying to acquire the sb_lock in prune_super(), with little or no freed
memory to show for the effort.

The cpu-cycles profile below shows a multi-threaded large-file read
exerting memory pressure through page cache usage.  It is dominated by
sb_lock contention.  The patch eliminates the sb_lock contention almost
entirely for prune_super().

43.94%   usemem  [kernel.kallsyms] [k] _raw_spin_lock
 |
 --- _raw_spin_lock
|
|--32.44%-- grab_super_passive
|  prune_super
|  shrink_slab
|  do_try_to_free_pages
|  try_to_free_pages
|  __alloc_pages_nodemask
|  alloc_pages_current
|
|--32.18%-- put_super
|  drop_super
|  prune_super
|  shrink_slab
|  do_try_to_free_pages
|  try_to_free_pages
|  __alloc_pages_nodemask
|  alloc_pages_current

Signed-off-by: Tim Chen 
---
 fs/super.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 68307c0..70fa26c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
struct super_block *sb;
@@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct 
shrink_control *sc)
if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
return -1;
 
+   /*
+* Don't prune if we have few cached objects to reclaim to
+* avoid useless sb_lock contention
+*/
+   if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
+   return -1;
+
if (!grab_super_passive(sb))
return -1;
 
-- 
1.7.11.7




Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-08-28 Thread Tim Chen
On Thu, 2013-08-29 at 00:19 +0300, Kirill A. Shutemov wrote:

> > ---
> >  fs/super.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 68307c0..70fa26c 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
> >   * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
> >   * take a passive reference to the superblock to avoid this from occurring.
> >   */
> > +#define SB_CACHE_LOW 5
> >  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  {
> > struct super_block *sb;
> > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct 
> > shrink_control *sc)
> > if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> > return -1;
> >  
> > +   /*
> > +* Don't prune if we have few cached objects to reclaim to
> > +* avoid useless sb_lock contention
> > +*/
> > +   if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +   return -1;
> 
> I don't think it's correct: you don't account fs_objects here and
> prune_icache_sb() calls invalidate_mapping_pages() which can free a lot of
> memory. It's too naive approach. You can miss a memory hog easily this
> way.

Is it safe to call sb->s_op->nr_cached_objects(sb), assuming a non-NULL
s_op, without holding sb_lock to increment the ref count on sb?
I think it is safe, as we hold the shrinker_rwsem, so we cannot
unregister the shrinker and the s_op and sb structures should still be
there.  However, I'm not totally sure.

Tim

Signed-off-by: Tim Chen 
---
diff --git a/fs/super.c b/fs/super.c
index 68307c0..173d0d9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
+#define SB_CACHE_LOW 5
 static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 {
struct super_block *sb;
@@ -68,6 +69,17 @@ static int prune_super(struct shrinker *shrink, struct 
shrink_control *sc)
if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
return -1;
 
+   total_objects = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused;
+   if (sb->s_op && sb->s_op->nr_cached_objects)
+   total_objects += sb->s_op->nr_cached_objects(sb);
+
+   /*
+* Don't prune if we have few cached objects to reclaim to
+* avoid useless sb_lock contention
+*/
+   if (total_objects <= SB_CACHE_LOW)
+   return -1;
+
if (!grab_super_passive(sb))
return -1;
 




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-01 Thread Tim Chen
On Mon, 2013-09-30 at 12:36 -0400, Waiman Long wrote:
> On 09/30/2013 12:10 PM, Jason Low wrote:
> > On Mon, 2013-09-30 at 11:51 -0400, Waiman Long wrote:
> >> On 09/28/2013 12:34 AM, Jason Low wrote:
>  Also, below is what the mcs_spin_lock() and mcs_spin_unlock()
>  functions would look like after applying the proposed changes.
> 
>  static noinline
>  void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node 
>  *node)
>  {
>    struct mcs_spin_node *prev;
> 
>    /* Init node */
>    node->locked = 0;
>    node->next   = NULL;
> 
>    prev = xchg(lock, node);
>    if (likely(prev == NULL)) {
>    /* Lock acquired. No need to set node->locked since it
>  won't be used */
>    return;
>    }
>    ACCESS_ONCE(prev->next) = node;
>    /* Wait until the lock holder passes the lock down */
>    while (!ACCESS_ONCE(node->locked))
>    arch_mutex_cpu_relax();
>    smp_mb();
> >> I wonder if a memory barrier is really needed here.
> > If the compiler can reorder the while (!ACCESS_ONCE(node->locked)) check
> > so that the check occurs after an instruction in the critical section,
> > then the barrier may be necessary.
> >
> 
> In that case, just a barrier() call should be enough.

The cpu could still be executing out-of-order load instructions from the
critical section before checking node->locked.  So smp_mb() is probably
still needed.

Tim




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-01 Thread Tim Chen
On Tue, 2013-10-01 at 16:01 -0400, Waiman Long wrote:
> On 10/01/2013 12:48 PM, Tim Chen wrote:
> > On Mon, 2013-09-30 at 12:36 -0400, Waiman Long wrote:
> >> On 09/30/2013 12:10 PM, Jason Low wrote:
> >>> On Mon, 2013-09-30 at 11:51 -0400, Waiman Long wrote:
> >>>> On 09/28/2013 12:34 AM, Jason Low wrote:
> >>>>>> Also, below is what the mcs_spin_lock() and mcs_spin_unlock()
> >>>>>> functions would look like after applying the proposed changes.
> >>>>>>
> >>>>>> static noinline
> >>>>>> void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node 
> >>>>>> *node)
> >>>>>> {
> >>>>>>struct mcs_spin_node *prev;
> >>>>>>
> >>>>>>/* Init node */
> >>>>>>node->locked = 0;
> >>>>>>node->next   = NULL;
> >>>>>>
> >>>>>>prev = xchg(lock, node);
> >>>>>>if (likely(prev == NULL)) {
> >>>>>>/* Lock acquired. No need to set node->locked since 
> >>>>>> it
> >>>>>> won't be used */
> >>>>>>return;
> >>>>>>}
> >>>>>>ACCESS_ONCE(prev->next) = node;
> >>>>>>/* Wait until the lock holder passes the lock down */
> >>>>>>while (!ACCESS_ONCE(node->locked))
> >>>>>>arch_mutex_cpu_relax();
> >>>>>>smp_mb();
> >>>> I wonder if a memory barrier is really needed here.
> >>> If the compiler can reorder the while (!ACCESS_ONCE(node->locked)) check
> >>> so that the check occurs after an instruction in the critical section,
> >>> then the barrier may be necessary.
> >>>
> >> In that case, just a barrier() call should be enough.
> > The cpu could still be executing out of order load instruction from the
> > critical section before checking node->locked?  Probably smp_mb() is
> > still needed.
> >
> > Tim
> 
> But this is the lock function, a barrier() call should be enough to 
> prevent the critical section from creeping up there. We certainly need 
> some kind of memory barrier at the end of the unlock function.

I may be missing something.  My understanding is that barrier() only
prevents the compiler from rearranging instructions, but not the cpu's
out-of-order execution (as smp_mb() does).  So the cpu could read memory
in the next critical section before node->locked is true (i.e. before the
unlock has completed).  If we only have a simple barrier at the end of
mcs_lock, then say the code on CPU1 is

mcs_lock
x = 1;
...
x = 2;
mcs_unlock

and CPU2 is

mcs_lock
y = x;
...
mcs_unlock

We expect y to be 2 after the "y = x" assignment.  But
we may execute the code as

CPU1                            CPU2

x = 1;
...                             y = x;  (y = 1, out-of-order load)
x = 2;
mcs_unlock
                                Check node->locked == true
                                continue executing critical section
                                (y = 1 when we expect y = 2)

So we get y to be 1 when we expect that it should be 2.  Adding smp_mb
after the node->locked check in the lock code

   ACCESS_ONCE(prev->next) = node;
   /* Wait until the lock holder passes the lock down */
   while (!ACCESS_ONCE(node->locked))
        arch_mutex_cpu_relax();
   smp_mb();

should prevent this scenario.
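
To make the ordering concrete, here is a rough userspace C11 analogue of the
two-CPU scenario above (purely illustrative -- this is not the kernel code,
and all names in it are made up for the example).  The acquire load on
"locked" plays the role of the smp_mb()/smp_rmb() after the spin loop; if
that load were relaxed, y could legally observe a stale value of x.

/*
 * Illustrative userspace sketch only.  cpu1() plays the unlocker writing x
 * inside its critical section; cpu2() plays the waiter spinning on
 * node->locked.
 */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int locked;       /* stand-in for node->locked */
static atomic_int x;            /* data written in CPU1's critical section */

static void *cpu1(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_store_explicit(&x, 2, memory_order_relaxed);
        /* mcs_unlock: pass the lock down; release pairs with acquire below */
        atomic_store_explicit(&locked, 1, memory_order_release);
        return NULL;
}

static void *cpu2(void *arg)
{
        (void)arg;
        /* mcs_lock: wait until the lock holder passes the lock down */
        while (!atomic_load_explicit(&locked, memory_order_acquire))
                ;       /* acquire here == the barrier after the while loop */
        /* critical section: must see x == 2, never a stale value */
        printf("y = %d\n", atomic_load_explicit(&x, memory_order_relaxed));
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}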

Thanks.
Tim

> 
> -Longman




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-02 Thread Tim Chen
On Tue, 2013-10-01 at 21:25 -0400, Waiman Long wrote: 
> On 10/01/2013 05:16 PM, Tim Chen wrote:
> > On Tue, 2013-10-01 at 16:01 -0400, Waiman Long wrote:
> >>>
> >>> The cpu could still be executing out of order load instruction from the
> >>> critical section before checking node->locked?  Probably smp_mb() is
> >>> still needed.
> >>>
> >>> Tim
> >> But this is the lock function, a barrier() call should be enough to
> >> prevent the critical section from creeping up there. We certainly need
> >> some kind of memory barrier at the end of the unlock function.
> > I may be missing something.  My understanding is that barrier only
> > prevents the compiler from rearranging instructions, but not for cpu out
> > of order execution (as in smp_mb). So cpu could read memory in the next
> > critical section, before node->locked is true, (i.e. unlock has been
> > completed).  If we only have a simple barrier at end of mcs_lock, then
> > say the code on CPU1 is
> >
> > mcs_lock
> > x = 1;
> > ...
> > x = 2;
> > mcs_unlock
> >
> > and CPU 2 is
> >
> > mcs_lock
> > y = x;
> > ...
> > mcs_unlock
> >
> > We expect y to be 2 after the "y = x" assignment.  But we
> > we may execute the code as
> >
> > CPU1                            CPU2
> >
> > x = 1;
> > ...                             y = x;  (y = 1, out-of-order load)
> > x = 2;
> > mcs_unlock
> >                                 Check node->locked == true
> >                                 continue executing critical section
> >                                 (y = 1 when we expect y = 2)
> >
> > So we get y to be 1 when we expect that it should be 2.  Adding smp_mb
> > after the node->locked check in lock code
> >
> > ACCESS_ONCE(prev->next) = node;
> > /* Wait until the lock holder passes the lock down */
> > while (!ACCESS_ONCE(node->locked))
> >  arch_mutex_cpu_relax();
> > smp_mb();
> >
> > should prevent this scenario.
> >
> > Thanks.
> > Tim
> 
> If the lock and unlock functions are done right, there should be no 
> overlap of critical section. So it is job of the lock/unlock functions 
> to make sure that critical section code won't leak out. There should be 
> some kind of memory barrier at the beginning of the lock function and 
> the end of the unlock function.
> 
> The critical section also likely to have branches. The CPU may 
> speculatively execute code on the 2 branches, but one of them will be 
> discarded once the branch condition is known. Also 
> arch_mutex_cpu_relax() is a compiler barrier by itself. So we may not 
> need a barrier() after all. The while statement is a branch instruction, 
> any code after that can only be speculatively executed and cannot be 
> committed until the branch is done.

But couldn't the condition code be checked after the speculative execution?
The condition may not be true during the speculative execution and only
turn true when we check the condition and take that branch.

The thing that bothers me is that without a memory barrier after the while
statement, we could speculatively execute loads before affirming that the
lock is in the acquired state, with the lock only being set to the acquired
state in the meantime, when we check it.  We could be loading some memory
entry *before* node->locked has been set true.  I think an smp_rmb (if not
an smp_mb) should be placed after the while statement.

At first I was also thinking that the memory barrier is not 
necessary but Paul convinced me otherwise in a previous email. 
https://lkml.org/lkml/2013/9/27/523 

> 
> In x86, the smp_mb() function translated to a mfence instruction which 
> cost time. That is why I try to get rid of it if it is not necessary.
> 

I also hope that the memory barrier is not necessary and I am missing
something obvious.  But I haven't been able to persuade myself.

Tim 





[PATCH v8 7/9] MCS Lock: Barrier corrections

2013-10-02 Thread Tim Chen
This patch corrects the way memory barriers are used in the MCS lock
and removes ones that are not needed. Also add comments on all barriers.

Signed-off-by: Jason Low 
---
 include/linux/mcs_spinlock.h |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index 96f14299..93d445d 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -36,16 +36,19 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct 
mcs_spinlock *node)
node->locked = 0;
node->next   = NULL;
 
+   /* xchg() provides a memory barrier */
prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
return;
}
ACCESS_ONCE(prev->next) = node;
-   smp_wmb();
/* Wait until the lock holder passes the lock down */
while (!ACCESS_ONCE(node->locked))
arch_mutex_cpu_relax();
+
+   /* Make sure subsequent operations happen after the lock is acquired */
+   smp_rmb();
 }
 
 /*
@@ -58,6 +61,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, 
struct mcs_spinlock *nod
 
if (likely(!next)) {
/*
+* cmpxchg() provides a memory barrier.
 * Release the lock by setting it to NULL
 */
if (likely(cmpxchg(lock, node, NULL) == node))
@@ -65,9 +69,14 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, 
struct mcs_spinlock *nod
/* Wait until the next pointer is set */
while (!(next = ACCESS_ONCE(node->next)))
arch_mutex_cpu_relax();
+   } else {
+   /*
+* Make sure all operations within the critical section
+* happen before the lock is released.
+*/
+   smp_wmb();
}
ACCESS_ONCE(next->locked) = 1;
-   smp_wmb();
 }
 
 #endif /* __LINUX_MCS_SPINLOCK_H */
-- 
1.7.4.4





[PATCH v8 1/9] rwsem: check the lock before cmpxchg in down_write_trylock

2013-10-02 Thread Tim Chen
Cmpxchg will cause cacheline bouncing when doing the value checking,
and that causes a scalability issue on a large machine (like an 80 core box).

So a pre-read of the lock can relieve this contention.
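
As a rough userspace C11 illustration of this design choice (not the kernel
code; the names here are made up for the example): a failed locked
compare-and-exchange still pulls the cache line into exclusive state, while a
plain load leaves it shared, so testing the value first avoids bouncing the
line when the trylock obviously cannot succeed.

#include <stdatomic.h>
#include <stdbool.h>

#define UNLOCKED        0L
#define WRITE_BIAS      (-1L)

static atomic_long count;       /* stand-in for sem->count, 0 == unlocked */

/* Always issues a locked RMW, dirtying the cache line even on failure. */
static bool trylock_cmpxchg_only(void)
{
        long expected = UNLOCKED;

        return atomic_compare_exchange_strong(&count, &expected, WRITE_BIAS);
}

/* Pre-read the lock word; bail out without an RMW when we cannot win. */
static bool trylock_with_preread(void)
{
        long expected = UNLOCKED;

        if (atomic_load_explicit(&count, memory_order_relaxed) != UNLOCKED)
                return false;

        return atomic_compare_exchange_strong(&count, &expected, WRITE_BIAS);
}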

Signed-off-by: Alex Shi 
---
 include/asm-generic/rwsem.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+   return 0;
 
-   tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
-   return tmp == RWSEM_UNLOCKED_VALUE;
+   return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4





[PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-02 Thread Tim Chen
We will need the MCS lock code for doing optimistic spinning for rwsem.
Extracting the MCS code from mutex.c and putting it into its own file
allows us to reuse this code easily for rwsem.

Reviewed-by: Ingo Molnar 
Reviewed-by: Peter Zijlstra 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/mcs_spinlock.h |   64 ++
 include/linux/mutex.h|5 ++-
 kernel/mutex.c   |   60 --
 3 files changed, 74 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
new file mode 100644
index 000..b5de3b0
--- /dev/null
+++ b/include/linux/mcs_spinlock.h
@@ -0,0 +1,64 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+   struct mcs_spinlock *next;
+   int locked; /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+   struct mcs_spinlock *prev;
+
+   /* Init node */
+   node->locked = 0;
+   node->next   = NULL;
+
+   prev = xchg(lock, node);
+   if (likely(prev == NULL)) {
+   /* Lock acquired */
+   node->locked = 1;
+   return;
+   }
+   ACCESS_ONCE(prev->next) = node;
+   smp_wmb();
+   /* Wait until the lock holder passes the lock down */
+   while (!ACCESS_ONCE(node->locked))
+   arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock 
*node)
+{
+   struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+   if (likely(!next)) {
+   /*
+* Release the lock by setting it to NULL
+*/
+   if (cmpxchg(lock, node, NULL) == node)
+   return;
+   /* Wait until the next pointer is set */
+   while (!(next = ACCESS_ONCE(node->next)))
+   arch_mutex_cpu_relax();
+   }
+   ACCESS_ONCE(next->locked) = 1;
+   smp_wmb();
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ccd4260..e6eaeea 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,6 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
+struct mcs_spinlock;
 struct mutex {
/* 1: unlocked, 0: locked, negative: locked, possible waiters */
atomic_tcount;
@@ -55,7 +56,7 @@ struct mutex {
struct task_struct  *owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   void*spin_mlock;/* Spinner MCS lock */
+   struct mcs_spinlock *mcs_lock;  /* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
@@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct 
mutex *lock);
 #define arch_mutex_cpu_relax() cpu_relax()
 #endif
 
-#endif
+#endif /* __LINUX_MUTEX_H */
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..4640731 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct 
lock_class_key *key)
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   lock->spin_mlock = NULL;
+   lock->mcs_lock = NULL;
 #endif
 
debug_mutex_init(lock, name, key);
@@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-   struct mspin_node *next ;
-   int   locked;   /* 1 if lock acquired */
-};
-#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
-
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *prev;
-
-   /* Init node */
-   node->

[PATCH v8 4/9] rwsem/wake: check lock before do atomic update

2013-10-02 Thread Tim Chen
Atomically updating the lock and then rolling back will cause cache
bouncing on a large machine.  A pre-read of the lock status can relieve
this problem.

Suggested-by: Davidlohr Bueso 
Suggested-by: Tim Chen 
Signed-off-by: Alex Shi 
---
 lib/rwsem.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
-   long oldcount, woken, loop, adjustment;
+   long woken, loop, adjustment;
 
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
while (1) {
+   long oldcount;
+
+   /* A writer stole the lock. */
+   if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+   return sem;
+
oldcount = rwsem_atomic_update(adjustment, sem)
- adjustment;
if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4





[PATCH v8 3/9] rwsem: remove try_reader_grant label do_wake

2013-10-02 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-   /* A writer stole the lock. Undo our reader grant. */
+   while (1) {
+   oldcount = rwsem_atomic_update(adjustment, sem)
+   - adjustment;
+   if (likely(oldcount >= RWSEM_WAITING_BIAS))
+   break;
+
+/* A writer stole the lock.  Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
return sem;
/* Last active locker left. Retry waking readers. */
-   goto try_reader_grant;
}
}
 
-- 
1.7.4.4





[PATCH v8 0/9] rwsem performance optimizations

2013-10-02 Thread Tim Chen
For version 8 of the patchset, we included the patch from Waiman
to streamline wakeup operations and also optimize the MCS lock
used in rwsem and mutex.

In this patchset, we introduce three categories of optimizations to the
read-write semaphore.  The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by doing a pre-read of sem->count
and avoiding cmpxchg if possible.

The next four patches from Tim, Davidlohr and Jason
introduce optimistic spinning logic similar to that in the
mutex code for the writer lock acquisition of rwsem. This addresses the
general 'mutexes outperform writer-rwsems' situation that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.  We have
also factored out the MCS lock originally in the mutex code into its
own file, and performed micro optimizations and corrected the memory
barriers so it could be used for general lock/unlock of critical
sections.
 
The last patch, from Waiman, helps to streamline the wakeup operation
by avoiding multiple threads all doing wakeup operations when only
one wakeup thread is enough.  This significantly reduces lock
contention from multiple wakeup threads.

Tim got the following improvement for the exim mail server
workload on a 40 core system:

Alex+Tim's patchset:   +4.8%
Alex+Tim+Waiman's patchset:+5.3%

Without these optimizations, Davidlohr Bueso saw an -8% regression in
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to a rwsem.  Tests were on an 8 socket, 80 core system.  With Alex
and Tim's patches, he got significant improvements to the aim7
suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

More Aim7 numbers will be posted when Davidlohr has a chance
to test the complete patchset including Waiman's patch.

Thanks to Ingo Molnar, Peter Hurley, Peter Zijlstra and Paul McKenney
for helping to review this patchset.

Tim

Changelog:

v8:
1. Added Waiman's patch to avoid multiple wakeup thread lock contention.
2. Micro-optimizations of MCS lock.
3. Correct the barriers of MCS lock to prevent critical sections from
leaking.

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename mcs related fields
with mcs prefix.
2. Properly define type of *mcs_lock field instead of leaving it as *void.
3. Added a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at wait queue.  This provides 5% boost to exim workload
and between 2% to 8% boost to aim7.
2. Put MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the
operations default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed
   a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner.
the option to be on for more seasoning but can be turned off should it be 
detrimental.
3. Various patch comments update

Alex Shi (4):
  rwsem: check the lock before cmpxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label do_wake
  rwsem/wake: check lock before do atomic update

Jason Low (2):
  MCS Lock: optimizations and extra comments
  MCS Lock: Barrier corrections

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
own file
  rwsem: do optimistic spinning for writer lock acquisition

Waiman Long (1):
  rwsem: reduce spinlock contention in wakeup code path

 include/asm-generic/rwsem.h  |8 +-
 include/linux/mcs_spinlock.h |   82 ++
 include/linux/mutex.h|5 +-
 include/linux/rwsem.h|9 ++-
 kernel/mutex.c   |   60 +-
 kernel/rwsem.c   |   19 +++-
 lib/rwsem.c  |  255 +-
 7 files changed, 349 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

-- 
1.7.4.4




[PATCH v8 6/9] MCS Lock: optimizations and extra comments

2013-10-02 Thread Tim Chen
Remove an unnecessary operation and mark the cmpxchg(lock, node, NULL) == node
check in mcs_spin_unlock() as likely(), since most of the time a race did not
occur.

Also add in more comments describing how the local node is used in MCS locks.

Signed-off-by: Jason Low 
---
 include/linux/mcs_spinlock.h |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index b5de3b0..96f14299 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -18,6 +18,12 @@ struct mcs_spinlock {
 };
 
 /*
+ * In order to acquire the lock, the caller should declare a local node and
+ * pass a reference of the node to this function in addition to the lock.
+ * If the lock has already been acquired, then this will proceed to spin
+ * on this node->locked until the previous lock holder sets the node->locked
+ * in mcs_spin_unlock().
+ *
  * We don't inline mcs_spin_lock() so that perf can correctly account for the
  * time spent in this lock function.
  */
@@ -33,7 +39,6 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct 
mcs_spinlock *node)
prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
-   node->locked = 1;
return;
}
ACCESS_ONCE(prev->next) = node;
@@ -43,6 +48,10 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct 
mcs_spinlock *node)
arch_mutex_cpu_relax();
 }
 
+/*
+ * Releases the lock. The caller should pass in the corresponding node that
+ * was used to acquire the lock.
+ */
 static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock 
*node)
 {
struct mcs_spinlock *next = ACCESS_ONCE(node->next);
@@ -51,7 +60,7 @@ static void mcs_spin_unlock(struct mcs_spinlock **lock, 
struct mcs_spinlock *nod
/*
 * Release the lock by setting it to NULL
 */
-   if (cmpxchg(lock, node, NULL) == node)
+   if (likely(cmpxchg(lock, node, NULL) == node))
return;
/* Wait until the next pointer is set */
while (!(next = ACCESS_ONCE(node->next)))
-- 
1.7.4.4





[PATCH v8 8/9] rwsem: do optimistic spinning for writer lock acquisition

2013-10-02 Thread Tim Chen
We want to add optimistic spinning to rwsems because the writer rwsem
does not perform as well as mutexes.  Tim noticed this for exim (mail
server) workloads when reverting commit 4fc3f1d6, and Davidlohr noticed
it when converting the i_mmap_mutex to a rwsem in some aim7 workloads.
We've noticed that the biggest difference is that when we fail to acquire
a mutex in the fastpath, optimistic spinning comes into play and we can
avoid a large amount of unnecessary sleeping and the overhead of moving
tasks in and out of the wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to be acquired.  With these changes, rwsem is on par with mutex.
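
For readers skimming the thread, a minimal, illustrative-only sketch of the
optimistic-spinning idea follows.  This is not the code added by the patch
(the real rwsem_optimistic_spin() body is truncated in the diff below), and
owner_is_running() and writer_trylock() are placeholder helpers, not actual
kernel APIs:

/* Sketch only: spin while the write owner is on a cpu, else go to sleep. */
static bool rwsem_optimistic_spin_sketch(struct rw_semaphore *sem)
{
        struct task_struct *owner;

        /*
         * Keep spinning only while the writer that owns the lock is itself
         * running on a cpu; if it has been preempted or is sleeping,
         * spinning just wastes cycles and we should block instead.
         */
        while ((owner = ACCESS_ONCE(sem->owner)) && owner_is_running(owner)) {
                /* Opportunistically try to grab the write lock. */
                if (writer_trylock(sem))
                        return true;    /* acquired without sleeping */

                arch_mutex_cpu_relax();
        }

        /* One last attempt now that there is no running write owner. */
        return writer_trylock(sem);
}

If this fails, the writer falls back to the existing sleeping slow path,
so the change only affects how long we try before queueing.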

Reviewed-by: Ingo Molnar 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Peter Hurley 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/rwsem.h |7 ++-
 kernel/rwsem.c|   19 +-
 lib/rwsem.c   |  201 -
 3 files changed, 206 insertions(+), 21 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..aba7920 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -22,10 +22,13 @@ struct rw_semaphore;
 #include  /* use a generic implementation */
 #else
 /* All arch specific implementations share the same struct */
+struct mcs_spinlock;
 struct rw_semaphore {
longcount;
raw_spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner; /* write owner */
+   struct mcs_spinlock *mcs_lock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map  dep_map;
 #endif
@@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)  \
{ RWSEM_UNLOCKED_VALUE, \
  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
- LIST_HEAD_INIT((name).wait_list)  \
+ LIST_HEAD_INIT((name).wait_list), \
+ NULL, \
+ NULL  \
  __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include 
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+   sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+   sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
int ret = __down_write_trylock(sem);
 
-   if (ret == 1)
+   if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+   rwsem_set_owner(sem);
+   }
return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
__up_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 * dependency.
 */
__downgrade_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct 
lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int 
subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..cc3b33e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+   sem->owner = NULL;
+   sem->mcs_lock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,177 @@ struct rw_semaphore __sched 
*rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
 }
 
+static inline int rwsem_try_write_lock(long count,

[PATCH v8 2/9] rwsem: remove 'out' label in do_wake

2013-10-02 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * will block as they will notice the queued writer.
 */
wake_up_process(waiter->task);
-   goto out;
+   return sem;
}
 
/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
/* A writer stole the lock. Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
-   goto out;
+   return sem;
/* Last active locker left. Retry waking readers. */
goto try_reader_grant;
}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
sem->wait_list.next = next;
next->prev = &sem->wait_list;
 
- out:
return sem;
 }
 
-- 
1.7.4.4



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v8 9/9] rwsem: reduce spinlock contention in wakeup code path

2013-10-02 Thread Tim Chen
With the 3.12-rc2 kernel, there is sizable spinlock contention on
the rwsem wakeup code path when running AIM7's high_systime workload
on an 8-socket 80-core DL980 (HT off), as reported by perf:

  7.64%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 |--41.77%-- rwsem_wake
  1.61%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irq
 |--92.37%-- rwsem_down_write_failed

That was 4.7% of recorded CPU cycles.

On a large NUMA machine, it is entirely possible that a fairly large
number of threads are queuing up in the ticket spinlock queue to do
the wakeup operation, when in fact only one of them is needed.  This patch
reduces the spinlock contention by letting only one thread perform the wakeup.

A new wakeup field is added to the rwsem structure.  This field is
set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
thread is pending to do the wakeup call.  It is cleared on exit from
those functions.  There is no size increase on 64-bit systems and a
4-byte size increase on 32-bit systems.

By checking if the wakeup flag is set, a thread can exit rwsem_wake()
immediately if another thread is already pending to do the wakeup, instead
of waiting to get the spinlock only to find out that nothing needs to be done.

The setting of the wakeup flag may not be immediately visible on all
processors on some architectures.  However, this won't affect program
correctness.  Clearing the wakeup flag before spin_unlock and other
barrier-type atomic instructions ensures that it is visible to all processors.
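
A sketch of the intended fast path in rwsem_wake() (illustrative only -- the
actual rwsem_wake() hunk is truncated below, so this is reconstructed from
the description above rather than copied from the patch):

struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
	unsigned long flags;

	/*
	 * Another thread is already doing the wakeup; no need to queue
	 * up on the wait_lock just to discover there is nothing to do.
	 */
	if (ACCESS_ONCE(sem->wakeup))
		return sem;

	raw_spin_lock_irqsave(&sem->wait_lock, flags);
	if (!list_empty(&sem->wait_list))
		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

	return sem;
}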

With this patch alone, the performance improvement on jobs per minute
(JPM) of a sample run of the high_systime workload (at 1500 users)
on the DL980 was as follows:

HT      JPM w/o patch   JPM with patch  % Change
--      -------------   --------------  --------
off     148265          170896          +15.3%
on      140078          159319          +13.7%

The new perf profile (HT off) was as follows:

  2.96%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 |--0.94%-- rwsem_wake
  1.00%   reaim  [kernel.kallsyms]   [k] _raw_spin_lock_irq
 |--88.70%-- rwsem_down_write_failed

Together with the rest of the rwsem patches in the series, the JPM number
(HT off) jumps to 195041, which is 32% better.

Signed-off-by: Waiman Long 
---
 include/linux/rwsem.h |2 ++
 lib/rwsem.c   |   29 +
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index aba7920..29314d3 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -26,6 +26,7 @@ struct mcs_spinlock;
 struct rw_semaphore {
longcount;
raw_spinlock_t  wait_lock;
+   int wakeup; /* Waking-up in progress flag */
struct list_headwait_list;
struct task_struct  *owner; /* write owner */
struct mcs_spinlock *mcs_lock;
@@ -61,6 +62,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)  \
{ RWSEM_UNLOCKED_VALUE, \
  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
+ 0,\
  LIST_HEAD_INIT((name).wait_list), \
  NULL, \
  NULL  \
diff --git a/lib/rwsem.c b/lib/rwsem.c
index cc3b33e..1adee01 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -27,6 +27,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
lockdep_init_map(&sem->dep_map, name, key, 0);
 #endif
sem->count = RWSEM_UNLOCKED_VALUE;
+   sem->wakeup = 0;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
sem->owner = NULL;
@@ -70,6 +71,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
struct list_head *next;
long woken, loop, adjustment;
 
+   sem->wakeup = 1;/* Waking up in progress */
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
@@ -79,6 +81,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * will block as they will notice the queued writer.
 */
wake_up_process(waiter->task);
+   sem->wakeup = 0;/* Wakeup done */
return sem;
}
 
@@ -87,6 +90,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * so we can bail out early if a writer stole the lock.
 */
adjustment = 0;
+   sem->wakeup = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
while (1) {
@@ -426,11 +430,36 @@ struct rw_semaphore *rwsem_wake(st

[PATCH 1/2] fs/superblock: Unregister sb shrinker before ->kill_sb()

2013-09-06 Thread Tim Chen
Dave, 

I've created the two patches we've discussed on the
mail thread regarding avoiding grab_super_passive during
memory reclaim.

Tim

From:   Dave Chinner 
We would like to unregister the sb shrinker before ->kill_sb().
This allows cached objects to be counted without a call to
grab_super_passive() to update the ref count on the sb.  We want
to avoid locking during memory reclamation, especially when
we are skipping the reclaim because we are out of
cached objects.

This is safe because grab_super_passive() now does a try-lock on
sb->s_umount, and so it won't ever block if we are in the unmount process.
That means the deadlock and races we used to avoid
by using grab_super_passive() become:

shrinker                                umount

down_read(shrinker_rwsem)
                                        down_write(sb->s_umount)
                                        shrinker_unregister
                                          down_write(shrinker_rwsem)
grab_super_passive(sb)
  down_read_trylock(sb->s_umount)
up_read(shrinker_rwsem)
                                          up_write(shrinker_rwsem)
                                        ->kill_sb()

So it is safe to deregister the shrinker before ->kill_sb().

Signed-off-by: Tim Chen 
---
 fs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 73d0952..b724f35 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -324,10 +324,10 @@ void deactivate_locked_super(struct super_block *s)
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
cleancache_invalidate_fs(s);
+   unregister_shrinker(&s->s_shrink);
fs->kill_sb(s);
 
/* caches are now gone, we can safely kill the shrinker now */
-   unregister_shrinker(&s->s_shrink);
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);
 
-- 
1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-09-06 Thread Tim Chen
On Fri, 2013-09-06 at 10:55 +1000, Dave Chinner wrote:
> On Tue, Sep 03, 2013 at 11:38:27AM -0700, Tim Chen wrote:
> > On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote:
> > > On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote:
> > > >
> > > > 
> > > > Signed-off-by: Tim Chen 
> > > > ---
> > > > diff --git a/fs/super.c b/fs/super.c
> > > > index 73d0952..4df1fab 100644
> > > > --- a/fs/super.c
> > > > +++ b/fs/super.c
> > > > @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct 
> > > > shrinker *shrink,
> > > >  
> > > > sb = container_of(shrink, struct super_block, s_shrink);
> > > >  
> > > > -   if (!grab_super_passive(sb))
> > > > -   return 0;
> > > > -
> > > 
> > > I think the function needs a comment explaining why we aren't
> > > grabbing the sb here, otherwise people are going to read the code
> > > and ask why it's different to the scanning callout.
> > > 
> > > > if (sb->s_op && sb->s_op->nr_cached_objects)
> > > > total_objects = sb->s_op->nr_cached_objects(sb,
> > > >  sc->nid);
> > > 
> > 
> > Yes, those comments are needed.
> > I also need to remove the corresponding
> > drop_super(sb);
> > 
> > So probably something like:
> > 
> > ---
> > diff --git a/fs/super.c b/fs/super.c
> > index 73d0952..7b5a6e5 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker 
> > *shrink,
> >  
> > sb = container_of(shrink, struct super_block, s_shrink);
> >  
> > -   if (!grab_super_passive(sb))
> > -   return 0;
> > -
> > +   /*
> > +* Don't call grab_super_passive as it is a potential 
> > +* scalability bottleneck. The counts could get updated 
> > +* between super_cache_count and super_cache_scan anyway.
> > +* Call to super_cache_count with shrinker_rwsem held
> > +* ensures the safety of call to list_lru_count_node() and 
> > +* s_op->nr_cached_objects().
> > +*/
> 
> Well, that's not true of s_op->nr_cached_objects() right now. It's
> only going to be true if the shrinker deregistration is moved before
> ->kill_sb()
> 
> > > Let me have a bit more of a think about this - the solution may
> > > simply be unregistering the shrinker before we call ->kill_sb() so
> > > the shrinker can't get called while we are tearing down the fs.
> > > First, though, I need to go back and remind myself of why I put that
> > > after ->kill_sb() in the first place.  
> > 
> > Seems very reasonable as I haven't found a case where the shrinker 
> > is touched in ->kill_sb() yet. It looks like unregistering the
> > shrinker before ->kill_sb() should be okay.
> 
> Having looked at it some more, I have to agree. I think the original
> reason for unregistering the shrinker there was to avoid problems
> with locking - the shrinker callouts are run holding the
> shrinker_rwsem in read mode, and then we lock the sb->s_umount in
> read mount. In the unmount case, we currently take the sb->s_umount
> lock in write mode (thereby locking out the shrinker) but we drop it
> before deregistering the shrinker and so there is no inverted
> locking order.
> 
> The thing is, grab_super_passive does a try-lock on the sb->s_umount
> now, and so if we are in the unmount process, it won't ever block.
> That means what used to be a deadlock and races we were avoiding
> by using grab_super_passive() is now:
> 
> shrinker                                umount
> 
> down_read(shrinker_rwsem)
>                                         down_write(sb->s_umount)
>                                         shrinker_unregister
>                                           down_write(shrinker_rwsem)
> grab_super_passive(sb)
>   down_read_trylock(sb->s_umount)
> up_read(shrinker_rwsem)
>                                           up_write(shrinker_rwsem)
>                                         ->kill_sb()
> 
> 
> And so it appears to be safe to deregister the shrinker before
> ->kill_sb().
> 
> Can you do this as two patches? The first moves the shrinker
> deregistration to before ->kill_sb(), then second is the above patch
> that drops the grab-super_passive() calls from the ->count_objects
> function?

I've sent the patches on a separate mail thread.  Please add
your sign-off if the patches look okay.

Thanks.

Tim


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] fs/superblock: Avoid locking counting inodes and dentries before reclaiming them

2013-09-06 Thread Tim Chen
We remove the call to grab_super_passive in super_cache_count.
It becomes a scalability bottleneck when multiple threads are trying
to do memory reclamation, e.g. when we are doing a large amount of
file reads and the page cache is under pressure.  The cached objects
quickly get reclaimed down to 0 and we abort the cache_scan()
reclaim, but the counting creates a log jam acquiring the sb_lock.

We are holding the shrinker_rwsem, which ensures the safety of the
calls to list_lru_count_node() and s_op->nr_cached_objects().  The
shrinker is now unregistered before ->kill_sb(), so the operation is
safe during unmount.

Signed-off-by: Tim Chen 
---
 fs/super.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b724f35..b5c9fdf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker 
*shrink,
 
sb = container_of(shrink, struct super_block, s_shrink);
 
-   if (!grab_super_passive(sb))
-   return 0;
-
+   /*
+* Don't call grab_super_passive as it is a potential
+* scalability bottleneck. The counts could get updated
+* between super_cache_count and super_cache_scan anyway.
+* Call to super_cache_count with shrinker_rwsem held
+* ensures the safety of call to list_lru_count_node() and
+* s_op->nr_cached_objects().
+*/
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb,
 sc->nid);
@@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker 
*shrink,
 sc->nid);
 
total_objects = vfs_pressure_ratio(total_objects);
-   drop_super(sb);
return total_objects;
 }
 
-- 
1.7.11.7


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-07 Thread Tim Chen
On Thu, 2013-10-03 at 09:32 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > For version 8 of the patchset, we included the patch from Waiman to 
> > streamline wakeup operations and also optimize the MCS lock used in 
> > rwsem and mutex.
> 
> I'd be feeling a lot easier about this patch series if you also had 
> performance figures that show how mmap_sem is affected.
> 
> These:
> 
> > Tim got the following improvement for exim mail server 
> > workload on 40 core system:
> > 
> > Alex+Tim's patchset:   +4.8%
> > Alex+Tim+Waiman's patchset:+5.3%
> 
> appear to be mostly related to the anon_vma->rwsem. But once that lock is 
> changed to an rwlock_t, this measurement falls away.
> 
> Peter Zijlstra suggested the following testcase:
> 
> ===>
> In fact, try something like this from userspace:
> 
> n-threads:
> 
>   pthread_mutex_lock(&mutex);
>   foo = mmap();
>   pthread_mutex_lock(&mutex);
> 
>   /* work */
> 
>   pthread_mutex_unlock(&mutex);
>   munma(foo);
>   pthread_mutex_unlock(&mutex);
> 
> vs
> 
> n-threads:
> 
>   foo = mmap();
>   /* work */
>   munmap(foo);


Ingo,

I ran the vanilla kernel, the kernel with all rwsem patches and the
kernel with all patches except the optimistic spin one.  
I am listing two presentations of the data.  Please note that
there is about 5% run-run variation.

% change in performance vs vanilla kernel
#threads        all     without optspin
mmap only
1               1.9%    1.6%
5               43.8%   2.6%
10              22.7%   -3.0%
20              -12.0%  -4.5%
40              -26.9%  -2.0%
mmap with mutex acquisition
1               -2.1%   -3.0%
5               -1.9%   1.0%
10              4.2%    12.5%
20              -4.1%   0.6%
40              -2.8%   -1.9%

The optimistic spin case does very well at low to moderate contention,
but worse under very heavy contention for the pure mmap case.
For the case with the pthread mutex, there's not much change from the
vanilla kernel.

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads        vanilla all     without optspin
1               3.0%    -1.0%   -1.7%
5               7.2%    -26.8%  5.5%
10              5.2%    -10.6%  22.1%
20              6.8%    16.4%   12.5%
40              -0.2%   32.7%   0.0%

In general, the vanilla and no-optspin cases perform better with
pthread-mutex.  For the case with optspin, mmap with
pthread-mutex is worse at low to moderate contention and better
at high contention.

Tim

> 
> I've had reports that the former was significantly faster than the
> latter.
> <===
> 
> this could be put into a standalone testcase, or you could add it as a new 
> subcommand of 'perf bench', which already has some pthread code, see for 
> example in tools/perf/bench/sched-messaging.c. Adding:
> 
>perf bench mm threads
> 
> or so would be a natural thing to have.
> 
> Thanks,
> 
>   Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 5/9] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-10-08 Thread Tim Chen
On Tue, 2013-10-08 at 16:51 -0300, Rafael Aquini wrote:
> On Wed, Oct 02, 2013 at 03:38:32PM -0700, Tim Chen wrote:
> > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > Extracting the MCS code from mutex.c and put into its own file allow us
> > to reuse this code easily for rwsem.
> > 
> > Reviewed-by: Ingo Molnar 
> > Reviewed-by: Peter Zijlstra 
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Davidlohr Bueso 
> > ---
> >  include/linux/mcs_spinlock.h |   64 
> > ++
> >  include/linux/mutex.h|5 ++-
> >  kernel/mutex.c   |   60 --
> >  3 files changed, 74 insertions(+), 55 deletions(-)
> >  create mode 100644 include/linux/mcs_spinlock.h
> > 
> > diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
> > new file mode 100644
> > index 000..b5de3b0
> > --- /dev/null
> > +++ b/include/linux/mcs_spinlock.h
> > @@ -0,0 +1,64 @@
> > +/*
> > + * MCS lock defines
> > + *
> > + * This file contains the main data structure and API definitions of MCS 
> > lock.
> > + *
> > + * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple 
> > spin-lock
> > + * with the desirable properties of being fair, and with each cpu trying
> > + * to acquire the lock spinning on a local variable.
> > + * It avoids expensive cache bouncings that common test-and-set spin-lock
> > + * implementations incur.
> > + */
> 
> nitpick:
> 
> I believe you need 
> 
> +#include 
> 
> here, to avoid breaking the build when arch_mutex_cpu_relax() is not defined
> (arch/s390 is one case)

Probably

+#include <linux/mutex.h>

should be added instead?
It defines arch_mutex_cpu_relax() when there's no
architecture-specific version.

Thanks.
Tim


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 0/9] rwsem performance optimizations

2013-10-09 Thread Tim Chen
On Wed, 2013-10-09 at 08:15 +0200, Ingo Molnar wrote:
> * Tim Chen  wrote:
> 
> > Ingo,
> > 
> > I ran the vanilla kernel, the kernel with all rwsem patches and the 
> > kernel with all patches except the optimistic spin one.  I am listing 
> > two presentations of the data.  Please note that there is about 5% 
> > run-run variation.
> > 
> > % change in performance vs vanilla kernel
> > #threads        all     without optspin
> > mmap only
> > 1               1.9%    1.6%
> > 5               43.8%   2.6%
> > 10              22.7%   -3.0%
> > 20              -12.0%  -4.5%
> > 40              -26.9%  -2.0%
> > mmap with mutex acquisition
> > 1               -2.1%   -3.0%
> > 5               -1.9%   1.0%
> > 10              4.2%    12.5%
> > 20              -4.1%   0.6%
> > 40              -2.8%   -1.9%
> 
> Silly question: how do the two methods of starting N threads compare to 
> each other? 

They both started N pthreads and ran for a fixed time.
The throughput of pure mmap with mutex vs pure mmap is below:

% change in performance of the mmap with pthread-mutex vs pure mmap
#threads        vanilla all rwsem       without optspin
                        patches
1               3.0%    -1.0%           -1.7%
5               7.2%    -26.8%          5.5%
10              5.2%    -10.6%          22.1%
20              6.8%    16.4%           12.5%
40              -0.2%   32.7%           0.0%

So with the mutex, the vanilla kernel and the one without optspin both
run faster.  This is consistent with what Peter reported.  With
optspin, the picture is more mixed, with lower throughput at a low to
moderate number of threads and higher throughput with a high number
of threads.

> Do they have identical runtimes? 

Yes, they both have identical runtimes.  I looked at the number
of mmap and munmap operations I could push through.

> I think PeterZ's point was 
> that the pthread_mutex case, despite adding extra serialization, actually 
> runs faster in some circumstances.

Yes, I also see the pthread-mutex case run faster on the vanilla kernel
in the data above.

> 
> Also, mind posting the testcase? What 'work' do the threads do - clear 
> some memory area? 

The test case does a simple mmap and munmap of 1MB of memory per iteration.

> How big is the memory area?

1MB

The two cases are created as:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
while (1) {
char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
assert(c != MAP_FAILED);
munmap(c, MEMSIZE);

(*iterations)++;
}
}

and adding mutex to serialize:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB with
mutex";

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void testcase(unsigned long long *iterations)
{
while (1) {
pthread_mutex_lock(&mutex);
char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
assert(c != MAP_FAILED);
munmap(c, MEMSIZE);
pthread_mutex_unlock(&mutex);

(*iterations)++;
}
}

Each test case is then run as N pthreads for a fixed time (a rough sketch of
the harness follows).
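
The harness side looks roughly like this (a sketch only -- the real benchmark
additionally pins threads and pads the per-thread counters to avoid false
sharing; NR_THREADS was varied from 1 to 40 in the runs above):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 5

extern void testcase(unsigned long long *iterations);

static void *worker(void *arg)
{
	testcase(arg);		/* loops forever, bumping the counter */
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned long long iterations[NR_THREADS] = {0};
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, &iterations[i]);

	sleep(30);		/* fixed runtime */
	for (i = 0; i < NR_THREADS; i++)
		printf("thread %d: %llu iterations\n", i, iterations[i]);
	return 0;
}
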
> 
> I'd expect this to be about large enough mmap()s showing page fault 
> processing to be mmap_sem bound and the serialization via pthread_mutex() 
> sets up a 'train' of threads in one case, while the parallel startup would 
> run into the mmap_sem in the regular case.
> 
> So I'd expect this to be a rather sensitive workload and you'd have to 
> actively engineer it to hit the effect PeterZ mentioned. I could imagine 
> MPI workloads to run into such patterns - but not deterministically.
> 
> Only once you've convinced yourself that you are hitting that kind of 
> effect reliably on the vanilla kernel, could/should the effects of an 
> improved rwsem implementation be measured.
> 
> Thanks,
> 
>   Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-08-29 Thread Tim Chen

> > Signed-off-by: Tim Chen 
> > ---
> >  fs/super.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 68307c0..70fa26c 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -53,6 +53,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
> >   * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
> >   * take a passive reference to the superblock to avoid this from occurring.
> >   */
> > +#define SB_CACHE_LOW 5
> >  static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
> >  {
> > struct super_block *sb;
> > @@ -68,6 +69,13 @@ static int prune_super(struct shrinker *shrink, struct 
> > shrink_control *sc)
> > if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> > return -1;
> >  
> > +   /*
> > +* Don't prune if we have few cached objects to reclaim to
> > +* avoid useless sb_lock contention
> > +*/
> > +   if ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) <= SB_CACHE_LOW)
> > +   return -1;
> 
> Those counters no longer exist in the current mmotm tree and the
> shrinker infrastructure is somewhat different, so this patch isn't
> the right way to solve this problem.

These changes in the mmotm tree do complicate solutions for this
scalability issue.

> 
> Given that superblock LRUs and shrinkers in mmotm are node aware,
> there may even be more pressure on the sblock in such a workload.  I
> think the right way to deal with this is to give the shrinker itself
> a "minimum call count" so that we can avoid even attempting to
> shrink caches that does have enough entries in them to be worthwhile
> shrinking.

By "minimum call count", do you mean tracking the number of free
entries per node in the shrinker, and invoking the shrinker
only when the number of free entries
exceeds the "minimum call count"?  There is some cost in
list_lru_count_node to get the free entries, as we need
to acquire the node's lru lock.  Alternatively, we can
set a special per-node flag from list_add or list_del when the count goes
above/below a threshold and invoke the shrinker based on this flag.

Or do you mean that if we do not reap any memory in a shrink
operation, we back off the shrink operations a certain number of times
until the "minimum call count" is reached?
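
To make the first interpretation concrete, a rough sketch (purely
illustrative -- the helper name and the threshold value are made up, and
where exactly such a check would sit in the shrinker core is an open
question):

#include <linux/shrinker.h>

#define SHRINKER_MIN_OBJECTS	5	/* made-up threshold */

static bool shrinker_worth_calling(struct shrinker *shrinker,
				   struct shrink_control *sc)
{
	unsigned long count = shrinker->count_objects(shrinker, sc);

	/*
	 * Skip the ->scan_objects() callout when there is too little
	 * to reclaim to be worth the locking overhead.
	 */
	return count > SHRINKER_MIN_OBJECTS;
}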


> 
> That said, the memcg guys have been saying that even small numbers
> of items per cache can be meaningful in terms of memory reclaim
> (e.g. when there are lots of memcgs) then such a threshold might
> only be appropriate for caches that are not memcg controlled. 

I've done some experiments with the CACHE thresholds.  Even setting
the threshold at 0 (i.e. there are no free entries) removes almost all
the needless contention.  That should make the memcg guys happy by
not holding any extra free entries.

> In
> that case, handling it in the shrinker infrastructure itself is a
> much better idea than hacking thresholds into individual shrinker
> callouts.

Currently the problem is mostly with the sb shrinker due to the
sb_lock.  If we can have a general solution, that will be even
better.

Thanks.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-08-30 Thread Tim Chen
On Fri, 2013-08-30 at 11:40 +1000, Dave Chinner wrote:

> 
> The new shrinker infrastructure has a ->count_objects() callout to
> specifically return the number of objects in the cache.
> shrink_slab_node() can check that return value against the "minimum
> call count" and determine whether it needs to call ->scan_objects()
> at all.
> 
> Actually, the shrinker already behaves like this with the batch_size
> variable - the shrinker has to be asking for more items to be
> scanned than the batch size. That means the problem is that counting
> callouts are causing the problem, not the scanning callouts.
> 
> With the new code in the mmotm tree, for counting purposes we
> probably don't need to grab a passive superblock reference at all -
> the superblock and LRUs are guaranteed to be valid if the shrinker
> is currently running, but we don't really care if the superblock is
> being shutdown and the values that come back are invalid because the
> ->scan_objects() callout will fail to grab the superblock to do
> anything with the calculated values.

If that's the case, then we should remove grab_super_passive
from the super_cache_count code.  That should remove the bottleneck
in reclamation.

Thanks for your detailed explanation.

Tim

Signed-off-by: Tim Chen 
---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..4df1fab 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker 
*shrink,
 
sb = container_of(shrink, struct super_block, s_shrink);
 
-   if (!grab_super_passive(sb))
-   return 0;
-
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb,
 sc->nid);




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Avoid useless inodes and dentries reclamation

2013-09-03 Thread Tim Chen
On Sat, 2013-08-31 at 19:00 +1000, Dave Chinner wrote:
> On Fri, Aug 30, 2013 at 09:21:34AM -0700, Tim Chen wrote:
> >
> > 
> > Signed-off-by: Tim Chen 
> > ---
> > diff --git a/fs/super.c b/fs/super.c
> > index 73d0952..4df1fab 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -112,9 +112,6 @@ static unsigned long super_cache_count(struct shrinker 
> > *shrink,
> >  
> > sb = container_of(shrink, struct super_block, s_shrink);
> >  
> > -   if (!grab_super_passive(sb))
> > -   return 0;
> > -
> 
> I think the function needs a comment explaining why we aren't
> grabbing the sb here, otherwise people are going to read the code
> and ask why it's different to the scanning callout.
> 
> > if (sb->s_op && sb->s_op->nr_cached_objects)
> > total_objects = sb->s_op->nr_cached_objects(sb,
> >  sc->nid);
> 

Yes, those comments are needed.
I also need to remove the corresponding
drop_super(sb);

So probably something like:

---
diff --git a/fs/super.c b/fs/super.c
index 73d0952..7b5a6e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -112,9 +112,14 @@ static unsigned long super_cache_count(struct shrinker 
*shrink,
 
sb = container_of(shrink, struct super_block, s_shrink);
 
-   if (!grab_super_passive(sb))
-   return 0;
-
+   /*
+* Don't call grab_super_passive as it is a potential 
+* scalability bottleneck. The counts could get updated 
+* between super_cache_count and super_cache_scan anyway.
+* Call to super_cache_count with shrinker_rwsem held
+* ensures the safety of call to list_lru_count_node() and 
+* s_op->nr_cached_objects().
+*/
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb,
 sc->nid);
@@ -125,7 +130,6 @@ static unsigned long super_cache_count(struct shrinker 
*shrink,
 sc->nid);
 
total_objects = vfs_pressure_ratio(total_objects);
-   drop_super(sb);
return total_objects;
 }
 


> But seeing this triggered further thought on my part. Being called
> during unmount means that ->nr_cached_objects implementations need
> to be robust against unmount tearing down private filesystem
> structures.  Right now, grab_super_passive() protects us from that
> because it won't be able to get the sb->s_umount lock while
> generic_shutdown_super() is doing it's work.
> 
> IOWs, the superblock based shrinker operations are safe because the
> structures don't get torn down until after the shrinker is
> unregistered. That's not true for the structures that
> ->nr_cached_objects() use: ->put_super() tears them down before the
> shrinker is unregistered and only grab_super_passive() protects us
> from thay.
> 
> Let me have a bit more of a think about this - the solution may
> simply be unregistering the shrinker before we call ->kill_sb() so
> the shrinker can't get called while we are tearing down the fs.
> First, though, I need to go back and remind myself of why I put that
> after ->kill_sb() in the first place.  

Seems very reasonable as I haven't found a case where the shrinker 
is touched in ->kill_sb() yet. It looks like unregistering the
shrinker before ->kill_sb() should be okay.

> If we unregister the shrinker
> before ->kill_sb is called, then we can probably get rid of
> grab_super_passive() in both shrinker callouts because they will no
> longer need to handle running concurrently with ->kill_sb()
> 

Thanks.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-26 Thread Tim Chen
On Thu, 2013-09-26 at 10:40 +0200, Peter Zijlstra wrote:
> On Thu, Sep 26, 2013 at 08:46:29AM +0200, Ingo Molnar wrote:
> > > +/*
> > > + * MCS lock defines
> > > + *
> > > + * This file contains the main data structure and API definitions of MCS 
> > > lock.
> > 
> > A (very) short blurb about what an MCS lock is would be nice here.
> 
> A while back I suggested including a link to something like:
> 
> http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
> 
> Its a fairly concise write-up of the idea; only 6 pages. The sad part
> about linking to the web is that links tend to go dead after a while.

Link rot is a problem.  If I provide a few details about the MCS lock, I
think people should be able to google for it.

Tim



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-26 Thread Tim Chen
On Thu, 2013-09-26 at 14:09 -0700, Jason Low wrote:
> On Thu, 2013-09-26 at 13:40 -0700, Davidlohr Bueso wrote:
> > On Thu, 2013-09-26 at 13:23 -0700, Jason Low wrote:
> > > On Thu, 2013-09-26 at 13:06 -0700, Davidlohr Bueso wrote:
> > > > On Thu, 2013-09-26 at 12:27 -0700, Jason Low wrote:
> > > > > On Wed, Sep 25, 2013 at 3:10 PM, Tim Chen 
> > > > >  wrote:
> > > > > > We will need the MCS lock code for doing optimistic spinning for 
> > > > > > rwsem.
> > > > > > Extracting the MCS code from mutex.c and put into its own file 
> > > > > > allow us
> > > > > > to reuse this code easily for rwsem.
> > > > > >
> > > > > > Signed-off-by: Tim Chen 
> > > > > > Signed-off-by: Davidlohr Bueso 
> > > > > > ---
> > > > > >  include/linux/mcslock.h |   58 
> > > > > > +++
> > > > > >  kernel/mutex.c  |   58 
> > > > > > +-
> > > > > >  2 files changed, 65 insertions(+), 51 deletions(-)
> > > > > >  create mode 100644 include/linux/mcslock.h
> > > > > >
> > > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > > > > > new file mode 100644
> > > > > > index 000..20fd3f0
> > > > > > --- /dev/null
> > > > > > +++ b/include/linux/mcslock.h
> > > > > > @@ -0,0 +1,58 @@
> > > > > > +/*
> > > > > > + * MCS lock defines
> > > > > > + *
> > > > > > + * This file contains the main data structure and API definitions 
> > > > > > of MCS lock.
> > > > > > + */
> > > > > > +#ifndef __LINUX_MCSLOCK_H
> > > > > > +#define __LINUX_MCSLOCK_H
> > > > > > +
> > > > > > +struct mcs_spin_node {
> > > > > > +   struct mcs_spin_node *next;
> > > > > > +   int   locked;   /* 1 if lock acquired */
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * We don't inline mcs_spin_lock() so that perf can correctly 
> > > > > > account for the
> > > > > > + * time spent in this lock function.
> > > > > > + */
> > > > > > +static noinline
> > > > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct 
> > > > > > mcs_spin_node *node)
> > > > > > +{
> > > > > > +   struct mcs_spin_node *prev;
> > > > > > +
> > > > > > +   /* Init node */
> > > > > > +   node->locked = 0;
> > > > > > +   node->next   = NULL;
> > > > > > +
> > > > > > +   prev = xchg(lock, node);
> > > > > > +   if (likely(prev == NULL)) {
> > > > > > +   /* Lock acquired */
> > > > > > +   node->locked = 1;
> > > > > 
> > > > > If we don't spin on the local node, is it necessary to set this 
> > > > > variable?
> > > > 
> > > > I don't follow, the whole idea is to spin on the local variable.
> > > 
> > > If prev == NULL, doesn't that mean it won't proceed to spin on the
> > > variable because the lock is already free and we call return? In that
> > > case where we directly acquire the lock, I was wondering if it is
> > > necessary to set node->locked = 1.
> > 
> > Yes, that's true, but we need to flag the lock as acquired (the node's
> > lock is initially set to unlocked), otherwise others trying to acquire
> > the lock can spin forever:
> > 
> > /* Wait until the lock holder passes the lock down */
> > while (!ACCESS_ONCE(node->locked))
> > arch_mutex_cpu_relax();
> > 
> > The ->locked variable in this implementation refers to if the lock is
> > acquired, and *not* to if busy-waiting is necessary.
> 
> hmm, others threads acquiring the lock will be spinning on their own
> local nodes, not this node's node->locked. And if prev == NULL, the
> current thread won't be reading it's node->lock either since we return.
> So what other thread is going to be reading this node's node->lock?
> 
> Thanks,
> Jason

I think setting node->locked = 1 for the prev == NULL case is not
functionally necessary, but was done for semantic consistency.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v7 6/6] rwsem: do optimistic spinning for writer lock acquisition

2013-09-26 Thread Tim Chen
We want to add optimistic spinning to rwsems because the writer rwsem does
not perform as well as mutexes.  Tim noticed this for exim (mail server)
workloads when reverting commit 4fc3f1d6, and Davidlohr noticed it when
converting the i_mmap_mutex to an rwsem in some aim7 workloads.  The biggest
difference is that when we fail to acquire a mutex in the fastpath, optimistic
spinning comes into play and we can avoid a large amount of unnecessary
sleeping and the overhead of moving tasks in and out of the wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to get acquired.  With these changes, rwsem is on par with mutex.

Reviewed-by: Ingo Molnar 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Peter Hurley 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/rwsem.h |7 ++-
 kernel/rwsem.c|   19 +-
 lib/rwsem.c   |  201 -
 3 files changed, 206 insertions(+), 21 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..aba7920 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -22,10 +22,13 @@ struct rw_semaphore;
 #include  /* use a generic implementation */
 #else
 /* All arch specific implementations share the same struct */
+struct mcs_spinlock;
 struct rw_semaphore {
longcount;
raw_spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner; /* write owner */
+   struct mcs_spinlock *mcs_lock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map  dep_map;
 #endif
@@ -58,7 +61,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)  \
{ RWSEM_UNLOCKED_VALUE, \
  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
- LIST_HEAD_INIT((name).wait_list)  \
+ LIST_HEAD_INIT((name).wait_list), \
+ NULL, \
+ NULL  \
  __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include 
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+   sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+   sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
int ret = __down_write_trylock(sem);
 
-   if (ret == 1)
+   if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+   rwsem_set_owner(sem);
+   }
return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
__up_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 * dependency.
 */
__downgrade_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct 
lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int 
subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..cc3b33e 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+   sem->owner = NULL;
+   sem->mcs_lock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,177 @@ struct rw_semaphore __sched 
*rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
 }
 
+static inline int rwsem_try_write_lock(long count,

[PATCH v7 0/6] rwsem: performance optimizations

2013-09-26 Thread Tim Chen
We fixed various style issues for version 7 of this patchset
and added brief explanations of the MCS lock code that we put
in a separate file.  We would like to have it merged if
there are no objections.

In this patchset, we introduce two categories of optimizations to the read-
write semaphore.  The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by doing a pre-read of sem->count
and avoiding cmpxchg if possible.

The last two patches introduce optimistic spinning logic similar to the
mutex code for the writer lock acquisition of rwsem.  This addresses the
general 'mutexes outperform writer-rwsems' situation that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.

Without these optimizations, Davidlohr Bueso saw a -8% regression in
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to an rwsem.  Tests were on an 8-socket 80-core system.  With the patchset,
he got significant improvements to the aim7 suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

Tim Chen also got a +5% improvement in the exim mail server workload on a
40-core system.

Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing
this patchset.

Regards,
Tim Chen

Changelog:

v7:
1. Rename mcslock.h to mcs_spinlock.h and also rename the mcs related fields
with an mcs prefix.
2. Properly define the type of the *mcs_lock field instead of leaving it as *void.
3. Add a brief explanation of the MCS lock.

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at the wait queue.  This provides a 5% boost to the exim
workload and between a 2% and 8% boost to aim7.
2. Put the MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c.
3. Remove the RWSEM_SPIN_ON_WRITE_OWNER config option and make optimistic
spinning the default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fix
   a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner, and default
   the spin-on-write-owner option to on for more seasoning; it can be turned off
   should it prove detrimental.
3. Various patch comment updates

Alex Shi (4):
  rwsem: check the lock before cmpxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label do_wake
  rwsem/wake: check lock before do atomic update

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
own file
  rwsem: do optimistic spinning for writer lock acquisition

 include/asm-generic/rwsem.h  |8 +-
 include/linux/mcs_spinlock.h |   64 
 include/linux/mutex.h|5 +-
 include/linux/rwsem.h|7 +-
 kernel/mutex.c   |   60 ++--
 kernel/rwsem.c   |   19 -
 lib/rwsem.c  |  226 -
 7 files changed, 300 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

-- 
1.7.4.4


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v7 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-26 Thread Tim Chen
We will need the MCS lock code for doing optimistic spinning for rwsem.
Extracting the MCS code from mutex.c and putting it into its own file allows
us to reuse this code easily for rwsem.

Reviewed-by: Ingo Molnar 
Reviewed-by: Peter Zijlstra 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/mcs_spinlock.h |   64 ++
 include/linux/mutex.h|5 ++-
 kernel/mutex.c   |   60 --
 3 files changed, 74 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/mcs_spinlock.h

diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
new file mode 100644
index 000..b5de3b0
--- /dev/null
+++ b/include/linux/mcs_spinlock.h
@@ -0,0 +1,64 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ *
+ * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
+ * with the desirable properties of being fair, and with each cpu trying
+ * to acquire the lock spinning on a local variable.
+ * It avoids expensive cache bouncings that common test-and-set spin-lock
+ * implementations incur.
+ */
+#ifndef __LINUX_MCS_SPINLOCK_H
+#define __LINUX_MCS_SPINLOCK_H
+
+struct mcs_spinlock {
+   struct mcs_spinlock *next;
+   int locked; /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
+{
+   struct mcs_spinlock *prev;
+
+   /* Init node */
+   node->locked = 0;
+   node->next   = NULL;
+
+   prev = xchg(lock, node);
+   if (likely(prev == NULL)) {
+   /* Lock acquired */
+   node->locked = 1;
+   return;
+   }
+   ACCESS_ONCE(prev->next) = node;
+   smp_wmb();
+   /* Wait until the lock holder passes the lock down */
+   while (!ACCESS_ONCE(node->locked))
+   arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock 
*node)
+{
+   struct mcs_spinlock *next = ACCESS_ONCE(node->next);
+
+   if (likely(!next)) {
+   /*
+* Release the lock by setting it to NULL
+*/
+   if (cmpxchg(lock, node, NULL) == node)
+   return;
+   /* Wait until the next pointer is set */
+   while (!(next = ACCESS_ONCE(node->next)))
+   arch_mutex_cpu_relax();
+   }
+   ACCESS_ONCE(next->locked) = 1;
+   smp_wmb();
+}
+
+#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ccd4260..e6eaeea 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,6 +46,7 @@
  * - detects multi-task circular deadlocks and prints out all affected
  *   locks and tasks (and only those tasks)
  */
+struct mcs_spinlock;
 struct mutex {
/* 1: unlocked, 0: locked, negative: locked, possible waiters */
atomic_tcount;
@@ -55,7 +56,7 @@ struct mutex {
struct task_struct  *owner;
 #endif
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   void*spin_mlock;/* Spinner MCS lock */
+   struct mcs_spinlock *mcs_lock;  /* Spinner MCS lock */
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
const char  *name;
@@ -179,4 +180,4 @@ extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct 
mutex *lock);
 #define arch_mutex_cpu_relax() cpu_relax()
 #endif
 
-#endif
+#endif /* __LINUX_MUTEX_H */
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..4640731 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -52,7 +53,7 @@ __mutex_init(struct mutex *lock, const char *name, struct 
lock_class_key *key)
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   lock->spin_mlock = NULL;
+   lock->mcs_lock = NULL;
 #endif
 
debug_mutex_init(lock, name, key);
@@ -111,54 +112,7 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-   struct mspin_node *next ;
-   int   locked;   /* 1 if lock acquired */
-};
-#defineMLOCK(mutex)((struct mspin_node **)&((mutex)->spin_mlock))
-
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *prev;
-
-   /* Init node */
-   node->

[PATCH v7 2/6] rwsem: remove 'out' label in do_wake

2013-09-26 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * will block as they will notice the queued writer.
 */
wake_up_process(waiter->task);
-   goto out;
+   return sem;
}
 
/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
/* A writer stole the lock. Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
-   goto out;
+   return sem;
/* Last active locker left. Retry waking readers. */
goto try_reader_grant;
}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
sem->wait_list.next = next;
next->prev = &sem->wait_list;
 
- out:
return sem;
 }
 
-- 
1.7.4.4



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v7 1/6] rwsem: check the lock before cmpxchg in down_write_trylock

2013-09-26 Thread Tim Chen
Cmpxchg will cause cacheline bouncing when doing the value check,
which causes a scalability issue on a large machine (like an 80-core box).

So a pre-read of the lock can relieve this contention.

Signed-off-by: Alex Shi 
---
 include/asm-generic/rwsem.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+   return 0;
 
-   tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
-   return tmp == RWSEM_UNLOCKED_VALUE;
+   return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v7 4/6] rwsem/wake: check lock before do atomic update

2013-09-26 Thread Tim Chen
Atomically updating the lock and rolling back will cause cache bouncing on a
large machine.  A pre-read of the lock status can relieve this problem.

Suggested-by: Davidlohr bueso 
Suggested-by: Tim Chen 
Signed-off-by: Alex Shi 
---
 lib/rwsem.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
-   long oldcount, woken, loop, adjustment;
+   long woken, loop, adjustment;
 
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
while (1) {
+   long oldcount;
+
+   /* A writer stole the lock. */
+   if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+   return sem;
+
oldcount = rwsem_atomic_update(adjustment, sem)
- adjustment;
if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4





[PATCH v7 3/6] rwsem: remove try_reader_grant label do_wake

2013-09-26 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-   /* A writer stole the lock. Undo our reader grant. */
+   while (1) {
+   oldcount = rwsem_atomic_update(adjustment, sem)
+   - adjustment;
+   if (likely(oldcount >= RWSEM_WAITING_BIAS))
+   break;
+
+/* A writer stole the lock.  Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
return sem;
/* Last active locker left. Retry waking readers. */
-   goto try_reader_grant;
}
}
 
-- 
1.7.4.4





Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-26 Thread Tim Chen
On Thu, 2013-09-26 at 15:42 -0700, Jason Low wrote:
> On Thu, 2013-09-26 at 14:41 -0700, Tim Chen wrote:
> > On Thu, 2013-09-26 at 14:09 -0700, Jason Low wrote:
> > > On Thu, 2013-09-26 at 13:40 -0700, Davidlohr Bueso wrote:
> > > > On Thu, 2013-09-26 at 13:23 -0700, Jason Low wrote:
> > > > > On Thu, 2013-09-26 at 13:06 -0700, Davidlohr Bueso wrote:
> > > > > > On Thu, 2013-09-26 at 12:27 -0700, Jason Low wrote:
> > > > > > > On Wed, Sep 25, 2013 at 3:10 PM, Tim Chen 
> > > > > > >  wrote:
> > > > > > > > We will need the MCS lock code for doing optimistic spinning 
> > > > > > > > for rwsem.
> > > > > > > > Extracting the MCS code from mutex.c and put into its own file 
> > > > > > > > allow us
> > > > > > > > to reuse this code easily for rwsem.
> > > > > > > >
> > > > > > > > Signed-off-by: Tim Chen 
> > > > > > > > Signed-off-by: Davidlohr Bueso 
> > > > > > > > ---
> > > > > > > >  include/linux/mcslock.h |   58 
> > > > > > > > +++
> > > > > > > >  kernel/mutex.c  |   58 
> > > > > > > > +-
> > > > > > > >  2 files changed, 65 insertions(+), 51 deletions(-)
> > > > > > > >  create mode 100644 include/linux/mcslock.h
> > > > > > > >
> > > > > > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > > > > > > > new file mode 100644
> > > > > > > > index 000..20fd3f0
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/include/linux/mcslock.h
> > > > > > > > @@ -0,0 +1,58 @@
> > > > > > > > +/*
> > > > > > > > + * MCS lock defines
> > > > > > > > + *
> > > > > > > > + * This file contains the main data structure and API 
> > > > > > > > definitions of MCS lock.
> > > > > > > > + */
> > > > > > > > +#ifndef __LINUX_MCSLOCK_H
> > > > > > > > +#define __LINUX_MCSLOCK_H
> > > > > > > > +
> > > > > > > > +struct mcs_spin_node {
> > > > > > > > +   struct mcs_spin_node *next;
> > > > > > > > +   int   locked;   /* 1 if lock acquired */
> > > > > > > > +};
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * We don't inline mcs_spin_lock() so that perf can correctly 
> > > > > > > > account for the
> > > > > > > > + * time spent in this lock function.
> > > > > > > > + */
> > > > > > > > +static noinline
> > > > > > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct 
> > > > > > > > mcs_spin_node *node)
> > > > > > > > +{
> > > > > > > > +   struct mcs_spin_node *prev;
> > > > > > > > +
> > > > > > > > +   /* Init node */
> > > > > > > > +   node->locked = 0;
> > > > > > > > +   node->next   = NULL;
> > > > > > > > +
> > > > > > > > +   prev = xchg(lock, node);
> > > > > > > > +   if (likely(prev == NULL)) {
> > > > > > > > +   /* Lock acquired */
> > > > > > > > +   node->locked = 1;
> > > > > > > 
> > > > > > > If we don't spin on the local node, is it necessary to set this 
> > > > > > > variable?
> > > > > > 
> > > > > > I don't follow, the whole idea is to spin on the local variable.
> > > > > 
> > > > > If prev == NULL, doesn't that mean it won't proceed to spin on the
> > > > > variable because the lock is already free and we call return? In that
> > > > > case where we directly acquire the lock, I was wondering if it is
> > > > > necessary to set node->locked = 1.
> > > > 
> > >

Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-27 Thread Tim Chen
On Fri, 2013-09-27 at 09:12 -0700, Jason Low wrote:
> On Fri, 2013-09-27 at 08:02 +0200, Ingo Molnar wrote:
> > Would be nice to have this as a separate, add-on patch. Every single 
> > instruction removal that has no downside is an upside!
> 
> Okay, so here is a patch. Tim, would you like to add this to v7?

Okay.  Will do.

Tim

> 
> ...
> Subject: MCS lock: Remove and reorder unnecessary assignments in 
> mcs_spin_lock()
> 
> In mcs_spin_lock(), if (likely(prev == NULL)) is true, then the lock is free
> and we won't spin on the local node. In that case, we don't have to assign
> node->locked because it won't be used. We can also move the node->locked = 0
> assignment so that it occurs after the if (likely(prev == NULL)) check.
> 
> This might also help make it clearer as to how the node->locked variable
> is used in MCS locks.
> 
> Signed-off-by: Jason Low 
> ---
>  include/linux/mcslock.h |3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> index 20fd3f0..1167d57 100644
> --- a/include/linux/mcslock.h
> +++ b/include/linux/mcslock.h
> @@ -21,15 +21,14 @@ void mcs_spin_lock(struct mcs_spin_node **lock, struct 
> mcs_spin_node *node)
>   struct mcs_spin_node *prev;
>  
>   /* Init node */
> - node->locked = 0;
>   node->next   = NULL;
>  
>   prev = xchg(lock, node);
>   if (likely(prev == NULL)) {
>   /* Lock acquired */
> - node->locked = 1;
>   return;
>   }
> + node->locked = 0;
>   ACCESS_ONCE(prev->next) = node;
>   smp_wmb();
>   /* Wait until the lock holder passes the lock down */




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-27 Thread Tim Chen
On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote:
> On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote:
> > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > Extracting the MCS code from mutex.c and put into its own file allow us
> > to reuse this code easily for rwsem.
> > 
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Davidlohr Bueso 
> > ---
> >  include/linux/mcslock.h |   58 
> > +++
> >  kernel/mutex.c  |   58 
> > +-
> >  2 files changed, 65 insertions(+), 51 deletions(-)
> >  create mode 100644 include/linux/mcslock.h
> > 
> > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > new file mode 100644
> > index 000..20fd3f0
> > --- /dev/null
> > +++ b/include/linux/mcslock.h
> > @@ -0,0 +1,58 @@
> > +/*
> > + * MCS lock defines
> > + *
> > + * This file contains the main data structure and API definitions of MCS 
> > lock.
> > + */
> > +#ifndef __LINUX_MCSLOCK_H
> > +#define __LINUX_MCSLOCK_H
> > +
> > +struct mcs_spin_node {
> > +   struct mcs_spin_node *next;
> > +   int   locked;   /* 1 if lock acquired */
> > +};
> > +
> > +/*
> > + * We don't inline mcs_spin_lock() so that perf can correctly account for 
> > the
> > + * time spent in this lock function.
> > + */
> > +static noinline
> > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
> > +{
> > +   struct mcs_spin_node *prev;
> > +
> > +   /* Init node */
> > +   node->locked = 0;
> > +   node->next   = NULL;
> > +
> > +   prev = xchg(lock, node);
> > +   if (likely(prev == NULL)) {
> > +   /* Lock acquired */
> > +   node->locked = 1;
> > +   return;
> > +   }
> > +   ACCESS_ONCE(prev->next) = node;
> > +   smp_wmb();
> > +   /* Wait until the lock holder passes the lock down */
> > +   while (!ACCESS_ONCE(node->locked))
> > +   arch_mutex_cpu_relax();
> > +}
> > +
> > +static void mcs_spin_unlock(struct mcs_spin_node **lock, struct 
> > mcs_spin_node *node)
> > +{
> > +   struct mcs_spin_node *next = ACCESS_ONCE(node->next);
> > +
> > +   if (likely(!next)) {
> > +   /*
> > +* Release the lock by setting it to NULL
> > +*/
> > +   if (cmpxchg(lock, node, NULL) == node)
> > +   return;
> > +   /* Wait until the next pointer is set */
> > +   while (!(next = ACCESS_ONCE(node->next)))
> > +   arch_mutex_cpu_relax();
> > +   }
> > +   ACCESS_ONCE(next->locked) = 1;
> > +   smp_wmb();
> 
> Shouldn't the memory barrier precede the "ACCESS_ONCE(next->locked) = 1;"?
> Maybe in an "else" clause of the prior "if" statement, given that the
> cmpxchg() does it otherwise.
> 
> Otherwise, in the case where the "if" conditionn is false, the critical
> section could bleed out past the unlock.

Yes, I agree with you that the smp_wmb should be moved before the
ACCESS_ONCE to prevent the critical section from bleeding out past the
unlock.  Copying Waiman, who is the original author of the mcs code, to
see if he has any comments on things we may have missed.
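
For clarity, the unlock path with the barrier moved would then look
roughly like this (just a sketch based on the code quoted above; Paul's
point that the cmpxchg path already provides a barrier is a possible
further refinement):

static void mcs_spin_unlock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
{
	struct mcs_spin_node *next = ACCESS_ONCE(node->next);

	if (likely(!next)) {
		/* Release the lock by setting it to NULL */
		if (cmpxchg(lock, node, NULL) == node)
			return;
		/* Wait until the next pointer is set */
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();
	}
	/* Order the critical section before passing the lock down */
	smp_wmb();
	ACCESS_ONCE(next->locked) = 1;
}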

Tim



Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-27 Thread Tim Chen
On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote:
> On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote:
> > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > Extracting the MCS code from mutex.c and put into its own file allow us
> > to reuse this code easily for rwsem.
> > 
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Davidlohr Bueso 
> > ---
> >  include/linux/mcslock.h |   58 
> > +++
> >  kernel/mutex.c  |   58 
> > +-
> >  2 files changed, 65 insertions(+), 51 deletions(-)
> >  create mode 100644 include/linux/mcslock.h
> > 
> > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > new file mode 100644
> > index 000..20fd3f0
> > --- /dev/null
> > +++ b/include/linux/mcslock.h
> > @@ -0,0 +1,58 @@
> > +/*
> > + * MCS lock defines
> > + *
> > + * This file contains the main data structure and API definitions of MCS 
> > lock.
> > + */
> > +#ifndef __LINUX_MCSLOCK_H
> > +#define __LINUX_MCSLOCK_H
> > +
> > +struct mcs_spin_node {
> > +   struct mcs_spin_node *next;
> > +   int   locked;   /* 1 if lock acquired */
> > +};
> > +
> > +/*
> > + * We don't inline mcs_spin_lock() so that perf can correctly account for 
> > the
> > + * time spent in this lock function.
> > + */
> > +static noinline
> > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
> > +{
> > +   struct mcs_spin_node *prev;
> > +
> > +   /* Init node */
> > +   node->locked = 0;
> > +   node->next   = NULL;
> > +
> > +   prev = xchg(lock, node);
> > +   if (likely(prev == NULL)) {
> > +   /* Lock acquired */
> > +   node->locked = 1;
> > +   return;
> > +   }
> > +   ACCESS_ONCE(prev->next) = node;
> > +   smp_wmb();

BTW, is the above memory barrier necessary?  It seems like the xchg
instruction already provided a memory barrier.

Now if we made the changes that Jason suggested:


/* Init node */
-   node->locked = 0;
node->next   = NULL;
 
prev = xchg(lock, node);
if (likely(prev == NULL)) {
/* Lock acquired */
-   node->locked = 1;
return;
}
+   node->locked = 0;
ACCESS_ONCE(prev->next) = node;
smp_wmb();

We are probably still okay as other cpus do not read the value of
node->locked, which is a local variable.
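
Putting the pieces together, the lock function with that change folded in
would read roughly as follows (just a sketch of the combined result; see
the discussion below on whether the barrier placement is still
sufficient):

static noinline
void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
{
	struct mcs_spin_node *prev;

	/* Init node */
	node->next = NULL;

	prev = xchg(lock, node);
	if (likely(prev == NULL)) {
		/* Lock acquired; node->locked is never examined in this case */
		return;
	}
	/* We must spin, so initialize the flag before publishing the node */
	node->locked = 0;
	ACCESS_ONCE(prev->next) = node;
	smp_wmb();
	/* Wait until the lock holder passes the lock down */
	while (!ACCESS_ONCE(node->locked))
		arch_mutex_cpu_relax();
}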

Tim

> > +   /* Wait until the lock holder passes the lock down */
> > +   while (!ACCESS_ONCE(node->locked))
> > +   arch_mutex_cpu_relax();
> > +}
> > +
> > +static void mcs_spin_unlock(struct mcs_spin_node **lock, struct 
> > mcs_spin_node *node)
> > +{
> > +   struct mcs_spin_node *next = ACCESS_ONCE(node->next);
> > +
> > +   if (likely(!next)) {
> > +   /*
> > +* Release the lock by setting it to NULL
> > +*/
> > +   if (cmpxchg(lock, node, NULL) == node)
> > +   return;
> > +   /* Wait until the next pointer is set */
> > +   while (!(next = ACCESS_ONCE(node->next)))
> > +   arch_mutex_cpu_relax();
> > +   }
> > +   ACCESS_ONCE(next->locked) = 1;
> > +   smp_wmb();
> 
> Shouldn't the memory barrier precede the "ACCESS_ONCE(next->locked) = 1;"?
> Maybe in an "else" clause of the prior "if" statement, given that the
> cmpxchg() does it otherwise.
> 
> Otherwise, in the case where the "if" conditionn is false, the critical
> section could bleed out past the unlock.
> 
>   Thanx, Paul
> 




Re: [PATCH] rwsem: reduce spinlock contention in wakeup code path

2013-09-27 Thread Tim Chen
On Fri, 2013-09-27 at 12:39 -0700, Davidlohr Bueso wrote:
> On Fri, 2013-09-27 at 12:28 -0700, Linus Torvalds wrote:
> > On Fri, Sep 27, 2013 at 12:00 PM, Waiman Long  wrote:
> > >
> > > On a large NUMA machine, it is entirely possible that a fairly large
> > > number of threads are queuing up in the ticket spinlock queue to do
> > > the wakeup operation. In fact, only one will be needed.  This patch
> > > tries to reduce spinlock contention by doing just that.
> > >
> > > A new wakeup field is added to the rwsem structure. This field is
> > > set on entry to rwsem_wake() and __rwsem_do_wake() to mark that a
> > > thread is pending to do the wakeup call. It is cleared on exit from
> > > those functions.
> > 
> > Ok, this is *much* simpler than adding the new MCS spinlock, so I'm
> > wondering what the performance difference between the two are.
> 
> Both approaches should be complementary. The idea of optimistic spinning
> in rwsems is to avoid putting putting the writer on the wait queue -
> reducing contention and giving a greater chance for the rwsem
> to get acquired. Waiman's approach is once the blocking actually occurs,
> and at this point I'm not sure how this will affect writer stealing
> logic.
> 

I agree with the view that the two approaches are complementary
to each other.  They address different bottleneck areas in the
rwsem.  Here are the performance numbers for the exim workload
compared to a vanilla kernel.

Waiman's patch:         +2.0%
Alex+Tim's patchset:    +4.8%
Waiman+Alex+Tim:        +5.3%

Tim



Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-27 Thread Tim Chen
On Fri, 2013-09-27 at 13:38 -0700, Paul E. McKenney wrote:
> On Fri, Sep 27, 2013 at 12:38:53PM -0700, Tim Chen wrote:
> > On Fri, 2013-09-27 at 08:29 -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 25, 2013 at 03:10:49PM -0700, Tim Chen wrote:
> > > > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > > > Extracting the MCS code from mutex.c and put into its own file allow us
> > > > to reuse this code easily for rwsem.
> > > > 
> > > > Signed-off-by: Tim Chen 
> > > > Signed-off-by: Davidlohr Bueso 
> > > > ---
> > > >  include/linux/mcslock.h |   58 
> > > > +++
> > > >  kernel/mutex.c  |   58 
> > > > +-
> > > >  2 files changed, 65 insertions(+), 51 deletions(-)
> > > >  create mode 100644 include/linux/mcslock.h
> > > > 
> > > > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > > > new file mode 100644
> > > > index 000..20fd3f0
> > > > --- /dev/null
> > > > +++ b/include/linux/mcslock.h
> > > > @@ -0,0 +1,58 @@
> > > > +/*
> > > > + * MCS lock defines
> > > > + *
> > > > + * This file contains the main data structure and API definitions of 
> > > > MCS lock.
> > > > + */
> > > > +#ifndef __LINUX_MCSLOCK_H
> > > > +#define __LINUX_MCSLOCK_H
> > > > +
> > > > +struct mcs_spin_node {
> > > > +   struct mcs_spin_node *next;
> > > > +   int   locked;   /* 1 if lock acquired */
> > > > +};
> > > > +
> > > > +/*
> > > > + * We don't inline mcs_spin_lock() so that perf can correctly account 
> > > > for the
> > > > + * time spent in this lock function.
> > > > + */
> > > > +static noinline
> > > > +void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node 
> > > > *node)
> > > > +{
> > > > +   struct mcs_spin_node *prev;
> > > > +
> > > > +   /* Init node */
> > > > +   node->locked = 0;
> > > > +   node->next   = NULL;
> > > > +
> > > > +   prev = xchg(lock, node);
> > > > +   if (likely(prev == NULL)) {
> > > > +   /* Lock acquired */
> > > > +   node->locked = 1;
> > > > +   return;
> > > > +   }
> > > > +   ACCESS_ONCE(prev->next) = node;
> > > > +   smp_wmb();
> > 
> > BTW, is the above memory barrier necessary?  It seems like the xchg
> > instruction already provided a memory barrier.
> > 
> > Now if we made the changes that Jason suggested:
> > 
> > 
> > /* Init node */
> > -   node->locked = 0;
> > node->next   = NULL;
> > 
> > prev = xchg(lock, node);
> > if (likely(prev == NULL)) {
> > /* Lock acquired */
> > -   node->locked = 1;
> > return;
> > }
> > +   node->locked = 0;
> > ACCESS_ONCE(prev->next) = node;
> > smp_wmb();
> > 
> > We are probably still okay as other cpus do not read the value of
> > node->locked, which is a local variable.
> 
> I don't immediately see the need for the smp_wmb() in either case.


Thinking a bit more, the following could happen in Jason's
initial patch proposal.  In this case the variable "prev" referenced
by CPU 1 points to the "node" referenced by CPU 2:

CPU 1 (calling lock)                    CPU 2 (calling unlock)
ACCESS_ONCE(prev->next) = node;
                                        *next = ACCESS_ONCE(node->next);
                                        ACCESS_ONCE(next->locked) = 1;
node->locked = 0;

Then we will be spinning forever on CPU 1, as we overwrite the lock handoff
from CPU 2 before we check it.  The original code, which assigns
"node->locked = 0" before the xchg, does not have this issue.
Doing the following change, moving the smp_wmb immediately
after the node->locked assignment (suggested by Jason),

node->locked = 0;
smp_wmb();
ACCESS_ONCE(prev->next) = node;

could avoid the problem, but it will need closer scrutiny to see if
there are other pitfalls when the wmb happens before

ACCESS_ONCE(prev->next) = node;


> > 
> > > > +   /* Wait until the lock holder passes the lock down */
> > > > +   while (!ACCESS_ONCE(node->locked))
> > > > +   arch_mutex_cpu_relax();
> 
> However, you do need a full memory barrier here in order to ensure that
> you see the effects of the previous lock holder's critical section.

Is it necessary to add a memory barrier after acquiring
the lock if the previous lock holder executes smp_wmb before passing
the lock?

Thanks.

Tim




Re: [PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-30 Thread Tim Chen
On Fri, 2013-09-27 at 19:19 -0700, Paul E. McKenney wrote:
> On Fri, Sep 27, 2013 at 04:54:06PM -0700, Jason Low wrote:
> > On Fri, Sep 27, 2013 at 4:01 PM, Paul E. McKenney
> >  wrote:
> > > Yep.  The previous lock holder's smp_wmb() won't keep either the compiler
> > > or the CPU from reordering things for the new lock holder.  They could for
> > > example reorder the critical section to precede the node->locked check,
> > > which would be very bad.
> > 
> > Paul, Tim, Longman,
> > 
> > How would you like the proposed changes below?
> 
> Could you point me at what this applies to?  I can find flaws looking
> at random pieces, given a little luck, but at some point I need to look
> at the whole thing.  ;-)
> 
>   Thanx, Paul

Jason's patch is on top of the following patchset:

https://lkml.org/lkml/2013/9/26/674

Thanks.

Tim

> 
> > ---
> > Subject: [PATCH] MCS: optimizations and barrier corrections
> > 
> > Delete the node->locked = 1 assignment if the lock is free as it won't be 
> > used.
> > 
> > Delete the smp_wmb() in mcs_spin_lock() and add a full memory barrier at the
> > end of the mcs_spin_lock() function. As Paul McKenney suggested, "you do 
> > need a
> > full memory barrier here in order to ensure that you see the effects of the
> > previous lock holder's critical section." And in the mcs_spin_unlock(), 
> > move the
> > memory barrier so that it is before the "ACCESS_ONCE(next->locked) = 1;".
> > 
> > Signed-off-by: Jason Low 
> > Signed-off-by: Paul E. McKenney 
> > Signed-off-by: Tim Chen 
> > ---
> >  include/linux/mcslock.h |7 +++
> >  1 files changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
> > index 20fd3f0..edd57d2 100644
> > --- a/include/linux/mcslock.h
> > +++ b/include/linux/mcslock.h
> > @@ -26,15 +26,14 @@ void mcs_spin_lock(struct mcs_spin_node **lock,
> > struct mcs_spin_node *node)
> > 
> > prev = xchg(lock, node);
> > if (likely(prev == NULL)) {
> > -   /* Lock acquired */
> > -   node->locked = 1;
> > +   /* Lock acquired. No need to set node->locked since it
> > won't be used */
> > return;
> > }
> > ACCESS_ONCE(prev->next) = node;
> > -   smp_wmb();
> > /* Wait until the lock holder passes the lock down */
> > while (!ACCESS_ONCE(node->locked))
> > arch_mutex_cpu_relax();
> > +   smp_mb();
> >  }
> > 
> >  static void mcs_spin_unlock(struct mcs_spin_node **lock, struct
> > mcs_spin_node *node)
> > @@ -51,8 +50,8 @@ static void mcs_spin_unlock(struct mcs_spin_node
> > **lock, struct mcs_spin_node *n
> > while (!(next = ACCESS_ONCE(node->next)))
> > arch_mutex_cpu_relax();
> > }
> > -   ACCESS_ONCE(next->locked) = 1;
> > smp_wmb();
> > +   ACCESS_ONCE(next->locked) = 1;
> >  }
> > 
> >  #endif
> > -- 
> > 1.7.1
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Tim Chen
On Sat, 2013-09-28 at 21:52 +0200, Ingo Molnar wrote:
> * Linus Torvalds  wrote:
> 
> > On Sat, Sep 28, 2013 at 12:37 PM, Ingo Molnar  wrote:
> > >
> > > -   down_write_nest_lock(&anon_vma->root->rwsem, 
> > > &mm->mmap_sem);
> > > +   down_write_nest_lock(&anon_vma->root->rwlock, 
> > > &mm->mmap_sem);
> > 
> > That's just completely bogus, and cannot work.
> > 
> > Maybe just a "write_lock(&anon_vma->root->rwlock)" (which is just
> > anon_vma_unlock_write(anon_vma)). But I think we might have a lockdep
> > issue. I'm not quite sure what's up with the nesting there.
> > 
> > > -   if (rwsem_is_locked(&anon_vma->root->rwsem)) {
> > > +   if (write_can_lock(&anon_vma->root->rwlock)) {
> > > anon_vma_lock_write(anon_vma);
> > > anon_vma_unlock_write(anon_vma);
> > > }
> > 
> > That's the wrong way around. It should be
> > 
> > if (!write_can_lock(&anon_vma->root->rwlock)) {
> > 
> > so some more testing definitely needed.
> 
> Yeah, that silly API asymmetry has bitten me before as well :-/
> 
> The attached version booted up fine under 16-way KVM:
> 
>  sh-4.2# uptime
>   19:50:08 up 0 min,  0 users,  load average: 0.00, 0.00, 0.00
> 
> That's all the testing it will get this evening though. Patch should be 
> good enough for Tim to try?


Here's the exim workload data:

rwsem improvement:
Waiman's patch:         +2.0%
Alex+Tim's patchset:    +4.8%
Waiman+Alex+Tim:        +5.3%

convert rwsem to rwlock_t for root anon_vma lock:
Ingo's patch:           +11.7%

Yes, changing the anon-vma root lock to rwlock_t gives the most boost.

However, I think that Waiman's patches, Alex's patches and my
patches can still be combined to improve the scalability of
rwsem, even if we decide to convert this particular rwsem to
rwlock_t.

Thanks.

Tim

> 
> Thanks,
> 
>   Ingo
> 




Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Tim Chen
On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote:
> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote:
> > Here's the exim workload data:
> > 
> > rwsem improvment:
> > Waimain's patch:+2.0%
> > Alex+Tim's patchset:+4.8%
> > Waiman+Alex+Tim:+5.3%
> > 
> > convert rwsem to rwlock_t for root anon_vma lock
> > Ingo's patch+11.7%
> > 
> 
> What happens if you stuff Waiman's qrwlock patches on top of that?
> admittedly and oft mentioned in this thread, our current rwlock_t is
> somewhat suboptimal under a number of conditions.

I've tested with Waiman's qrwlock patches on top of Ingo's patches.
It does not affect the throughput for exim and I still get
about +11.7% throughput change (same as with Ingo's patch only).

Tim



Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Tim Chen
On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote:
> On 09/30/2013 03:23 PM, Tim Chen wrote:
> > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote:
> >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote:
> >>> Here's the exim workload data:
> >>>
> >>> rwsem improvment:
> >>> Waimain's patch:+2.0%
> >>> Alex+Tim's patchset:+4.8%
> >>> Waiman+Alex+Tim:+5.3%
> >>>
> >>> convert rwsem to rwlock_t for root anon_vma lock
> >>> Ingo's patch  +11.7%
> >>>
> >> What happens if you stuff Waiman's qrwlock patches on top of that?
> >> admittedly and oft mentioned in this thread, our current rwlock_t is
> >> somewhat suboptimal under a number of conditions.
> > I've tested with Waiman's qrwlock patches on top of Ingo's patches.
> > It does not affect the throughput for exim and I still get
> > about +11.7% throughput change (same as with Ingo's patch only).
> >
> > Tim
> >
> 
> My qrwlock doesn't enable qrwlock by default. You have to use menuconfig 
> to explicitly enable it. Have you done that when you build the test 
> kernel? I am thinking of explicitly enabling it for x86 if the anon-vma 
> lock is converted back to a rwlock.
> 

Yes, I have explicitly enabled it during my testing.

Thanks.
Tim



Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Tim Chen
On Mon, 2013-09-30 at 12:47 -0700, Tim Chen wrote:
> On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote:
> > On 09/30/2013 03:23 PM, Tim Chen wrote:
> > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote:
> > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote:
> > >>> Here's the exim workload data:
> > >>>
> > >>> rwsem improvment:
> > >>> Waimain's patch:+2.0%
> > >>> Alex+Tim's patchset:+4.8%
> > >>> Waiman+Alex+Tim:+5.3%
> > >>>
> > >>> convert rwsem to rwlock_t for root anon_vma lock
> > >>> Ingo's patch+11.7%
> > >>>
> > >> What happens if you stuff Waiman's qrwlock patches on top of that?
> > >> admittedly and oft mentioned in this thread, our current rwlock_t is
> > >> somewhat suboptimal under a number of conditions.
> > > I've tested with Waiman's qrwlock patches on top of Ingo's patches.
> > > It does not affect the throughput for exim and I still get
> > > about +11.7% throughput change (same as with Ingo's patch only).
> > >
> > > Tim
> > >
> > 
> > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig 
> > to explicitly enable it. Have you done that when you build the test 
> > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma 
> > lock is converted back to a rwlock.
> > 
> 
> Yes, I have explicitly enabled it during my testing.
> 
The workload I have is dominated by writer locks, with only very
occasional readers. So it may not benefit from the various tweaks you
have put in qrwlock.

Tim 



[PATCH v5 0/6] rwsem: performance optimizations

2013-09-24 Thread Tim Chen
We have incorporated various suggestions from Ingo for version 5 of this
patchset and would like to have it merged if there are no objections.

In this patchset, we introduce two categories of optimizations to the read
write semaphore.  The first four patches from Alex Shi reduce cache bouncing
of the sem->count field by doing a pre-read of sem->count and avoiding the
cmpxchg when possible.
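
As a quick illustration of the pre-read idea (a generic sketch only; the
actual change is in patch 1), a plain read of sem->count lets us skip the
cmpxchg, and the cacheline bouncing it causes, whenever the lock is
visibly not free:

	static inline int __down_write_trylock(struct rw_semaphore *sem)
	{
		/* Cheap shared read first; only attempt cmpxchg if it can succeed */
		if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
			return 0;

		return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
			       RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
	}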

The last two patches introduce similar optimistic spinning logic as
the mutex code for the writer lock acquisition of rwsem.

Without these optimizations, Davidlohr Bueso saw a -8% regression to
aim7's shared and high_systime workloads when he switched i_mmap_mutex to
rwsem.  Tests were on an 8 socket, 80 core system.  With the patchset, he
got significant improvements to the aim7 suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

Tim Chen also got a +5% improvement to the exim mail server workload on a
40 core system.

Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing this 
patchset.

Regards,
Tim Chen

Changelog:

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at wait queue.  This provides 5% boost to exim workload
and between 2% to 8% boost to aim7. 
2. Put MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the 
operations default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed
   a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner.
the option to be on for more seasoning but can be turned off should it be 
detrimental.
3. Various patch comments update



Alex Shi (4):
  rwsem: check the lock before cpmxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label do_wake
  rwsem/wake: check lock before do atomic update

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
own file
  rwsem: do optimistic spinning for writer lock acquisition

 include/asm-generic/rwsem.h |8 +-
 include/linux/rwsem.h   |8 +-
 kernel/mutex.c  |   58 ++--
 kernel/rwsem.c  |   19 -
 lib/rwsem.c |  218 ++-
 5 files changed, 228 insertions(+), 83 deletions(-)

-- 
1.7.4.4




[PATCH v5 3/6] rwsem: remove try_reader_grant label do_wake

2013-09-24 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-   /* A writer stole the lock. Undo our reader grant. */
+   while (1) {
+   oldcount = rwsem_atomic_update(adjustment, sem)
+   - adjustment;
+   if (likely(oldcount >= RWSEM_WAITING_BIAS))
+   break;
+
+/* A writer stole the lock.  Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
return sem;
/* Last active locker left. Retry waking readers. */
-   goto try_reader_grant;
}
}
 
-- 
1.7.4.4





[PATCH v5 4/6] rwsem/wake: check lock before do atomic update

2013-09-24 Thread Tim Chen
Atomically updating the lock and rolling back will cause cache bouncing
on a large machine. A lock status pre-read can relieve this problem.

Suggested-by: Davidlohr bueso 
Suggested-by: Tim Chen 
Signed-off-by: Alex Shi 
---
 lib/rwsem.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
-   long oldcount, woken, loop, adjustment;
+   long woken, loop, adjustment;
 
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
while (1) {
+   long oldcount;
+
+   /* A writer stole the lock. */
+   if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+   return sem;
+
oldcount = rwsem_atomic_update(adjustment, sem)
- adjustment;
if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4





[PATCH v5 6/6] rwsem: do optimistic spinning for writer lock acquisition

2013-09-24 Thread Tim Chen
We want to add optimistic spinning to rwsems because the writer rwsem
does not perform as well as mutexes. Tim noticed this for exim (mail
server) workloads when reverting commit 4fc3f1d6, and Davidlohr noticed
it when converting the i_mmap_mutex to a rwsem in some aim7 workloads.
The biggest difference shows up when we fail to acquire a mutex in the
fastpath: optimistic spinning comes into play and we can avoid a large
amount of unnecessary sleeping and the overhead of moving tasks in and
out of the wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to get acquired. With these changes, rwsem is on par with mutex.

Reviewed-by: Peter Zijlstra 
Reviewed-by: Peter Hurley 
Reviewed-by: Ingo Molnar 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/rwsem.h |8 ++-
 kernel/rwsem.c|   19 +-
 lib/rwsem.c   |  193 +
 3 files changed, 201 insertions(+), 19 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..4dfc6ae 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,8 @@ struct rw_semaphore {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map  dep_map;
 #endif
+   struct task_struct  *owner;
+   void*spin_mlock;
 };
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
@@ -58,8 +60,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)  \
{ RWSEM_UNLOCKED_VALUE, \
  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
- LIST_HEAD_INIT((name).wait_list)  \
- __RWSEM_DEP_MAP_INIT(name) }
+ LIST_HEAD_INIT((name).wait_list), \
+ __RWSEM_DEP_MAP_INIT(name)\
+ NULL, \
+ NULL }
 
 #define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include 
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+   sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+   sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
int ret = __down_write_trylock(sem);
 
-   if (ret == 1)
+   if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+   rwsem_set_owner(sem);
+   }
return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
__up_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 * dependency.
 */
__downgrade_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct 
lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int 
subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..1752ec9 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+   sem->owner = NULL;
+   sem->spin_mlock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,177 @@ struct rw_semaphore __sched 
*rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
 }
 
+static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+   if (!(count & RWSEM_ACTIVE_MASK)) {
+   /* Try acquiring the write lock. */
+   if (se

[PATCH v5 2/6] rwsem: remove 'out' label in do_wake

2013-09-24 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * will block as they will notice the queued writer.
 */
wake_up_process(waiter->task);
-   goto out;
+   return sem;
}
 
/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
/* A writer stole the lock. Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
-   goto out;
+   return sem;
/* Last active locker left. Retry waking readers. */
goto try_reader_grant;
}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
sem->wait_list.next = next;
next->prev = &sem->wait_list;
 
- out:
return sem;
 }
 
-- 
1.7.4.4





[PATCH v5 1/6] rwsem: check the lock before cpmxchg in down_write_trylock

2013-09-24 Thread Tim Chen
Cmpxchg will cause cacheline bouncing when doing the value checking,
which causes a scalability issue on a large machine (like an 80 core box).

So a lock pre-read can relieve this contention.

Signed-off-by: Alex Shi 
---
 include/asm-generic/rwsem.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+   return 0;
 
-   tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
-   return tmp == RWSEM_UNLOCKED_VALUE;
+   return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4





[PATCH v5 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-24 Thread Tim Chen
We will need the MCS lock code for doing optimistic spinning for rwsem.
Extracting the MCS code from mutex.c and put into its own file allow us
to reuse this code easily for rwsem.

Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 kernel/mutex.c |   58 ++-
 1 files changed, 7 insertions(+), 51 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..1b6ba3f 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -111,54 +112,9 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-   struct mspin_node *next ;
-   int   locked;   /* 1 if lock acquired */
-};
-#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
 
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *prev;
-
-   /* Init node */
-   node->locked = 0;
-   node->next   = NULL;
-
-   prev = xchg(lock, node);
-   if (likely(prev == NULL)) {
-   /* Lock acquired */
-   node->locked = 1;
-   return;
-   }
-   ACCESS_ONCE(prev->next) = node;
-   smp_wmb();
-   /* Wait until the lock holder passes the lock down */
-   while (!ACCESS_ONCE(node->locked))
-   arch_mutex_cpu_relax();
-}
-
-static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *next = ACCESS_ONCE(node->next);
-
-   if (likely(!next)) {
-   /*
-* Release the lock by setting it to NULL
-*/
-   if (cmpxchg(lock, node, NULL) == node)
-   return;
-   /* Wait until the next pointer is set */
-   while (!(next = ACCESS_ONCE(node->next)))
-   arch_mutex_cpu_relax();
-   }
-   ACCESS_ONCE(next->locked) = 1;
-   smp_wmb();
-}
+#define MLOCK(mutex) ((struct mcs_spin_node **)&((mutex)->spin_mlock))
 
 /*
  * Mutex spinning code migrated from kernel/sched/core.c
@@ -448,7 +404,7 @@ __mutex_lock_common(struct mutex *lock, long state, 
unsigned int subclass,
 
for (;;) {
struct task_struct *owner;
-   struct mspin_node  node;
+   struct mcs_spin_node  node;
 
if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 
0) {
struct ww_mutex *ww;
@@ -470,10 +426,10 @@ __mutex_lock_common(struct mutex *lock, long state, 
unsigned int subclass,
 * If there's an owner, wait for it to either
 * release the lock or go to sleep.
 */
-   mspin_lock(MLOCK(lock), &node);
+   mcs_spin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
if (owner && !mutex_spin_on_owner(lock, owner)) {
-   mspin_unlock(MLOCK(lock), &node);
+   mcs_spin_unlock(MLOCK(lock), &node);
goto slowpath;
}
 
@@ -488,11 +444,11 @@ __mutex_lock_common(struct mutex *lock, long state, 
unsigned int subclass,
}
 
mutex_set_owner(lock);
-   mspin_unlock(MLOCK(lock), &node);
+   mcs_spin_unlock(MLOCK(lock), &node);
preempt_enable();
return 0;
}
-   mspin_unlock(MLOCK(lock), &node);
+   mcs_spin_unlock(MLOCK(lock), &node);
 
/*
 * When there's no owner, we might have preempted between the
-- 
1.7.4.4





Re: [PATCH v5 1/6] rwsem: check the lock before cpmxchg in down_write_trylock

2013-09-24 Thread Tim Chen
On Tue, 2013-09-24 at 16:22 -0700, Jason Low wrote:
> Should we do something similar with __down_read_trylock, such as
> the following?
> 
> 
> Signed-off-by: Jason Low 
> ---
>  include/asm-generic/rwsem.h |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> index bb1e2cd..47990dc 100644
> --- a/include/asm-generic/rwsem.h
> +++ b/include/asm-generic/rwsem.h
> @@ -42,6 +42,9 @@ static inline int __down_read_trylock(struct
> rw_semaphore *sem)
> long tmp;
> 
> while ((tmp = sem->count) >= 0) {
> +   if (sem->count != tmp)
> +   continue;
> +

Considering that tmp has just been assigned the value of sem->count, the
added if check will rarely trigger and is probably not needed.  We should
just proceed to the cmpxchg below.

> if (tmp == cmpxchg(&sem->count, tmp,
>tmp + RWSEM_ACTIVE_READ_BIAS)) {
> return 1;
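
For reference, the existing generic loop already re-reads sem->count on
each iteration, so it looks roughly like this (reconstructed from the
context above):

	static inline int __down_read_trylock(struct rw_semaphore *sem)
	{
		long tmp;

		while ((tmp = sem->count) >= 0) {
			if (tmp == cmpxchg(&sem->count, tmp,
					   tmp + RWSEM_ACTIVE_READ_BIAS))
				return 1;
		}
		return 0;
	}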

Tim



Re: [PATCH v5 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-25 Thread Tim Chen
On Wed, 2013-09-25 at 07:55 +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2013 at 03:22:46PM -0700, Tim Chen wrote:
> > We will need the MCS lock code for doing optimistic spinning for rwsem.
> > Extracting the MCS code from mutex.c and put into its own file allow us
> > to reuse this code easily for rwsem.
> > 
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Davidlohr Bueso 
> > ---
> >  kernel/mutex.c |   58 
> > ++-
> >  1 files changed, 7 insertions(+), 51 deletions(-)
> 
> Wasn't this patch supposed to add include/linux/mcslock.h ?

Thanks for catching it.  I will correct it.

Tim



[PATCH v6 0/6] rwsem: performance optimizations

2013-09-25 Thread Tim Chen
We fixed a missing file and various style issues for
version 6 of this patchset.  We would like to have it merged if
there are no objections.

In this patchset, we introduce two categories of optimizations to read
write semaphore.  The first four patches from Alex Shi reduce cache
bouncing of the sem->count field by doing a pre-read of the sem->count
and avoid cmpxchg if possible.

The last two patches introduce similar optimistic spinning logic as the
mutex code for the writer lock acquisition of rwsem. This addresses the
general 'mutexes out perform writer-rwsems' situations that has been
seen in more than one case.  Users now need not worry about performance
issues when choosing between these two locking mechanisms.

Without these optimizations, Davidlohr Bueso saw a -8% regression to
aim7's shared and high_systime workloads when he switched i_mmap_mutex
to rwsem.  Tests were on an 8 socket, 80 core system.  With the patchset,
he got significant improvements to the aim7 suite instead of regressions:

alltests (+16.3%), custom (+20%), disk (+19.5%), high_systime (+7%),
shared (+18.4%) and short (+6.3%).

Tim Chen also got a +5% improvement to the exim mail server workload on a
40 core system.

Thanks to Ingo Molnar, Peter Hurley and Peter Zijlstra for reviewing
this patchset.

Regards,
Tim Chen

Changelog:

v6:
1. Fix missing mcslock.h file.
2. Fix various code style issues.

v5:
1. Try optimistic spinning before we put the writer on the wait queue
to avoid bottlenecking at wait queue.  This provides 5% boost to exim workload
and between 2% to 8% boost to aim7. 
2. Put MCS locking code into its own mcslock.h file for better reuse
between mutex.c and rwsem.c
3. Remove the configuration RWSEM_SPIN_ON_WRITE_OWNER and make the 
operations default per Ingo's suggestions.

v4:
1. Fixed a bug in task_struct definition in rwsem_can_spin_on_owner
2. Fix another typo for RWSEM_SPIN_ON_WRITE_OWNER config option

v3:
1. Added ACCESS_ONCE to sem->count access in rwsem_can_spin_on_owner.
2. Fix typo bug for RWSEM_SPIN_ON_WRITE_OWNER option in init/Kconfig

v2:
1. Reorganize changes to down_write_trylock and do_wake into 4 patches and fixed
   a bug referencing &sem->count when sem->count is intended.
2. Fix unsafe sem->owner de-reference in rwsem_can_spin_on_owner.
the option to be on for more seasoning but can be turned off should it be 
detrimental.
3. Various patch comments update

Alex Shi (4):
  rwsem: check the lock before cpmxchg in down_write_trylock
  rwsem: remove 'out' label in do_wake
  rwsem: remove try_reader_grant label do_wake
  rwsem/wake: check lock before do atomic update

Tim Chen (2):
  MCS Lock: Restructure the MCS lock defines and locking code into its
own file
  rwsem: do optimistic spinning for writer lock acquisition

 include/asm-generic/rwsem.h |8 +-
 include/linux/mcslock.h |   58 +++
 include/linux/rwsem.h   |6 +-
 kernel/mutex.c  |   58 ++--
 kernel/rwsem.c  |   19 -
 lib/rwsem.c |  228 +-
 6 files changed, 292 insertions(+), 85 deletions(-)
 create mode 100644 include/linux/mcslock.h

-- 
1.7.4.4




[PATCH v6 5/6] MCS Lock: Restructure the MCS lock defines and locking code into its own file

2013-09-25 Thread Tim Chen
We will need the MCS lock code for doing optimistic spinning for rwsem.
Extracting the MCS code from mutex.c and put into its own file allow us
to reuse this code easily for rwsem.

Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/mcslock.h |   58 +++
 kernel/mutex.c  |   58 +-
 2 files changed, 65 insertions(+), 51 deletions(-)
 create mode 100644 include/linux/mcslock.h

diff --git a/include/linux/mcslock.h b/include/linux/mcslock.h
new file mode 100644
index 000..20fd3f0
--- /dev/null
+++ b/include/linux/mcslock.h
@@ -0,0 +1,58 @@
+/*
+ * MCS lock defines
+ *
+ * This file contains the main data structure and API definitions of MCS lock.
+ */
+#ifndef __LINUX_MCSLOCK_H
+#define __LINUX_MCSLOCK_H
+
+struct mcs_spin_node {
+   struct mcs_spin_node *next;
+   int   locked;   /* 1 if lock acquired */
+};
+
+/*
+ * We don't inline mcs_spin_lock() so that perf can correctly account for the
+ * time spent in this lock function.
+ */
+static noinline
+void mcs_spin_lock(struct mcs_spin_node **lock, struct mcs_spin_node *node)
+{
+   struct mcs_spin_node *prev;
+
+   /* Init node */
+   node->locked = 0;
+   node->next   = NULL;
+
+   prev = xchg(lock, node);
+   if (likely(prev == NULL)) {
+   /* Lock acquired */
+   node->locked = 1;
+   return;
+   }
+   ACCESS_ONCE(prev->next) = node;
+   smp_wmb();
+   /* Wait until the lock holder passes the lock down */
+   while (!ACCESS_ONCE(node->locked))
+   arch_mutex_cpu_relax();
+}
+
+static void mcs_spin_unlock(struct mcs_spin_node **lock, struct mcs_spin_node 
*node)
+{
+   struct mcs_spin_node *next = ACCESS_ONCE(node->next);
+
+   if (likely(!next)) {
+   /*
+* Release the lock by setting it to NULL
+*/
+   if (cmpxchg(lock, node, NULL) == node)
+   return;
+   /* Wait until the next pointer is set */
+   while (!(next = ACCESS_ONCE(node->next)))
+   arch_mutex_cpu_relax();
+   }
+   ACCESS_ONCE(next->locked) = 1;
+   smp_wmb();
+}
+
+#endif
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 6d647ae..1b6ba3f 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -111,54 +112,9 @@ EXPORT_SYMBOL(mutex_lock);
  * more or less simultaneously, the spinners need to acquire a MCS lock
  * first before spinning on the owner field.
  *
- * We don't inline mspin_lock() so that perf can correctly account for the
- * time spent in this lock function.
  */
-struct mspin_node {
-   struct mspin_node *next ;
-   int   locked;   /* 1 if lock acquired */
-};
-#define MLOCK(mutex) ((struct mspin_node **)&((mutex)->spin_mlock))
 
-static noinline
-void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *prev;
-
-   /* Init node */
-   node->locked = 0;
-   node->next   = NULL;
-
-   prev = xchg(lock, node);
-   if (likely(prev == NULL)) {
-   /* Lock acquired */
-   node->locked = 1;
-   return;
-   }
-   ACCESS_ONCE(prev->next) = node;
-   smp_wmb();
-   /* Wait until the lock holder passes the lock down */
-   while (!ACCESS_ONCE(node->locked))
-   arch_mutex_cpu_relax();
-}
-
-static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
-{
-   struct mspin_node *next = ACCESS_ONCE(node->next);
-
-   if (likely(!next)) {
-   /*
-* Release the lock by setting it to NULL
-*/
-   if (cmpxchg(lock, node, NULL) == node)
-   return;
-   /* Wait until the next pointer is set */
-   while (!(next = ACCESS_ONCE(node->next)))
-   arch_mutex_cpu_relax();
-   }
-   ACCESS_ONCE(next->locked) = 1;
-   smp_wmb();
-}
+#define MLOCK(mutex) ((struct mcs_spin_node **)&((mutex)->spin_mlock))
 
 /*
  * Mutex spinning code migrated from kernel/sched/core.c
@@ -448,7 +404,7 @@ __mutex_lock_common(struct mutex *lock, long state, 
unsigned int subclass,
 
for (;;) {
struct task_struct *owner;
-   struct mspin_node  node;
+   struct mcs_spin_node  node;
 
if (!__builtin_constant_p(ww_ctx == NULL) && ww_ctx->acquired > 
0) {
struct ww_mutex *ww;
@@ -470,10 +426,10 @@ __mutex_lock_common(struct mutex *lock, long state, 
unsigned int subclass,
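
Below is a minimal usage sketch of the mcs_spin_lock()/mcs_spin_unlock() API
introduced in include/linux/mcslock.h above.  It is illustrative only;
my_lock and my_critical_section are made-up names, not part of the patch.
Each locker supplies its own queue node, normally on its stack, and spins
only on that private node's "locked" flag:

	#include <linux/mcslock.h>

	/* shared queue tail: NULL means the lock is free */
	static struct mcs_spin_node *my_lock;

	static void my_critical_section(void)
	{
		struct mcs_spin_node node;	/* per-locker node on this stack */

		mcs_spin_lock(&my_lock, &node);	  /* queue up, spin on node.locked */
		/* ... protected work ... */
		mcs_spin_unlock(&my_lock, &node); /* pass the lock to node.next */
	}

This mirrors how __mutex_lock_common uses the lock in the hunk above: because
each spinner waits on its own node, contention does not bounce a single lock
word between all of the waiting CPUs.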

[PATCH v6 3/6] rwsem: remove try_reader_grant label do_wake

2013-09-25 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 42f1b1a..a8055cf 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -85,15 +85,17 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
adjustment = 0;
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-   /* A writer stole the lock. Undo our reader grant. */
+   while (1) {
+   oldcount = rwsem_atomic_update(adjustment, sem)
+   - adjustment;
+   if (likely(oldcount >= RWSEM_WAITING_BIAS))
+   break;
+
+/* A writer stole the lock.  Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
return sem;
/* Last active locker left. Retry waking readers. */
-   goto try_reader_grant;
}
}
 
-- 
1.7.4.4





[PATCH v6 6/6] rwsem: do optimistic spinning for writer lock acquisition

2013-09-25 Thread Tim Chen
We want to add optimistic spinning to rwsems because the writer rwsem does
not perform as well as mutexes. Tim noticed this for exim (mail server)
workloads when reverting commit 4fc3f1d6, and Davidlohr noticed it when
converting the i_mmap_mutex to a rwsem in some aim7 workloads. The biggest
difference is that when we fail to acquire a mutex in the fastpath,
optimistic spinning comes into play and we can avoid a large amount of
unnecessary sleeping and the overhead of moving tasks in and out of the
wait queue.

Allowing optimistic spinning before putting the writer on the wait queue
reduces wait queue contention and provides a greater chance for the rwsem
to get acquired. With these changes, rwsem is on par with mutex.
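
A hedged sketch of the writer spin loop this patch adds (not the exact patch
code; rwsem_owner_running() and rwsem_try_write_lock_unqueued() are
illustrative placeholder names for the helpers such a loop needs):

	/*
	 * Spin as long as the current owner is running on a CPU; once the
	 * owner is scheduled out, or we manage to take the lock, stop.
	 */
	while (true) {
		struct task_struct *owner = ACCESS_ONCE(sem->owner);

		if (owner && !rwsem_owner_running(sem, owner))
			break;			/* owner sleeps: stop spinning */

		if (rwsem_try_write_lock_unqueued(sem))
			return true;		/* got the lock without sleeping */

		arch_mutex_cpu_relax();		/* be nice to the sibling thread */
	}
	/* fall back to queueing on sem->wait_list and sleeping, as before */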

Reviewed-by: Ingo Molnar 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Peter Hurley 
Signed-off-by: Tim Chen 
Signed-off-by: Davidlohr Bueso 
---
 include/linux/rwsem.h |6 +-
 kernel/rwsem.c|   19 +-
 lib/rwsem.c   |  203 -
 3 files changed, 207 insertions(+), 21 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..ef5a83a 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -26,6 +26,8 @@ struct rw_semaphore {
longcount;
raw_spinlock_t  wait_lock;
struct list_headwait_list;
+   struct task_struct  *owner; /* write owner */
+   void*spin_mlock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map  dep_map;
 #endif
@@ -58,7 +60,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #define __RWSEM_INITIALIZER(name)  \
{ RWSEM_UNLOCKED_VALUE, \
  __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \
- LIST_HEAD_INIT((name).wait_list)  \
+ LIST_HEAD_INIT((name).wait_list), \
+ NULL, \
+ NULL  \
  __RWSEM_DEP_MAP_INIT(name) }
 
 #define DECLARE_RWSEM(name) \
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..d74d1c9 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include 
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+   sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+   sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
int ret = __down_write_trylock(sem);
 
-   if (ret == 1)
+   if (ret == 1) {
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+   rwsem_set_owner(sem);
+   }
return ret;
 }
 
@@ -86,6 +99,7 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
__up_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +114,7 @@ void downgrade_write(struct rw_semaphore *sem)
 * dependency.
 */
__downgrade_write(sem);
+   rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +137,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct 
lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +157,7 @@ void down_write_nested(struct rw_semaphore *sem, int 
subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 1d6e6e8..9535ef7 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Initialize an rwsem:
@@ -27,6 +29,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
sem->count = RWSEM_UNLOCKED_VALUE;
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
+   sem->owner = NULL;
+   sem->spin_mlock = NULL;
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -194,14 +198,179 @@ struct rw_semaphore __sched 
*rwsem_down_read_failed(struct rw_semaphore *sem)
return sem;
 }
 
+static inline int rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+   if (!(count & RWSEM_ACTIVE_MASK)) {
+   /* Try acquiring the write lock. */
+   if (sem->count == RWSEM

[PATCH v6 2/6] rwsem: remove 'out' label in do_wake

2013-09-25 Thread Tim Chen
That makes the code simpler and more readable.

Signed-off-by: Alex Shi 
---
 lib/rwsem.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..42f1b1a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 * will block as they will notice the queued writer.
 */
wake_up_process(waiter->task);
-   goto out;
+   return sem;
}
 
/* Writers might steal the lock before we grant it to the next reader.
@@ -91,7 +91,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
/* A writer stole the lock. Undo our reader grant. */
if (rwsem_atomic_update(-adjustment, sem) &
RWSEM_ACTIVE_MASK)
-   goto out;
+   return sem;
/* Last active locker left. Retry waking readers. */
goto try_reader_grant;
}
@@ -136,7 +136,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
sem->wait_list.next = next;
next->prev = &sem->wait_list;
 
- out:
return sem;
 }
 
-- 
1.7.4.4





[PATCH v6 4/6] rwsem/wake: check lock before do atomic update

2013-09-25 Thread Tim Chen
Atomically updating the lock and rolling back will cause cache bouncing on a
large machine. A pre-read of the lock status can relieve this problem.

Suggested-by: Davidlohr bueso 
Suggested-by: Tim Chen 
Signed-off-by: Alex Shi 
---
 lib/rwsem.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index a8055cf..1d6e6e8 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
struct rwsem_waiter *waiter;
struct task_struct *tsk;
struct list_head *next;
-   long oldcount, woken, loop, adjustment;
+   long woken, loop, adjustment;
 
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
@@ -86,6 +86,12 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
if (wake_type != RWSEM_WAKE_READ_OWNED) {
adjustment = RWSEM_ACTIVE_READ_BIAS;
while (1) {
+   long oldcount;
+
+   /* A writer stole the lock. */
+   if (unlikely(sem->count < RWSEM_WAITING_BIAS))
+   return sem;
+
oldcount = rwsem_atomic_update(adjustment, sem)
- adjustment;
if (likely(oldcount >= RWSEM_WAITING_BIAS))
-- 
1.7.4.4





[PATCH v6 1/6] rwsem: check the lock before cpmxchg in down_write_trylock

2013-09-25 Thread Tim Chen
Cmpxchg will cause cacheline bouncing when doing the value checking, and
that causes a scalability issue on a large machine (like an 80 core box).

So a pre-read of the lock can relieve this contention.
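
The same idea in a generic, illustrative form (try_take() and its parameters
are made-up names for this sketch, not an API added by the patch): do a plain
read first so that CPUs which cannot win skip the exclusive-ownership RMW.

	#include <linux/atomic.h>

	static inline int try_take(atomic_long_t *v, long free, long taken)
	{
		if (atomic_long_read(v) != free)	/* cheap shared read */
			return 0;
		/* only issue the cmpxchg when it has a chance of succeeding */
		return atomic_long_cmpxchg(v, free, taken) == free;
	}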

Signed-off-by: Alex Shi 
---
 include/asm-generic/rwsem.h |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..5ba80e7 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   if (unlikely(sem->count != RWSEM_UNLOCKED_VALUE))
+   return 0;
 
-   tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
-   return tmp == RWSEM_UNLOCKED_VALUE;
+   return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
-- 
1.7.4.4





[PATCH 2/3 v2] mm: Reorg code to allow i_mmap_mutex acquisition to be done by caller of page_referenced & try_to_unmap

2012-09-10 Thread Tim Chen
We reorganize the page_referenced and try_to_unmap code to determine
explicitly if mapping->i_mmap_mutex needs to be acquired.  This allows
us to acquire the mutex for multiple pages in batch. We can call 
__page_referenced or __try_to_unmap with the mutex already acquired so
the mutex doesn't have to be acquired multiple times.

Tim

---
Signed-off-by: Tim Chen 
---
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index fd07c45..f1320b1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -156,8 +156,11 @@ static inline void page_dup_rmap(struct page *page)
 /*
  * Called from mm/vmscan.c to handle paging out
  */
-int page_referenced(struct page *, int is_locked,
-   struct mem_cgroup *memcg, unsigned long *vm_flags);
+int needs_page_mmap_mutex(struct page *page);
+int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg,
+   unsigned long *vm_flags);
+int __page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg,
+   unsigned long *vm_flags);
 int page_referenced_one(struct page *, struct vm_area_struct *,
unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
 
@@ -176,6 +179,7 @@ enum ttu_flags {
 bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 int try_to_unmap(struct page *, enum ttu_flags flags);
+int __try_to_unmap(struct page *, enum ttu_flags flags);
 int try_to_unmap_one(struct page *, struct vm_area_struct *,
unsigned long address, enum ttu_flags flags);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 5b5ad58..8be1799 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -843,8 +843,7 @@ static int page_referenced_file(struct page *page,
 * so we can safely take mapping->i_mmap_mutex.
 */
BUG_ON(!PageLocked(page));
-
-   mutex_lock(&mapping->i_mmap_mutex);
+   BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex));
 
/*
 * i_mmap_mutex does not stabilize mapcount at all, but mapcount
@@ -869,21 +868,15 @@ static int page_referenced_file(struct page *page,
break;
}
 
-   mutex_unlock(&mapping->i_mmap_mutex);
return referenced;
 }
 
-/**
- * page_referenced - test if the page was referenced
- * @page: the page to test
- * @is_locked: caller holds lock on the page
- * @memcg: target memory cgroup
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the 
page
- *
- * Quick test_and_clear_referenced for all mappings to a page,
- * returns the number of ptes which referenced the page.
- */
-int page_referenced(struct page *page,
+int needs_page_mmap_mutex(struct page *page)
+{
+   return page->mapping && !PageKsm(page) && !PageAnon(page);
+}
+
+int __page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
unsigned long *vm_flags)
@@ -919,6 +912,32 @@ out:
return referenced;
 }
 
+/**
+ * page_referenced - test if the page was referenced
+ * @page: the page to test
+ * @is_locked: caller holds lock on the page
+ * @memcg: target memory cgroup
+ * @vm_flags: collect encountered vma->vm_flags who actually referenced the 
page
+ *
+ * Quick test_and_clear_referenced for all mappings to a page,
+ * returns the number of ptes which referenced the page.
+ */
+int page_referenced(struct page *page,
+   int is_locked,
+   struct mem_cgroup *memcg,
+   unsigned long *vm_flags)
+{
+   int result, needs_lock;
+
+   needs_lock = needs_page_mmap_mutex(page);
+   if (needs_lock)
+   mutex_lock(&page->mapping->i_mmap_mutex);
+   result = __page_referenced(page, is_locked, memcg, vm_flags);
+   if (needs_lock)
+   mutex_unlock(&page->mapping->i_mmap_mutex);
+   return result;
+}
+
 static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
unsigned long address)
 {
@@ -1560,7 +1579,7 @@ static int try_to_unmap_file(struct page *page, enum 
ttu_flags flags)
unsigned long max_nl_size = 0;
unsigned int mapcount;
 
-   mutex_lock(&mapping->i_mmap_mutex);
+   BUG_ON(!mutex_is_locked(&mapping->i_mmap_mutex));
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
if (address == -EFAULT)
@@ -1640,7 +1659,24 @@ static int try_to_unmap_file(struct page *page, enum 
ttu_flags flags)
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
 out:
-   mutex_unlock(&mapping->i_mmap_mutex);
+   return ret;
+}
+
+int __try_to_unmap(struct page *page, enum ttu_flags flags)
+{
+   int ret;
+
+   BU

[PATCH 3/3 v2] mm: Batch page_check_references in shrink_page_list sharing the same i_mmap_mutex

2012-09-10 Thread Tim Chen
In shrink_page_list, calls to page_referenced_file and try_to_unmap cause the
acquisition/release of mapping->i_mmap_mutex for each page in the page
list.  However, it is very likely that successive pages in the list
share the same mapping, so we can reduce the frequency of i_mmap_mutex
acquisition by holding the mutex in shrink_page_list before calling
__page_referenced and __try_to_unmap. This improves performance when the
system does a lot of page reclamation for file-mapped pages, which happens
when workloads use a lot of memory for page cache.
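
The locking pattern, distilled into an illustrative sketch of what the diff
below does ("held" is just a local name used for the sketch):

	struct mutex *held = NULL;
	struct page *page;

	list_for_each_entry(page, page_list, lru) {
		if (needs_page_mmap_mutex(page) &&
		    held != &page->mapping->i_mmap_mutex) {
			if (held)
				mutex_unlock(held);	/* done with previous mapping */
			held = &page->mapping->i_mmap_mutex;
			mutex_lock(held);		/* locked once per run of pages */
		}
		/* __page_referenced()/__try_to_unmap() now run with the mutex held */
	}
	if (held)
		mutex_unlock(held);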

Tim

---
Signed-off-by: Tim Chen 
Signed-off-by: Matthew Wilcox 
--- 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d4ab646..0428639 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -786,7 +786,7 @@ static enum page_references page_check_references(struct 
page *page,
int referenced_ptes, referenced_page;
unsigned long vm_flags;
 
-   referenced_ptes = page_referenced(page, 1, mz->mem_cgroup, &vm_flags);
+   referenced_ptes = __page_referenced(page, 1, mz->mem_cgroup, &vm_flags);
referenced_page = TestClearPageReferenced(page);
 
/* Lumpy reclaim - ignore references */
@@ -856,6 +856,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
+   struct mutex *i_mmap_mutex = NULL;
 
cond_resched();
 
@@ -909,7 +910,15 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
}
}
 
+   if (needs_page_mmap_mutex(page) &&
+  i_mmap_mutex != &page->mapping->i_mmap_mutex) {
+   if (i_mmap_mutex)
+   mutex_unlock(i_mmap_mutex);
+   i_mmap_mutex = &page->mapping->i_mmap_mutex;
+   mutex_lock(i_mmap_mutex);
+   }
references = page_check_references(page, mz, sc);
+
switch (references) {
case PAGEREF_ACTIVATE:
goto activate_locked;
@@ -939,7 +948,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 * processes. Try to unmap it here.
 */
if (page_mapped(page) && mapping) {
-   switch (try_to_unmap(page, TTU_UNMAP)) {
+   switch (__try_to_unmap(page, TTU_UNMAP)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1090,6 +1099,8 @@ keep_lumpy:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
 
+   if (i_mmap_mutex)
+   mutex_unlock(i_mmap_mutex);
nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
   &free_pages);
 








[PATCH 0/3 v2] mm: Batch page reclamation under shink_page_list

2012-09-10 Thread Tim Chen
This is the second version of the patch series. Thanks to Matthew Wilcox 
for many valuable suggestions on improving the patches.

To do page reclamation in the shrink_page_list function, two locks are
taken on a page-by-page basis.  One is the tree lock protecting the radix
tree of the page mapping, and the other is the mapping->i_mmap_mutex
protecting the mapped pages.  I try to batch the operations on pages
sharing the same lock to reduce lock contention.  The first patch batches
the operations protected by the tree lock, while the second and third
patches batch the operations protected by the i_mmap_mutex.

I managed to get a 14% throughput improvement with a workload putting
heavy pressure on the page cache by reading many large mmaped files
simultaneously on an 8 socket Westmere server.

Tim

Signed-off-by: Tim Chen 
---
Diffstat 

 include/linux/rmap.h |8 +++-
 mm/rmap.c|  110 ++---
 mm/vmscan.c  |  113 +-
 3 files changed, 185 insertions(+), 46 deletions(-)


















[PATCH 1/3 v2] mm: Batch unmapping of file mapped pages in shrink_page_list

2012-09-10 Thread Tim Chen
We gather the pages that need to be unmapped in shrink_page_list.  We batch
the unmap operations in the __remove_mapping_batch routine to reduce the
frequency of acquisition of the tree lock protecting the mapping's radix
tree.  This is possible as successive pages likely share the same mapping.
It avoids excessive cache bouncing of the tree lock when page reclamations
are occurring simultaneously.

Tim
---
Signed-off-by: Tim Chen 
--- 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aac5672..d4ab646 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -600,6 +600,85 @@ cannot_free:
return 0;
 }
 
+/* Same as __remove_mapping, but batching operations to minimize locking */
+/* Pages to be unmapped should be locked first */
+static int __remove_mapping_batch(struct list_head *unmap_pages,
+ struct list_head *ret_pages,
+ struct list_head *free_pages)
+{
+   struct address_space *mapping, *next;
+   LIST_HEAD(swap_pages);
+   swp_entry_t swap;
+   struct page *page;
+   int nr_reclaimed;
+
+   mapping = NULL;
+   nr_reclaimed = 0;
+   while (!list_empty(unmap_pages)) {
+
+   page = lru_to_page(unmap_pages);
+   BUG_ON(!PageLocked(page));
+
+   list_del(&page->lru);
+   next = page_mapping(page);
+   if (mapping != next) {
+   if (mapping)
+   spin_unlock_irq(&mapping->tree_lock);
+   mapping = next;
+   spin_lock_irq(&mapping->tree_lock);
+   }
+
+   if (!page_freeze_refs(page, 2))
+   goto cannot_free;
+   if (unlikely(PageDirty(page))) {
+   page_unfreeze_refs(page, 2);
+   goto cannot_free;
+   }
+
+   if (PageSwapCache(page)) {
+   __delete_from_swap_cache(page);
+   /* swapcache_free need to be called without tree_lock */
+   list_add(&page->lru, &swap_pages);
+   } else {
+   void (*freepage)(struct page *);
+
+   freepage = mapping->a_ops->freepage;
+
+   __delete_from_page_cache(page);
+   mem_cgroup_uncharge_cache_page(page);
+
+   if (freepage != NULL)
+   freepage(page);
+
+   unlock_page(page);
+   nr_reclaimed++;
+   list_add(&page->lru, free_pages);
+   }
+   continue;
+cannot_free:
+   unlock_page(page);
+   list_add(&page->lru, ret_pages);
+   VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
+
+   }
+
+   if (mapping)
+   spin_unlock_irq(&mapping->tree_lock);
+
+   while (!list_empty(&swap_pages)) {
+   page = lru_to_page(&swap_pages);
+   list_del(&page->lru);
+
+   swap.val = page_private(page);
+   swapcache_free(swap, page);
+
+   unlock_page(page);
+   nr_reclaimed++;
+   list_add(&page->lru, free_pages);
+   }
+
+   return nr_reclaimed;
+}
 /*
  * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
  * someone else has a ref on the page, abort and return 0.  If it was
@@ -771,6 +850,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 {
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
+   LIST_HEAD(unmap_pages);
int pgactivate = 0;
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
@@ -969,17 +1049,13 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
}
}
 
-   if (!mapping || !__remove_mapping(mapping, page))
+   if (!mapping)
goto keep_locked;
 
-   /*
-* At this point, we have no other references and there is
-* no way to pick any more up (removed from LRU, removed
-* from pagecache). Can use non-atomic bitops now (and
-* we obviously don't have to worry about waking up a process
-* waiting on the page lock, because there are no references.
-*/
-   __clear_page_locked(page);
+   /* remove pages from mapping in batch at end of loop */
+   list_add(&page->lru, &unmap_pages);
+   continue;
+
 free_it:
nr_reclaimed++;
 
@@ -1014,6 +1090,9 @@ keep_lumpy:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
 
+   nr_reclaimed += __remove_mapping_batch(&unmap_pages, &ret_pages,
+  

Re: [PATCH 00/33] AutoNUMA27

2012-10-05 Thread Tim Chen
On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton  writes:
> 
> > On Thu,  4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli  wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week. 
> 
> Did they pass review? I have some doubts.
> 
> The last time I looked it also broke numactl.
> 
> > Guys, what's the plan here?
> 
> Since they are both performance features their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower)
> 
> IMHO needs a performance shot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of of work. Any volunteers?
> 
> For a change like this I think less regression is actually more
> important than the highest peak numbers.
> 
> -Andi
> 

I remember that 3 months ago, when Alex tested the numa/sched patches,
there was a 20% regression on SpecJbb2005 due to the numa balancer.
Those issues may have been fixed, but we probably need to run this
benchmark against the latest version.  For most of the other kernel
performance workloads we ran, we didn't see much change.

Maurico has a different config for this benchmark and it would be nice
if he could also check whether there are any performance changes on his
side.

Tim




[PATCH 0/4] Patchset to use PCLMULQDQ to accelerate CRC-T10DIF checksum computation

2013-04-16 Thread Tim Chen
Herbert,

Currently the CRC-T10DIF checksum is computed using a generic table lookup
algorithm.  By switching the checksum to PCLMULQDQ based computation,
we can speed up the computation by 8x for checksumming 512 bytes and
even more for larger buffer sizes.  This will improve the performance of SCSI
drivers that turn on the CRC-T10DIF checksum.  In our SSD based experiments,
we have seen disk throughput increase by 3.5x with T10DIF.

This patchset provides the x86_64 routine using the PCLMULQDQ instruction
and switches the crc_t10dif library function to use the faster PCLMULQDQ
based routine when available.
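
Callers keep using the existing one-call library interface.  A minimal
illustrative sketch (compute_guard_tag() and "sector" are made-up names for
this sketch) of how a storage path computes the 16-bit guard tag over a
512-byte sector:

	#include <linux/crc-t10dif.h>

	static __u16 compute_guard_tag(const unsigned char *sector)
	{
		/* dispatches to the PCLMULQDQ routine when it is available */
		return crc_t10dif(sector, 512);
	}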

Will appreciate if you can consider merging this for the 3.10 kernel.

Tim

Tim Chen (4):
  Wrap crc_t10dif function all to use crypto transform framework
  Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
  Glue code to cast accelerated CRCT10DIF assembly as a crypto
transform
  Simple correctness and speed test for CRCT10DIF hash

 arch/x86/crypto/Makefile|   2 +
 arch/x86/crypto/crct10dif-pcl-asm_64.S  | 659 
 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 
 crypto/Kconfig  |  21 +
 crypto/tcrypt.c |   8 +
 crypto/testmgr.c|  10 +
 crypto/testmgr.h|  24 ++
 include/linux/crc-t10dif.h  |  10 +
 lib/crc-t10dif.c|  96 +
 9 files changed, 983 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S
 create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c

-- 
1.7.11.7



[PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash

2013-04-16 Thread Tim Chen
These are simple tests to do a sanity check of the CRC T10 DIF hash.  The
correctness of the transform can be checked with the command
modprobe tcrypt mode=47
The speed of the transform can be evaluated with the command
modprobe tcrypt mode=320

Set the cpu frequency to a constant value and turn turbo off when running the
speed test so the frequency governor will not tweak the frequency and
affect the measurements.

Signed-off-by: Tim Chen 
Tested-by: Keith Busch 
---
 crypto/tcrypt.c  |  8 
 crypto/testmgr.c | 10 ++
 crypto/testmgr.h | 24 
 3 files changed, 42 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 24ea7df..5e95722 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1174,6 +1174,10 @@ static int do_test(int m)
ret += tcrypt_test("ghash");
break;
 
+   case 47:
+   ret += tcrypt_test("crct10dif");
+   break;
+
case 100:
ret += tcrypt_test("hmac(md5)");
break;
@@ -1498,6 +1502,10 @@ static int do_test(int m)
test_hash_speed("crc32c", sec, generic_hash_speed_template);
if (mode > 300 && mode < 400) break;
 
+   case 320:
+   test_hash_speed("crct10dif", sec, generic_hash_speed_template);
+   if (mode > 300 && mode < 400) break;
+
case 399:
break;
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 3807084..b165316 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1944,6 +1944,16 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+   .alg = "crct10dif",
+   .test = alg_test_hash,
+   .fips_allowed = 1,
+   .suite = {
+   .hash = {
+   .vecs = crct10dif_tv_template,
+   .count = CRCT10DIF_TEST_VECTORS
+   }
+   }
+   }, {
.alg = "cryptd(__driver-cbc-aes-aesni)",
.test = alg_test_null,
.fips_allowed = 1,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d503660..6262b74 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -450,6 +450,30 @@ static struct hash_testvec rmd320_tv_template[] = {
}
 };
 
+#define CRCT10DIF_TEST_VECTORS 2
+static struct hash_testvec crct10dif_tv_template[] = {
+   {
+   .plaintext = "abc",
+   .psize  = 3,
+#ifdef __LITTLE_ENDIAN
+   .digest = "\x3b\x44",
+#else
+   .digest = "\x44\x3b",
+#endif
+   }, {
+   .plaintext =
+   "abcd",
+   .psize  = 56,
+#ifdef __LITTLE_ENDIAN
+   .digest = "\xe3\x9c",
+#else
+   .digest = "\x9c\xe3",
+#endif
+   .np = 2,
+   .tap= { 28, 28 }
+   }
+};
+
 /*
  * SHA1 test vectors  from from FIPS PUB 180-1
  * Long vector from CAVS 5.0
-- 
1.7.11.7



[PATCH 3/4] Glue code to cast accelerated CRCT10DIF assembly as a crypto transform

2013-04-16 Thread Tim Chen
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the
crypto framework.  The config CRYPTO_CRCT10DIF_PCLMUL should be turned
on to enable the feature.  The crc_t10dif crypto library function will
use this faster algorithm when crct10dif_pclmul module is loaded.

Signed-off-by: Tim Chen 
Tested-by: Keith Busch 
---
 arch/x86/crypto/Makefile|   2 +
 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 
 crypto/Kconfig  |  21 +
 3 files changed, 176 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 03cd731..d544a66 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
+obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
 
 # These modules require assembler to support AVX.
 ifeq ($(avx_supported),yes)
@@ -70,3 +71,4 @@ crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
 crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
 sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o 
sha256_ssse3_glue.o
 sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o 
sha512_ssse3_glue.o
+crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c 
b/arch/x86/crypto/crct10dif-pclmul_glue.c
new file mode 100644
index 000..e87f8d8
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -0,0 +1,153 @@
+/*
+ * Cryptographic API.
+ *
+ * T10 Data Integrity Field CRC16 Crypto Xform using PCLMULQDQ Instructions
+ *
+ * Copyright (C) 2013 Intel Corporation
+ * Author: Tim Chen 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+asmlinkage __u16 crc_t10dif_pcl(__u16 crc, const unsigned char *buf,
+   size_t len);
+
+struct chksum_desc_ctx {
+   __u16 crc;
+};
+
+/*
+ * Computes the CRC T10 DIF checksum, using the PCLMULQDQ routine when the
+ * FPU is usable and falling back to the table-driven implementation otherwise.
+ */
+
+static int chksum_init(struct shash_desc *desc)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   ctx->crc = 0;
+
+   return 0;
+}
+
+static int chksum_update(struct shash_desc *desc, const u8 *data,
+unsigned int length)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   if (irq_fpu_usable()) {
+   kernel_fpu_begin();
+   ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
+   kernel_fpu_end();
+   } else
+   ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
+   return 0;
+}
+
+static int chksum_final(struct shash_desc *desc, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   *(__u16 *)out = ctx->crc;
+   return 0;
+}
+
+static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len,
+   u8 *out)
+{
+   if (irq_fpu_usable()) {
+   kernel_fpu_begin();
+   *(__u16 *)out = crc_t10dif_pcl(*crcp, data, len);
+   kernel_fpu_end();
+   } else
+   *(__u16 *)out = crc_t10dif_generic(*crcp, data, len);
+   return 0;
+}
+
+static int chksum_finup(struct shash_desc *desc, const u8 *data,
+   unsigned int len, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   return __chksum_finup(&ctx->crc, data, len, out);
+}
+
+static int chksum_digest(struct shash_desc *desc, const u8 *data,
+unsigned int length, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   return __chksum_finup(&ctx->crc, data, length, out);
+}
+
+static struct shash_alg alg = {
+   .digestsize =   CRC_T10DIF_DIGEST_SIZE,
+   .init   =   chksum_init,
+   .update =   chksum_update,
+   .final

[PATCH 1/4] Wrap crc_t10dif function all to use crypto transform framework

2013-04-16 Thread Tim Chen
When CRC T10 DIF is calculated using the crypto transform framework, we
wrap the crc_t10dif function call to utilize it.  This allows us to
take advantage of any accelerated CRC T10 DIF transform that is
plugged into the crypto framework.

Signed-off-by: Tim Chen 
Tested-by: Keith Busch 
---
 include/linux/crc-t10dif.h | 10 +
 lib/crc-t10dif.c   | 96 ++
 2 files changed, 106 insertions(+)

diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h
index a9c96d8..f0eb4d5 100644
--- a/include/linux/crc-t10dif.h
+++ b/include/linux/crc-t10dif.h
@@ -3,6 +3,16 @@
 
 #include 
 
+#ifdef CONFIG_CRYPTO_CRCT10DIF
+
+#define CRC_T10DIF_DIGEST_SIZE 2
+#define CRC_T10DIF_BLOCK_SIZE 1
+
+__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len);
+void crc_t10dif_update_lib(void);
+
+#endif /* CONFIG_CRYPTO_CRCT10DIF */
+
 __u16 crc_t10dif(unsigned char const *, size_t);
 
 #endif
diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c
index fbbd66e..41f469a 100644
--- a/lib/crc-t10dif.c
+++ b/lib/crc-t10dif.c
@@ -11,6 +11,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 /* Table generated using the following polynomium:
  * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
@@ -51,6 +54,98 @@ static const __u16 t10_dif_crc_table[256] = {
0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
 };
 
+#ifdef CONFIG_CRYPTO_CRCT10DIF
+
+static struct crypto_shash *crct10dif_tfm;
+
+/* flag to indicate update to new algorithm in progress*/
+static bool crc_t10dif_newalg;
+
+__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len)
+{
+   unsigned int i;
+
+   for (i = 0 ; i < len ; i++)
+   crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 
0xff];
+
+   return crc;
+}
+EXPORT_SYMBOL(crc_t10dif_generic);
+
+/*
+ * If we have defined crypto transform for CRC-T10DIF, use that instead.
+ * This allows us to plug in fast version of CRC-T10DIF when available.
+ */
+
+void crc_t10dif_update_lib()
+{
+   struct crypto_shash *old_tfm, *new_tfm;
+
+   old_tfm = crct10dif_tfm;
+   crc_t10dif_newalg = true;
+   /* make sure new alg flag is turned on before starting to switch tfm */
+   mb();
+
+   new_tfm = crypto_alloc_shash("crct10dif", 0, 0);
+   if (IS_ERR(new_tfm))
+   goto done;
+
+   if (old_tfm)
+   if (crypto_tfm_alg_priority(&old_tfm->base) >=
+   crypto_tfm_alg_priority(&new_tfm->base)) {
+   crypto_free_shash(new_tfm);
+   goto done;
+   }
+   crct10dif_tfm = new_tfm;
+   /* make sure update to tfm pointer is completed */
+   mb();
+   crypto_free_shash(old_tfm);
+
+done:
+   crc_t10dif_newalg = false;
+}
+EXPORT_SYMBOL(crc_t10dif_update_lib);
+
+__u16 crc_t10dif(const unsigned char *buffer, size_t len)
+{
+   struct {
+   struct shash_desc shash;
+   char ctx[2];
+   } desc;
+   int err;
+
+   /* plugging in new alg or not using a tfm? */
+   if (unlikely(crc_t10dif_newalg) || (!crct10dif_tfm))
+   return crc_t10dif_generic(0, buffer, len);
+
+   desc.shash.tfm = crct10dif_tfm;
+   desc.shash.flags = 0;
+   *(__u16 *)desc.ctx = 0;
+
+   err = crypto_shash_update(&desc.shash, buffer, len);
+   BUG_ON(err);
+
+   return *(__u16 *)desc.ctx;
+}
+EXPORT_SYMBOL(crc_t10dif);
+
+static int __init crc_t10dif_mod_init(void)
+{
+   crct10dif_tfm = NULL;
+   crc_t10dif_newalg = false;
+   return 0;
+}
+
+static void __exit crc_t10dif_mod_fini(void)
+{
+   if (crct10dif_tfm)
+   crypto_free_shash(crct10dif_tfm);
+}
+
+module_init(crc_t10dif_mod_init);
+module_exit(crc_t10dif_mod_fini);
+
+#else
 __u16 crc_t10dif(const unsigned char *buffer, size_t len)
 {
__u16 crc = 0;
@@ -62,6 +157,7 @@ __u16 crc_t10dif(const unsigned char *buffer, size_t len)
return crc;
 }
 EXPORT_SYMBOL(crc_t10dif);
+#endif
 
 MODULE_DESCRIPTION("T10 DIF CRC calculation");
 MODULE_LICENSE("GPL");
-- 
1.7.11.7



[PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction

2013-04-16 Thread Tim Chen
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
instructions.  Details discussing the implementation can be found in the
paper:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
URL: http://download.intel.com/design/intarch/papers/323102.pdf
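
For readers comparing against the assembly, here is a minimal scalar
reference for the same checksum (a sketch, not part of the patch).
CRC-T10DIF is a plain MSB-first 16-bit CRC with the polynomial
x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1 (0x8BB7),
so the PCLMULQDQ code must produce the same result as:

	#include <linux/types.h>

	static u16 crc_t10dif_bitwise(u16 crc, const unsigned char *buf, size_t len)
	{
		size_t i;
		int bit;

		for (i = 0; i < len; i++) {
			crc ^= (u16)buf[i] << 8;	/* feed next byte, MSB first */
			for (bit = 0; bit < 8; bit++)	/* long division by the polynomial */
				crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7 : (crc << 1);
		}
		return crc;
	}

For example, crc_t10dif_bitwise(0, "abc", 3) gives 0x443b, matching the
tcrypt test vector in patch 4/4.  The folding loops in the assembly perform
this same polynomial division 64 bytes at a time with carry-less multiplies,
and the final Barrett reduction brings the result back down to 16 bits.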

Signed-off-by: Tim Chen 
Tested-by: Keith Busch 
---
 arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 +
 1 file changed, 659 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S

diff --git a/arch/x86/crypto/crct10dif-pcl-asm_64.S 
b/arch/x86/crypto/crct10dif-pcl-asm_64.S
new file mode 100644
index 000..c497d0e
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-pcl-asm_64.S
@@ -0,0 +1,659 @@
+
+# Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions
+#
+# Copyright (c) 2013, Intel Corporation
+#
+# Authors:
+# Erdinc Ozturk 
+# Vinodh Gopal 
+# James Guilford 
+#     Tim Chen 
+#
+# This software is available to you under a choice of one of two
+# licenses.  You may choose to be licensed under the terms of the GNU
+# General Public License (GPL) Version 2, available from the file
+# COPYING in the main directory of this source tree, or the
+# OpenIB.org BSD license below:
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in the
+#   documentation and/or other materials provided with the
+#   distribution.
+#
+# * Neither the name of the Intel Corporation nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+#
+# THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES# LOSS OF USE, DATA, OR
+# PROFITS# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+#   Function API:
+#   UINT16 crc_t10dif_pcl(
+#   UINT16 init_crc, //initial CRC value, 16 bits
+#   const unsigned char *buf, //buffer pointer to calculate CRC on
+#   UINT64 len //buffer length in bytes (64-bit data)
+#   )#
+#
+#   Reference paper titled "Fast CRC Computation for Generic
+#  Polynomials Using PCLMULQDQ Instruction"
+#   URL: http://download.intel.com/design/intarch/papers/323102.pdf
+#
+#
+
+#include 
+
+.text
+
+#definearg1 %rdi
+#definearg2 %rsi
+#definearg3 %rdx
+
+#definearg1_low32 %edi
+
+ENTRY(crc_t10dif_pcl)
+.align 16
+
+   # adjust the 16-bit initial_crc value, scale it to 32 bits
+   shl $16, arg1_low32
+
+   # Allocate Stack Space
+   mov %rsp, %rcx
+   sub $16*10, %rsp
+   and $~(0x20 - 1), %rsp
+
+   # push the xmm registers into the stack to maintain
+   movdqa %xmm10, 16*2(%rsp)
+   movdqa %xmm11, 16*3(%rsp)
+   movdqa %xmm8 , 16*4(%rsp)
+   movdqa %xmm12, 16*5(%rsp)
+   movdqa %xmm13, 16*6(%rsp)
+   movdqa %xmm6,  16*7(%rsp)
+   movdqa %xmm7,  16*8(%rsp)
+   movdqa %xmm9,  16*9(%rsp)
+
+
+   # check if smaller than 256
+   cmp $256, arg3
+
+   # for sizes less than 128, we can't fold 64B at a time...
+   jl  _less_than_128
+
+
+   # load the initial crc value
+   movdarg1_low32, %xmm10  # initial crc
+
+   # crc value does not need to be byte-reflected, but it needs
+   # to be moved to the high part of the register.
+   # because data will be byte-reflected and will align with
+   # initial crc at correct place.
+   pslldq  $12, %xmm10
+
+   movdqa  SHUF_MASK(%rip), %xmm11
+   # receive the initial 64B data, xor the initial crc value
+   movdqu  16*0(arg2), %xmm0
+   movdqu  16*1(arg2), %xmm1
+   movdqu  16*2(arg2), %xmm2
+   movdqu  16*3(arg2), %xmm3
+   movdqu  16*4(arg2), %xmm4
+   movdqu  16*5(arg2), %xmm5
+   m

Re: [PATCH 4/4] Simple correctness and speed test for CRCT10DIF hash

2013-04-17 Thread Tim Chen
On Wed, 2013-04-17 at 20:58 +0300, Jussi Kivilinna wrote:
> On 16.04.2013 19:20, Tim Chen wrote:
> > These are simple tests to do sanity check of CRC T10 DIF hash.  The
> > correctness of the transform can be checked with the command
> > modprobe tcrypt mode=47
> > The speed of the transform can be evaluated with the command
> > modprobe tcrypt mode=320
> > 
> > Set the cpu frequency to constant and turn turbo off when running the
> > speed test so the frequency governor will not tweak the frequency and
> > affects the measurements.
> > 
> > Signed-off-by: Tim Chen 
> > Tested-by: Keith Busch 
> 
> >  
> > +#define CRCT10DIF_TEST_VECTORS 2
> > +static struct hash_testvec crct10dif_tv_template[] = {
> > +   {
> > +   .plaintext = "abc",
> > +   .psize  = 3,
> > +#ifdef __LITTLE_ENDIAN
> > +   .digest = "\x3b\x44",
> > +#else
> > +   .digest = "\x44\x3b",
> > +#endif
> > +   }, {
> > +   .plaintext =
> > +   "abcd",
> > +   .psize  = 56,
> > +#ifdef __LITTLE_ENDIAN
> > +   .digest = "\xe3\x9c",
> > +#else
> > +   .digest = "\x9c\xe3",
> > +#endif
> > +   .np = 2,
> > +   .tap= { 28, 28 }
> > +   }
> > +};
> > +
> 
> Are these large enough to test all code paths in the PCLMULQDQ implementation?

We have tested the implementation exhaustively over a large number of code
paths.  The tcrypt tests here are only meant as a sanity check, not as
exhaustive tests.

Tim




Re: [PATCH 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction

2013-04-17 Thread Tim Chen
On Wed, 2013-04-17 at 20:58 +0300, Jussi Kivilinna wrote:
> On 16.04.2013 19:20, Tim Chen wrote:
> > This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
> > instructions.  Details discussing the implementation can be found in the
> > paper:
> > 
> > "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
> > URL: http://download.intel.com/design/intarch/papers/323102.pdf
> 
> URL does not work.

Thanks for catching this. Will update.

> 
> > 
> > Signed-off-by: Tim Chen 
> > Tested-by: Keith Busch 
> > ---
> >  arch/x86/crypto/crct10dif-pcl-asm_64.S | 659 
> > +
> >  1 file changed, 659 insertions(+)
> >  create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S
> 
> > +
> > +   # Allocate Stack Space
> > +   mov %rsp, %rcx
> > +   sub $16*10, %rsp
> > +   and $~(0x20 - 1), %rsp
> > +
> > +   # push the xmm registers into the stack to maintain
> > +   movdqa %xmm10, 16*2(%rsp)
> > +   movdqa %xmm11, 16*3(%rsp)
> > +   movdqa %xmm8 , 16*4(%rsp)
> > +   movdqa %xmm12, 16*5(%rsp)
> > +   movdqa %xmm13, 16*6(%rsp)
> > +   movdqa %xmm6,  16*7(%rsp)
> > +   movdqa %xmm7,  16*8(%rsp)
> > +   movdqa %xmm9,  16*9(%rsp)
> 
> You don't need to store (and restore) these, as 'crc_t10dif_pcl' is called 
> between kernel_fpu_begin/_end.

That's true.  Will skip the xmm save/restore in update to the patch.
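
For reference, the calling convention being relied on here is already visible
in the glue code from patch 3/4: the SIMD state is saved and restored around
the call by kernel_fpu_begin()/kernel_fpu_end(), so the assembly itself does
not need to preserve the xmm registers.

	if (irq_fpu_usable()) {
		kernel_fpu_begin();	/* saves FPU/XMM state for us */
		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
		kernel_fpu_end();	/* restores it */
	} else
		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);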

> 
> > +
> > +
> > +   # check if smaller than 256
> > +   cmp $256, arg3
> > +
> 
> > +_cleanup:
> > +   # scale the result back to 16 bits
> > +   shr $16, %eax
> > +   movdqa  16*2(%rsp), %xmm10
> > +   movdqa  16*3(%rsp), %xmm11
> > +   movdqa  16*4(%rsp), %xmm8
> > +   movdqa  16*5(%rsp), %xmm12
> > +   movdqa  16*6(%rsp), %xmm13
> > +   movdqa  16*7(%rsp), %xmm6
> > +   movdqa  16*8(%rsp), %xmm7
> > +   movdqa  16*9(%rsp), %xmm9
> 
> Registers are overwritten by kernel_fpu_end.
> 
> > +   mov %rcx, %rsp
> > +   ret
> > +ENDPROC(crc_t10dif_pcl)
> > +
> 
> You should move ENDPROC at end of the full function.
> 
> > +
> > +
> > +.align 16
> > +_less_than_128:
> > +
> > +   # check if there is enough buffer to be able to fold 16B at a time
> > +   cmp $32, arg3
> 
> > +   movdqa  (%rsp), %xmm7
> > +   pshufb  %xmm11, %xmm7
> > +   pxor%xmm0 , %xmm7   # xor the initial crc value
> > +
> > +   psrldq  $7, %xmm7
> > +
> > +   jmp _barrett
> 
> Move ENDPROC here.
> 

Will do.

Tim





[PATCH v2 1/4] Wrap crc_t10dif function all to use crypto transform framework

2013-04-17 Thread Tim Chen
When CRC T10 DIF is calculated using the crypto transform framework, we
wrap the crc_t10dif function call to utilize it.  This allows us to
take advantage of any accelerated CRC T10 DIF transform that is
plugged into the crypto framework.

Signed-off-by: Tim Chen 
---
 include/linux/crc-t10dif.h | 10 +
 lib/crc-t10dif.c   | 96 ++
 2 files changed, 106 insertions(+)

diff --git a/include/linux/crc-t10dif.h b/include/linux/crc-t10dif.h
index a9c96d8..f0eb4d5 100644
--- a/include/linux/crc-t10dif.h
+++ b/include/linux/crc-t10dif.h
@@ -3,6 +3,16 @@
 
 #include 
 
+#ifdef CONFIG_CRYPTO_CRCT10DIF
+
+#define CRC_T10DIF_DIGEST_SIZE 2
+#define CRC_T10DIF_BLOCK_SIZE 1
+
+__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len);
+void crc_t10dif_update_lib(void);
+
+#endif /* CONFIG_CRYPTO_CRCT10DIF */
+
 __u16 crc_t10dif(unsigned char const *, size_t);
 
 #endif
diff --git a/lib/crc-t10dif.c b/lib/crc-t10dif.c
index fbbd66e..41f469a 100644
--- a/lib/crc-t10dif.c
+++ b/lib/crc-t10dif.c
@@ -11,6 +11,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 /* Table generated using the following polynomium:
  * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
@@ -51,6 +54,98 @@ static const __u16 t10_dif_crc_table[256] = {
0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
 };
 
+#ifdef CONFIG_CRYPTO_CRCT10DIF
+
+static struct crypto_shash *crct10dif_tfm;
+
+/* flag to indicate update to new algorithm in progress*/
+static bool crc_t10dif_newalg;
+
+__u16 crc_t10dif_generic(__u16 crc, const unsigned char *buffer, size_t len)
+{
+   unsigned int i;
+
+   for (i = 0 ; i < len ; i++)
+   crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 
0xff];
+
+   return crc;
+}
+EXPORT_SYMBOL(crc_t10dif_generic);
+
+/*
+ * If we have defined crypto transform for CRC-T10DIF, use that instead.
+ * This allows us to plug in fast version of CRC-T10DIF when available.
+ */
+
+void crc_t10dif_update_lib()
+{
+   struct crypto_shash *old_tfm, *new_tfm;
+
+   old_tfm = crct10dif_tfm;
+   crc_t10dif_newalg = true;
+   /* make sure new alg flag is turned on before starting to switch tfm */
+   mb();
+
+   new_tfm = crypto_alloc_shash("crct10dif", 0, 0);
+   if (IS_ERR(new_tfm))
+   goto done;
+
+   if (old_tfm)
+   if (crypto_tfm_alg_priority(&old_tfm->base) >=
+   crypto_tfm_alg_priority(&new_tfm->base)) {
+   crypto_free_shash(new_tfm);
+   goto done;
+   }
+   crct10dif_tfm = new_tfm;
+   /* make sure update to tfm pointer is completed */
+   mb();
+   crypto_free_shash(old_tfm);
+
+done:
+   crc_t10dif_newalg = false;
+}
+EXPORT_SYMBOL(crc_t10dif_update_lib);
+
+__u16 crc_t10dif(const unsigned char *buffer, size_t len)
+{
+   struct {
+   struct shash_desc shash;
+   char ctx[2];
+   } desc;
+   int err;
+
+   /* plugging in new alg or not using a tfm? */
+   if (unlikely(crc_t10dif_newalg) || (!crct10dif_tfm))
+   return crc_t10dif_generic(0, buffer, len);
+
+   desc.shash.tfm = crct10dif_tfm;
+   desc.shash.flags = 0;
+   *(__u16 *)desc.ctx = 0;
+
+   err = crypto_shash_update(&desc.shash, buffer, len);
+   BUG_ON(err);
+
+   return *(__u16 *)desc.ctx;
+}
+EXPORT_SYMBOL(crc_t10dif);
+
+static int __init crc_t10dif_mod_init(void)
+{
+   crct10dif_tfm = NULL;
+   crc_t10dif_newalg = false;
+   return 0;
+}
+
+static void __exit crc_t10dif_mod_fini(void)
+{
+   if (crct10dif_tfm)
+   crypto_free_shash(crct10dif_tfm);
+}
+
+module_init(crc_t10dif_mod_init);
+module_exit(crc_t10dif_mod_fini);
+
+#else
 __u16 crc_t10dif(const unsigned char *buffer, size_t len)
 {
__u16 crc = 0;
@@ -62,6 +157,7 @@ __u16 crc_t10dif(const unsigned char *buffer, size_t len)
return crc;
 }
 EXPORT_SYMBOL(crc_t10dif);
+#endif
 
 MODULE_DESCRIPTION("T10 DIF CRC calculation");
 MODULE_LICENSE("GPL");
-- 
1.7.11.7



[PATCH v2 0/4] Patchset to use PCLMULQDQ to accelerate CRC-T10DIF checksum computation

2013-04-17 Thread Tim Chen
Currently the CRC-T10DIF checksum is computed using a generic table lookup
algorithm.  By switching the checksum to PCLMULQDQ based computation,
we can speed up the computation by 8x for checksumming 512 bytes and
even more for larger buffer sizes.  This will improve the performance of SCSI
drivers that turn on the CRC-T10DIF checksum.  In our SSD based experiments,
we have seen disk throughput increase by 3.5x with T10DIF for 512 byte
block size.

This patch set provides the x86_64 routine using PCLMULQDQ instruction
and switches the crc_t10dif library function to use the faster PCLMULQDQ
based routine when available.

Tim

v1->v2
1. Get rid of unnecessary xmm registers save and restore and fix ENDPROC
position in PCLMULQDQ version of crc t10dif computation.
2. Fix URL to paper reference of CRC computation with PCLMULQDQ. 
3. Add one additional tcrypt test case to exercise more code paths through 
crc t10dif computation.
4. Fix config dependencies of CRYPTO_CRCT10DIF.

Thanks to Matthew and Jussi who reviewed the patches and Keith
for testing version 1 of the patch set.  

Tim Chen (4):
  Wrap crc_t10dif function all to use crypto transform framework
  Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
  Glue code to cast accelerated CRCT10DIF assembly as a crypto
transform
  Simple correctness and speed test for CRCT10DIF hash

 arch/x86/crypto/Makefile|   2 +
 arch/x86/crypto/crct10dif-pcl-asm_64.S  | 643 
 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 
 crypto/Kconfig  |  21 ++
 crypto/tcrypt.c |   8 +
 crypto/testmgr.c|  10 +
 crypto/testmgr.h|  33 ++
 include/linux/crc-t10dif.h  |  10 +
 lib/crc-t10dif.c|  96 +
 9 files changed, 976 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S
 create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c

-- 
1.7.11.7



[PATCH v2 3/4] Glue code to cast accelerated CRCT10DIF assembly as a crypto transform

2013-04-17 Thread Tim Chen
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the
crypto framework.  The config CRYPTO_CRCT10DIF_PCLMUL should be turned
on to enable the feature.  The crc_t10dif crypto library function will
use this faster algorithm when crct10dif_pclmul module is loaded.

Signed-off-by: Tim Chen 
---
 arch/x86/crypto/Makefile|   2 +
 arch/x86/crypto/crct10dif-pclmul_glue.c | 153 
 crypto/Kconfig  |  21 +
 3 files changed, 176 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pclmul_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 03cd731..d544a66 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
+obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
 
 # These modules require assembler to support AVX.
 ifeq ($(avx_supported),yes)
@@ -70,3 +71,4 @@ crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
 crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
 sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o 
sha256_ssse3_glue.o
 sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o 
sha512_ssse3_glue.o
+crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c 
b/arch/x86/crypto/crct10dif-pclmul_glue.c
new file mode 100644
index 000..e87f8d8
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -0,0 +1,153 @@
+/*
+ * Cryptographic API.
+ *
+ * T10 Data Integrity Field CRC16 Crypto Xform using PCLMULQDQ Instructions
+ *
+ * Copyright (C) 2013 Intel Corporation
+ * Author: Tim Chen 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+asmlinkage __u16 crc_t10dif_pcl(__u16 crc, const unsigned char *buf,
+   size_t len);
+
+struct chksum_desc_ctx {
+   __u16 crc;
+};
+
+/*
+ * Computes the CRC T10 DIF checksum, using the PCLMULQDQ routine when the
+ * FPU is usable and falling back to the table-driven implementation otherwise.
+ */
+
+static int chksum_init(struct shash_desc *desc)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   ctx->crc = 0;
+
+   return 0;
+}
+
+static int chksum_update(struct shash_desc *desc, const u8 *data,
+unsigned int length)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   if (irq_fpu_usable()) {
+   kernel_fpu_begin();
+   ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
+   kernel_fpu_end();
+   } else
+   ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
+   return 0;
+}
+
+static int chksum_final(struct shash_desc *desc, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   *(__u16 *)out = ctx->crc;
+   return 0;
+}
+
+static int __chksum_finup(__u16 *crcp, const u8 *data, unsigned int len,
+   u8 *out)
+{
+   if (irq_fpu_usable()) {
+   kernel_fpu_begin();
+   *(__u16 *)out = crc_t10dif_pcl(*crcp, data, len);
+   kernel_fpu_end();
+   } else
+   *(__u16 *)out = crc_t10dif_generic(*crcp, data, len);
+   return 0;
+}
+
+static int chksum_finup(struct shash_desc *desc, const u8 *data,
+   unsigned int len, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   return __chksum_finup(&ctx->crc, data, len, out);
+}
+
+static int chksum_digest(struct shash_desc *desc, const u8 *data,
+unsigned int length, u8 *out)
+{
+   struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+
+   return __chksum_finup(&ctx->crc, data, length, out);
+}
+
+static struct shash_alg alg = {
+   .digestsize =   CRC_T10DIF_DIGEST_SIZE,
+   .init   =   chksum_init,
+   .update =   chksum_update,
+   .final  =   chksum_final,
+ 

[PATCH v2 2/4] Accelerated CRC T10 DIF computation with PCLMULQDQ instruction

2013-04-17 Thread Tim Chen
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
instructions.  Details discussing the implementation can be found in the
paper:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Signed-off-by: Tim Chen 
---
 arch/x86/crypto/crct10dif-pcl-asm_64.S | 643 +
 1 file changed, 643 insertions(+)
 create mode 100644 arch/x86/crypto/crct10dif-pcl-asm_64.S

diff --git a/arch/x86/crypto/crct10dif-pcl-asm_64.S 
b/arch/x86/crypto/crct10dif-pcl-asm_64.S
new file mode 100644
index 000..35e9756
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-pcl-asm_64.S
@@ -0,0 +1,643 @@
+
+# Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions
+#
+# Copyright (c) 2013, Intel Corporation
+#
+# Authors:
+# Erdinc Ozturk 
+# Vinodh Gopal 
+# James Guilford 
+#     Tim Chen 
+#
+# This software is available to you under a choice of one of two
+# licenses.  You may choose to be licensed under the terms of the GNU
+# General Public License (GPL) Version 2, available from the file
+# COPYING in the main directory of this source tree, or the
+# OpenIB.org BSD license below:
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in the
+#   documentation and/or other materials provided with the
+#   distribution.
+#
+# * Neither the name of the Intel Corporation nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+#
+# THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+#   Function API:
+#   UINT16 crc_t10dif_pcl(
+#   UINT16 init_crc, //initial CRC value, 16 bits
+#   const unsigned char *buf, //buffer pointer to calculate CRC on
+#   UINT64 len //buffer length in bytes (64-bit data)
+#   );
+#
+#   Reference paper titled "Fast CRC Computation for Generic
+#  Polynomials Using PCLMULQDQ Instruction"
+#   URL: http://www.intel.com/content/dam/www/public/us/en/documents
+#  /white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
+#
+#
+
+#include 
+
+.text
+
+#definearg1 %rdi
+#definearg2 %rsi
+#definearg3 %rdx
+
+#definearg1_low32 %edi
+
+ENTRY(crc_t10dif_pcl)
+.align 16
+
+   # adjust the 16-bit initial_crc value, scale it to 32 bits
+   shl $16, arg1_low32
+
+   # Allocate Stack Space
+   mov %rsp, %rcx
+   sub $16*2, %rsp
+   # align stack to 16 byte boundary
+   and $~(0x10 - 1), %rsp
+
+   # check if smaller than 256
+   cmp $256, arg3
+
+   # for sizes less than 128, we can't fold 64B at a time...
+   jl  _less_than_128
+
+
+   # load the initial crc value
+   movdarg1_low32, %xmm10  # initial crc
+
+   # crc value does not need to be byte-reflected, but it needs
+   # to be moved to the high part of the register.
+   # because data will be byte-reflected and will align with
+   # initial crc at correct place.
+   pslldq  $12, %xmm10
+
+   movdqa  SHUF_MASK(%rip), %xmm11
+   # receive the initial 64B data, xor the initial crc value
+   movdqu  16*0(arg2), %xmm0
+   movdqu  16*1(arg2), %xmm1
+   movdqu  16*2(arg2), %xmm2
+   movdqu  16*3(arg2), %xmm3
+   movdqu  16*4(arg2), %xmm4
+   movdqu  16*5(arg2), %xmm5
+   movdqu  16*6(arg2), %xmm6
+   movdqu  16*7(arg2), %xmm7
+
+   pshufb  %xmm11, %xmm0
+   # XOR the initial_crc value
+   pxor%xmm10, %xmm0
+   pshufb

[PATCH v2 4/4] Simple correctness and speed test for CRCT10DIF hash

2013-04-17 Thread Tim Chen
These are simple tests to do a sanity check of the CRC T10 DIF hash.  The
correctness of the transform can be checked with the command
	modprobe tcrypt mode=47
The speed of the transform can be evaluated with the command
	modprobe tcrypt mode=320

Set the cpu frequency to a constant value and turn turbo off when running the
speed test, so the frequency governor will not tweak the frequency and
affect the measurements.
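
For reference, here is a minimal user-space sketch (not part of this patch,
illustration only) of the generic bit-by-bit CRC T10 DIF computation over the
0x8BB7 generator polynomial; it reproduces the 0x443b digest expected for
"abc" in the first test vector added below:

#include <stdint.h>
#include <stdio.h>

/* Bit-by-bit CRC T10 DIF: polynomial 0x8BB7, initial value 0, no reflection. */
static uint16_t crc_t10dif_bitwise(uint16_t crc, const unsigned char *buf,
				   size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
					     : crc << 1;
	}
	return crc;
}

int main(void)
{
	/* Prints 443b, which is stored little-endian as "\x3b\x44". */
	printf("%04x\n", crc_t10dif_bitwise(0, (const unsigned char *)"abc", 3));
	return 0;
}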

Signed-off-by: Tim Chen 
---
 crypto/tcrypt.c  |  8 
 crypto/testmgr.c | 10 ++
 crypto/testmgr.h | 33 +
 3 files changed, 51 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 24ea7df..5e95722 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1174,6 +1174,10 @@ static int do_test(int m)
ret += tcrypt_test("ghash");
break;
 
+   case 47:
+   ret += tcrypt_test("crct10dif");
+   break;
+
case 100:
ret += tcrypt_test("hmac(md5)");
break;
@@ -1498,6 +1502,10 @@ static int do_test(int m)
test_hash_speed("crc32c", sec, generic_hash_speed_template);
if (mode > 300 && mode < 400) break;
 
+   case 320:
+   test_hash_speed("crct10dif", sec, generic_hash_speed_template);
+   if (mode > 300 && mode < 400) break;
+
case 399:
break;
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 3807084..b165316 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1944,6 +1944,16 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+   .alg = "crct10dif",
+   .test = alg_test_hash,
+   .fips_allowed = 1,
+   .suite = {
+   .hash = {
+   .vecs = crct10dif_tv_template,
+   .count = CRCT10DIF_TEST_VECTORS
+   }
+   }
+   }, {
.alg = "cryptd(__driver-cbc-aes-aesni)",
.test = alg_test_null,
.fips_allowed = 1,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index d503660..56916d0 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -450,6 +450,39 @@ static struct hash_testvec rmd320_tv_template[] = {
}
 };
 
+#define CRCT10DIF_TEST_VECTORS 3
+static struct hash_testvec crct10dif_tv_template[] = {
+   {
+   .plaintext = "abc",
+   .psize  = 3,
+#ifdef __LITTLE_ENDIAN
+   .digest = "\x3b\x44",
+#else
+   .digest = "\x44\x3b",
+#endif
+   }, {
+   .plaintext = "1234567890123456789012345678901234567890"
+"123456789012345678901234567890123456789",
+   .psize  = 79,
+#ifdef __LITTLE_ENDIAN
+   .digest = "\x70\x4b",
+#else
+   .digest = "\x4b\x70",
+#endif
+   }, {
+   .plaintext =
+   "abcd",
+   .psize  = 56,
+#ifdef __LITTLE_ENDIAN
+   .digest = "\xe3\x9c",
+#else
+   .digest = "\x9c\xe3",
+#endif
+   .np = 2,
+   .tap= { 28, 28 }
+   }
+};
+
 /*
  * SHA1 test vectors  from from FIPS PUB 180-1
  * Long vector from CAVS 5.0
-- 
1.7.11.7



[PATCH] Fix prototype definitions of sha256_transform_asm, sha512_transform_asm

2013-04-19 Thread Tim Chen
Herbert,

This is a follow-on patch to the optimized sha256 and sha512 patch series
that was just merged into crypto-dev.  Let me know if you prefer me to
respin the patch series.

This patch corrects the prototypes of the sha256_transform_asm and
sha512_transform_asm function pointer declarations, making them static.
It also fixes a typo in the sha512_ssse3_final function that affects the
computation of the upper 64 bits of the buffer size in bits.
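
To illustrate the corrected computation, here is a small user-space sketch
(hypothetical values, not kernel code): the byte count is kept in two 64-bit
words, so converting it to the 128-bit bit count shifts the whole value left
by 3, and the top 3 bits of the low word must be carried into the high word
before each half is byte-swapped.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* count[0] = low 64 bits of the byte count, count[1] = high 64 bits
	 * (hypothetical values for illustration). */
	uint64_t count[2] = { 0xe000000000000005ULL, 0x1ULL };
	uint64_t bits_hi, bits_lo;

	/* Message length in bits = byte count << 3, as a 128-bit value. */
	bits_lo = count[0] << 3;
	bits_hi = (count[1] << 3) | (count[0] >> 61);

	/* In sha512_ssse3_final both halves are then passed through
	 * cpu_to_be64(); the bug was OR-ing the carry in after the byte
	 * swap instead of before it. */
	printf("bits = %016llx%016llx\n",
	       (unsigned long long)bits_hi, (unsigned long long)bits_lo);
	return 0;
}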

Thanks.

Tim

Signed-off-by: Tim Chen 
---
 arch/x86/crypto/sha256_ssse3_glue.c | 2 +-
 arch/x86/crypto/sha512_ssse3_glue.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/crypto/sha256_ssse3_glue.c 
b/arch/x86/crypto/sha256_ssse3_glue.c
index fa65453..8a0b711 100644
--- a/arch/x86/crypto/sha256_ssse3_glue.c
+++ b/arch/x86/crypto/sha256_ssse3_glue.c
@@ -53,7 +53,7 @@ asmlinkage void sha256_transform_rorx(const char *data, u32 
*digest,
 u64 rounds);
 #endif
 
-asmlinkage void (*sha256_transform_asm)(const char *, u32 *, u64);
+static void (*sha256_transform_asm)(const char *, u32 *, u64);
 
 
 static int sha256_ssse3_init(struct shash_desc *desc)
diff --git a/arch/x86/crypto/sha512_ssse3_glue.c 
b/arch/x86/crypto/sha512_ssse3_glue.c
index 295f790..3c844f2 100644
--- a/arch/x86/crypto/sha512_ssse3_glue.c
+++ b/arch/x86/crypto/sha512_ssse3_glue.c
@@ -52,7 +52,7 @@ asmlinkage void sha512_transform_rorx(const char *data, u64 
*digest,
 u64 rounds);
 #endif
 
-asmlinkage void (*sha512_transform_asm)(const char *, u64 *, u64);
+static void (*sha512_transform_asm)(const char *, u64 *, u64);
 
 
 static int sha512_ssse3_init(struct shash_desc *desc)
@@ -141,7 +141,7 @@ static int sha512_ssse3_final(struct shash_desc *desc, u8 
*out)
 
/* save number of bits */
bits[1] = cpu_to_be64(sctx->count[0] << 3);
-   bits[0] = cpu_to_be64(sctx->count[1] << 3) | sctx->count[0] >> 61;
+   bits[0] = cpu_to_be64(sctx->count[1] << 3 | sctx->count[0] >> 61);
 
/* Pad out to 112 mod 128 and append length */
index = sctx->count[0] & 0x7f;
-- 
1.7.11.7





Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-13 Thread Tim Chen
Ingo,

At the time of switching the anon-vma tree's lock from a mutex to an
rw-sem (commit 5a505085), we encountered regressions for fork-heavy workloads.
A lot of optimizations to the rw-sem (e.g. lock stealing) helped to
mitigate the problem.  I tried an experiment on the 3.10-rc4 kernel
to compare the performance of the rw-sem against an implementation using
a mutex, and saw an 8% regression in throughput for the rw-sem versus the
mutex implementation on 3.10-rc4.

For the experiments, I used the exim mail server workload from
the MOSBENCH test suite on a 4-socket (Westmere) and a 4-socket
(Ivy Bridge) machine, with the number of clients sending mail equal
to the number of cores.  The mail server forks off a process to handle
each incoming mail and put it into the mail spool.  The lock protecting
the anon-vma tree is stressed due to the heavy forking.  On both machines,
I saw that the mutex implementation has 8% more throughput.  I've pinned
the cpu frequency to maximum in the experiments.

I've tried two separate tweaks to the rw-sem on 3.10-rc4.  I've tested 
each tweak individually.

1) Add an owner field when a writer holds the lock and introduce 
optimistic spinning when an active writer is holding the semaphore.  
It reduced the context switching by 30% to a level very close to the
mutex implementation.  However, I did not see any throughput improvement
of exim.

2) When the sem->count's active field is non-zero (i.e. someone
is holding the lock), we can skip directly to the down_write_failed
path, without adding the RWSEM_DOWN_WRITE_BIAS and taking
it off again from sem->count, saving us two atomic operations.
Since we will try the lock stealing again later, this should be okay.
Unfortunately it did not improve the exim workload either.  

Any suggestions on the difference between rwsem and mutex performance
and possible improvements to recover this regression?

Thanks.

Tim

vmstat for mutex implementation: 
procs ---memory-- ---swap-- -io --system-- -cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
38  0  0 130957920  47860 19995600 056 236342 476975 14 72 
14  0  0
41  0  0 130938560  47860 21990000 0 0 236816 479676 14 72 
14  0  0

vmstat for rw-sem implementation (3.10-rc4)
procs ---memory-- ---swap-- -io --system-- -cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
40  0  0 130933984  43232 20258400 0 0 321817 690741 13 71 
16  0  0
39  0  0 130913904  43232 22481200 0 0 322193 692949 13 71 
16  0  0


Profile for mutex implementation:
5.02% exim  [kernel.kallsyms] [k] page_fault
3.67% exim  [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.66% exim  [kernel.kallsyms] [k] unmap_single_vma
2.15% exim  [kernel.kallsyms] [k] do_raw_spin_lock
2.14% exim  [kernel.kallsyms] [k] page_cache_get_speculative
2.04% exim  [kernel.kallsyms] [k] copy_page_rep
1.58% exim  [kernel.kallsyms] [k] clear_page_c
1.55% exim  [kernel.kallsyms] [k] cpu_relax
1.55% exim  [kernel.kallsyms] [k] mutex_unlock
1.42% exim  [kernel.kallsyms] [k] __slab_free
1.16% exim  [kernel.kallsyms] [k] mutex_lock  
1.12% exim  libc-2.13.so  [.] vfprintf   
0.99% exim  [kernel.kallsyms] [k] find_vma  
0.95% exim  [kernel.kallsyms] [k] __list_del_entry

Profile for rw-sem implementation
4.88% exim  [kernel.kallsyms] [k] page_fault
3.43% exim  [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.65% exim  [kernel.kallsyms] [k] unmap_single_vma
2.46% exim  [kernel.kallsyms] [k] do_raw_spin_lock
2.25% exim  [kernel.kallsyms] [k] copy_page_rep
2.01% exim  [kernel.kallsyms] [k] page_cache_get_speculative
1.81% exim  [kernel.kallsyms] [k] clear_page_c
1.51% exim  [kernel.kallsyms] [k] __slab_free
1.12% exim  libc-2.13.so  [.] vfprintf
1.06% exim  [kernel.kallsyms] [k] __list_del_entry
1.02%  swapper  [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.00% exim  [kernel.kallsyms] [k] find_vma
0.93% exim  [kernel.kallsyms] [k] mutex_unlock


turbostat for mutex implementation:
pk cor CPU%c0  GHz  TSC%c1%c3%c6 CTMP   %pc3   %pc6
82.91 2.39 2.39  11.65   2.76   2.68   51   0.00   0.00

turbostat of rw-sem implementation (3.10-rc4):
pk cor CPU%c0  GHz  TSC%c1%c3%c6 CTMP   %pc3   %pc6
80.10 2.39 2.39  14.96   2.80   2.13   52   0.00   0.00





Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-14 Thread Tim Chen
Adding a copy to the mailing list, which I forgot in my previous reply:

On Thu, 2013-06-13 at 16:43 -0700, Davidlohr Bueso wrote:
> On Thu, 2013-06-13 at 16:15 -0700, Tim Chen wrote:
> > Ingo,
> > 
> > At the time of switching the anon-vma tree's lock from mutex to 
> > rw-sem (commit 5a505085), we encountered regressions for fork heavy 
> > workload. 
> > A lot of optimizations to rw-sem (e.g. lock stealing) helped to 
> > mitigate the problem.  I tried an experiment on the 3.10-rc4 kernel 
> > to compare the performance of rw-sem to one that uses mutex. I saw 
> > a 8% regression in throughput for rw-sem vs a mutex implementation in
> > 3.10-rc4.
> 
> Funny, just yesterday I was discussing this issue with Michel. While I
> didn't measure the anon-vma mutex->rwem conversion, I did convert the
> i_mmap_mutex to a rwsem and noticed a performance regression on a few
> aim7 workloads on a 8 socket, 80 core box, when keeping all writers,
> which should perform very similarly to a mutex. While some of these
> workloads recovered when I shared the lock among readers (similar to
> anon-vma), it left me wondering.
> 
> > For the experiments, I used the exim mail server workload in 
> > the MOSBENCH test suite on 4 socket (westmere) and a 4 socket 
> > (ivy bridge) with the number of clients sending mail equal 
> > to number of cores.  The mail server will
> > fork off a process to handle an incoming mail and put it into mail
> > spool. The lock protecting the anon-vma tree is stressed due to
> > heavy forking. On both machines, I saw that the mutex implementation 
> > has 8% more throughput.  I've pinned the cpu frequency to maximum
> > in the experiments.
> 
> I got some similar -8% throughput on high_systime and shared.
> 

That's interesting. Another perspective on rwsem vs mutex.

> > 
> > I've tried two separate tweaks to the rw-sem on 3.10-rc4.  I've tested 
> > each tweak individually.
> > 
> > 1) Add an owner field when a writer holds the lock and introduce 
> > optimistic spinning when an active writer is holding the semaphore.  
> > It reduced the context switching by 30% to a level very close to the
> > mutex implementation.  However, I did not see any throughput improvement
> > of exim.
> 
> I was hoping that the lack of spin on owner was the main difference with
> rwsems and am/was in the middle of implementing it. Could you send your
> patch so I can give it a try on my workloads?
> 
> Note that there have been a few recent (3.10) changes to mutexes that
> give a nice performance boost, specially on large systems, most
> noticeably:
> 
> commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> operations)
> 
> commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> cacheline contention)
> 
> It might be worth looking into doing something similar to commit
> 0dc8c730, in addition to the optimistic spinning.

Okay.  Here's my ugly experimental hack with some code lifted from the
optimistic spinning within mutex.  I've thought about doing the MCS lock
thing but decided to keep the first try at optimistic spinning simple.

Matthew and I have also discussed possibly introducing some
limited spinning for writers when the semaphore is held by readers.
His idea was to have readers as well as writers set ->owner.
Writers, as now, unconditionally clear owner.  Readers clear
owner if sem->owner == current.  Writers spin on ->owner if ->owner
is non-NULL and still active.  That gives us a reasonable chance
to spin, since we'll be spinning on the most recent acquirer of the lock.
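
A rough sketch of that idea (illustration only -- it is not implemented in
the patch below, and the helper names are made up; it assumes the ->owner
field the patch adds):

/* Readers also publish themselves in sem->owner, so a contending
 * writer always has a recent lock acquirer to spin on. */
static inline void rwsem_reader_set_owner(struct rw_semaphore *sem)
{
	sem->owner = current;
}

static inline void rwsem_reader_clear_owner(struct rw_semaphore *sem)
{
	/* Only clear the field if we are still the published owner. */
	if (sem->owner == current)
		sem->owner = NULL;
}

/* Writer side: worth spinning only while the most recent acquirer
 * (reader or writer) is still running on a cpu. */
static inline bool rwsem_owner_running(struct rw_semaphore *sem)
{
	struct task_struct *owner = ACCESS_ONCE(sem->owner);

	return owner && owner->on_cpu;
}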

Tim

Signed-off-by: Tim Chen 
---
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..331f5f0 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,7 @@ struct rw_semaphore {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map  dep_map;
 #endif
+   struct task_struct  *owner;
 };
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..916747f 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,16 @@
 
 #include 
 
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+   sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+   sem->owner = NULL;
+}
+
 /*
  * lock for reading
  */
@@ -48,6 +58,7 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   rwsem_set_owner(sem);   
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +70,10 @@ int down_write_trylock(struct r

Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-14 Thread Tim Chen

> 
> Unfortunately this patch didn't make any difference, in fact it hurt
> several of the workloads even more. I also tried disabling preemption
> when spinning on owner to actually resemble spinlocks, which was my
> original plan, yet not much difference. 
> 

That's also similar to the performance I got.  There are things about
optimistic spinning that I've missed that result in the better mutex
performance.  So I'm scratching my head.

> A few ideas that come to mind are avoiding taking the ->wait_lock and
> avoid dealing with waiters when doing the optimistic spinning (just like
> mutexes do).
> 

For my patch, we actually spin without the wait_lock.

> I agree that we should first deal with the optimistic spinning before
> adding the MCS complexity.
> 
> > Matthew and I have also discussed possibly introducing some 
> > limited spinning for writer when semaphore is held by read.  
> > His idea was to have readers as well as writers set ->owner.  
> > Writers, as now, unconditionally clear owner.  Readers clear 
> > owner if sem->owner == current.  Writers spin on ->owner if ->owner 
> > is non-NULL and still active. That gives us a reasonable chance 
> > to spin since we'll be spinning on
> > the most recent acquirer of the lock.
> 
> I also tried implementing this concept on top of your patch, didn't make
> much of a difference with or without it. 
> 

It also didn't make a difference for me.

Tim



Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-17 Thread Tim Chen
On Mon, 2013-06-17 at 09:22 -0700, Davidlohr Bueso wrote:
> On Sun, 2013-06-16 at 17:50 +0800, Alex Shi wrote:
> > On 06/14/2013 07:43 AM, Davidlohr Bueso wrote:
> > > I was hoping that the lack of spin on owner was the main difference with
> > > rwsems and am/was in the middle of implementing it. Could you send your
> > > patch so I can give it a try on my workloads?
> > > 
> > > Note that there have been a few recent (3.10) changes to mutexes that
> > > give a nice performance boost, specially on large systems, most
> > > noticeably:
> > > 
> > > commit 2bd2c92c (mutex: Make more scalable by doing less atomic
> > > operations)
> > > 
> > > commit 0dc8c730 (mutex: Queue mutex spinners with MCS lock to reduce
> > > cacheline contention)
> > > 
> > > It might be worth looking into doing something similar to commit
> > > 0dc8c730, in addition to the optimistic spinning.
> > 
> > It is a good tunning for large machine. I just following what the commit 
> > 0dc8c730 done, give a RFC patch here. I tried it on my NHM EP machine. 
> > seems no
> > clear help on aim7. but maybe it is helpful on large machine.  :)
> 
> After a lot of benchmarking, I finally got the ideal results for aim7,
> so far: this patch + optimistic spinning with preemption disabled. Just
> like optimistic spinning, this patch by itself makes little to no
> difference, yet combined is where we actually outperform 3.10-rc5. In
> addition, I noticed extra throughput when disabling preemption in
> try_optimistic_spin().
> 
> With i_mmap as a rwsem and these changes I could see performance
> benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users, for fewer users, it made little to no difference.
> 

Thanks.  Those are encouraging numbers.  On my exim workload I didn't
get a boost when I added the preempt disable in the optimistic spin and
put Alex's changes in.  Can you send me your combined patch so I can see
if there is something you did that I've missed?  I have a tweak to
Alex's patch below to simplify things a bit.

Tim

> Thanks,
> Davidlohr
> 
> > 
> > 
> > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> > index bb1e2cd..240729a 100644
> > --- a/include/asm-generic/rwsem.h
> > +++ b/include/asm-generic/rwsem.h
> > @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore 
> > *sem)
> >  
> >  static inline int __down_write_trylock(struct rw_semaphore *sem)
> >  {
> > -   long tmp;
> > +   if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> > +   return 0;
> >  
> > -   tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > - RWSEM_ACTIVE_WRITE_BIAS);
> > -   return tmp == RWSEM_UNLOCKED_VALUE;
> > +   return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> >  }
> >  
> >  /*
> > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > index 19c5fa9..9e54e20 100644
> > --- a/lib/rwsem.c
> > +++ b/lib/rwsem.c
> > @@ -64,7 +64,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
> > rwsem_wake_type wake_type)
> > struct rwsem_waiter *waiter;
> > struct task_struct *tsk;
> > struct list_head *next;
> > -   long oldcount, woken, loop, adjustment;
> > +   long woken, loop, adjustment;
> >  
> > waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
> > if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
> > @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
> > rwsem_wake_type wake_type)
> >  * will block as they will notice the queued writer.
> >  */
> > wake_up_process(waiter->task);
> > -   goto out;
> > +   return sem;
> > }
> >  
> > /* Writers might steal the lock before we grant it to the next reader.
> > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
> > rwsem_wake_type wake_type)
> > adjustment = 0;
> > if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > adjustment = RWSEM_ACTIVE_READ_BIAS;
> > - try_reader_grant:
> > -   oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > -   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > -   /* A writer stole the lock. Undo our reader grant. */
> > +   while (1) {
> > +   long oldcount;
> > +
> > +   /* A writer stole the lock. */
> > +   if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > +   return sem;
> > +
> > +   if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > +   cpu_relax();
> > +   continue;
> > +   }

The above two if statements could be cleaned up as a single check:

if (unlikely(sem->count < RWSEM_WAITING_BIAS))
return sem;
 
This one statement is sufficient to check that a writer has not stolen the
lock before we attempt to acquire the read lock by modifying sem->count.

Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-17 Thread Tim Chen
On Fri, 2013-06-14 at 15:47 -0700, Michel Lespinasse wrote:
> On Fri, Jun 14, 2013 at 3:31 PM, Davidlohr Bueso  
> wrote:
> > A few ideas that come to mind are avoiding taking the ->wait_lock and
> > avoid dealing with waiters when doing the optimistic spinning (just like
> > mutexes do).
> >
> > I agree that we should first deal with the optimistic spinning before
> > adding the MCS complexity.
> 
> Maybe it would be worth disabling the MCS patch in mutex and comparing
> that to the rwsem patches ? Just to make sure the rwsem performance
> delta isn't related to that.
> 

I've tried to back out the MCS patch.  In fact, for exim, it is about 1%
faster without MCS.  So the better performance of mutex I saw was not
due to MCS.  Thanks for the suggestion.

Tim



Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-17 Thread Tim Chen
On Mon, 2013-06-17 at 12:05 -0700, Davidlohr Bueso wrote:

> > 
> > Thanks.  Those are encouraging numbers.  On my exim workload I didn't
> > get a boost when I added in the preempt disable in optimistic spin and
> > put Alex's changes in. Can you send me your combined patch to see if
> > there may be something you did that I've missed.  I have a tweak to
> > Alex's patch below to simplify things a bit.  
> > 
> 
> I'm using:
> 
> int rwsem_optimistic_spin(struct rw_semaphore *sem)
> {
> struct  task_struct *owner;
> 
> /* sem->wait_lock should not be held when attempting optimistic 
> spinning */
> if (!rwsem_can_spin_on_owner(sem))
> return 0;
> 
> preempt_disable();
> for (;;) {
> owner = ACCESS_ONCE(sem->owner);
> if (owner && !rwsem_spin_on_owner(sem, owner))
> break;
> 
> /* wait_lock will be acquired if write_lock is obtained */
> if (rwsem_try_write_lock(sem->count, true, sem)) {
> preempt_enable();
> return 1;
> }
> 
> 		/*
> 		 * When there's no owner, we might have preempted between the
> 		 * owner acquiring the lock and setting the owner field. If
> 		 * we're an RT task that will live-lock because we won't let
> 		 * the owner complete.
> 		 */
> if (!owner && (need_resched() || rt_task(current)))
> break;
> 
> 		/*
> 		 * The cpu_relax() call is a compiler barrier which forces
> 		 * everything in this loop to be re-loaded. We don't need
> 		 * memory barriers as we'll eventually observe the right
> 		 * values at the cost of a few extra spins.
> 		 */
> arch_mutex_cpu_relax();
> 
> }
> 
> preempt_enable();
> return 0;
> }

This is identical to the changes that I've tested.  Thanks for sharing.

Tim

> > > > @@ -85,15 +85,28 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
> > > > rwsem_wake_type wake_type)
> > > > adjustment = 0;
> > > > if (wake_type != RWSEM_WAKE_READ_OWNED) {
> > > > adjustment = RWSEM_ACTIVE_READ_BIAS;
> > > > - try_reader_grant:
> > > > -   oldcount = rwsem_atomic_update(adjustment, sem) - 
> > > > adjustment;
> > > > -   if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > > > -   /* A writer stole the lock. Undo our reader 
> > > > grant. */
> > > > +   while (1) {
> > > > +   long oldcount;
> > > > +
> > > > +   /* A writer stole the lock. */
> > > > +   if (unlikely(sem->count & RWSEM_ACTIVE_MASK))
> > > > +   return sem;
> > > > +
> > > > +   if (unlikely(sem->count < RWSEM_WAITING_BIAS)) {
> > > > +   cpu_relax();
> > > > +   continue;
> > > > +   }
> > 
> > The above two if statements could be cleaned up as a single check:
> > 
> > if (unlikely(sem->count < RWSEM_WAITING_BIAS))
> > return sem;
> >  
> > This one statement is sufficient to check that we don't have a writer
> > stolen the lock before we attempt to acquire the read lock by modifying
> > sem->count.  
> 
> We probably still want to keep the cpu relaxation if the statement
> doesn't comply.
> 
> Thanks,
> Davidlohr
> 
> 

Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-17 Thread Tim Chen
On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > so far: this patch + optimistic spinning with preemption disabled. Just
> > > like optimistic spinning, this patch by itself makes little to no
> > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > addition, I noticed extra throughput when disabling preemption in
> > > try_optimistic_spin().
> > > 
> > > With i_mmap as a rwsem and these changes I could see performance
> > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > users, for fewer users, it made little to no difference.
> > 
> > A pretty good number. what's the cpu number in your machine? :)
> 
> 8-socket, 80 cores (ht off)
> 
> 

David,

I wonder if you are interested in trying the experimental patch below.
It tries to avoid unnecessary writes to sem->count when we are
going to fail the down_write, by executing rwsem_down_write_failed_s
instead of rwsem_down_write_failed.  It should further reduce the
cache line bouncing.  It didn't make a difference for my
workload, but I wonder if it may help yours more, in addition to the
other two patches.  Right now the patch is an ugly hack.  I'll merge
rwsem_down_write_failed_s and rwsem_down_write_failed into one
function if this approach actually helps things.

I'll clean up these three patches once we have some idea of their
effectiveness.

Thanks.

Tim

Signed-off-by: Tim Chen 
---
commit 04c8ad3f21861746d5b7fff55a6ef186a4dd0765
Author: Tim Chen 
Date:   Mon Jun 10 04:50:04 2013 -0700

Try skip write to rwsem->count when we have active lockers

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..83f9184 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -33,6 +33,7 @@ struct rw_semaphore {
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
 extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed_s(struct rw_semaphore 
*sem);
 extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
 extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
 
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..188f6ea 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -42,12 +42,22 @@ EXPORT_SYMBOL(down_read_trylock);
 /*
  * lock for writing
  */
+
+static void ___down_write(struct rw_semaphore *sem)
+{
+   if (sem->count & RWSEM_ACTIVE_MASK) {
+   rwsem_down_write_failed_s(sem);
+   return;
+   }
+   __down_write(sem);
+}
+
 void __sched down_write(struct rw_semaphore *sem)
 {
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-   LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+   LOCK_CONTENDED(sem, __down_write_trylock, ___down_write);
 }
 
 EXPORT_SYMBOL(down_write);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..25143b5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -248,6 +248,63 @@ struct rw_semaphore __sched 
*rwsem_down_write_failed(struct rw_semaphore *sem)
return sem;
 }
 
+struct rw_semaphore __sched *rwsem_down_write_failed_s(struct rw_semaphore 
*sem)
+{
+   long count, adjustment = 0;
+   struct rwsem_waiter waiter;
+   struct task_struct *tsk = current;
+
+   /* set up my own style of waitqueue */
+   waiter.task = tsk;
+   waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+   raw_spin_lock_irq(&sem->wait_lock);
+   if (list_empty(&sem->wait_list))
+   adjustment += RWSEM_WAITING_BIAS;
+   list_add_tail(&waiter.list, &sem->wait_list);
+
+   /* If there were already threads queued before us and there are no
+* active writers, the lock must be read owned; so we try to wake
+* any read locks that were queued ahead of us. */
+   if (adjustment == 0) {
+   if (sem->count > RWSEM_WAITING_BIAS)
+   sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
+   } else
+   count = rwsem_atomic_update(adjustment, sem);
+
+   /* wait until we successfully acquire the lock */
+   set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+   while (true) {
+   if (!(sem->count & RWSEM_ACTIVE_MASK)) {
+   /* Try acquiring the write lock. */
+   count = RWSEM_ACTIVE_WRITE_BIAS;
+   if (!list_is_singular(&sem->wait_list))
+   count += RWSEM_WAITING_BIAS;
+

Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-19 Thread Tim Chen
On Wed, 2013-06-19 at 15:16 +0200, Ingo Molnar wrote:

> > vmstat for mutex implementation: 
> > procs ---memory-- ---swap-- -io --system-- 
> > -cpu-
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
> > wa st
> > 38  0  0 130957920  47860 19995600 056 236342 476975 14 
> > 72 14  0  0
> > 41  0  0 130938560  47860 21990000 0 0 236816 479676 14 
> > 72 14  0  0
> > 
> > vmstat for rw-sem implementation (3.10-rc4)
> > procs ---memory-- ---swap-- -io --system-- 
> > -cpu-
> >  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id 
> > wa st
> > 40  0  0 130933984  43232 20258400 0 0 321817 690741 13 
> > 71 16  0  0
> > 39  0  0 130913904  43232 22481200 0 0 322193 692949 13 
> > 71 16  0  0
> 
> It appears the main difference is that the rwsem variant context-switches 
> about 36% more than the mutex version, right?
> 
> I'm wondering how that's possible - the lock is mostly write-locked, 
> correct? So the lock-stealing from Davidlohr Bueso and Michel Lespinasse 
> ought to have brought roughly the same lock-stealing behavior as mutexes 
> do, right?
> 
> So the next analytical step would be to figure out why rwsem lock-stealing 
> is not behaving in an equivalent fashion on this workload. Do readers come 
> in frequently enough to disrupt write-lock-stealing perhaps?
> 
> Context-switch call-graph profiling might shed some light on where the 
> extra context switches come from...
> 
> Something like:
> 
>   perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1
> 
> or a variant thereof might do the trick.
> 

Ingo,

It appears that we are seeing many more down_write failures that cause a
process to block, compared with the mutex case where the process takes the
slowpath instead.

Here's the profile data from
perf record -g -e sched:sched_switch --filter 'prev_state != 0' -a sleep 1

3.10-rc4 (mutex implementation context switch profile)

-  59.51% exim  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
   - __schedule
  - 99.98% schedule
 + 33.07% schedule_timeout
 + 23.84% pipe_wait
 + 20.24% do_wait
 + 12.37% do_exit
 + 5.34% sigsuspend
 - 3.40% schedule_preempt_disabled
  __mutex_lock_common.clone.8
  __mutex_lock_slowpath
- mutex_lock   <-low rate mutex blocking
   + 65.71% lock_anon_vma_root.clone.24
   + 19.03% anon_vma_lock.clone.21
   + 7.14% dup_mm
   + 5.36% unlink_file_vma
   + 1.71% ima_file_check
   + 0.64% generic_file_aio_write
 - 1.07% rwsem_down_write_failed
  call_rwsem_down_write_failed
  exit_shm
  do_exit
  do_group_exit
  SyS_exit_group
  system_call_fastpath
-  27.61%   smtpbm  [kernel.kallsyms]  [k] perf_trace_sched_switch
   - perf_trace_sched_switch
   - __schedule
   - schedule
   - schedule_timeout
  + 100.00% sk_wait_data
+   0.46%  swapper  [kernel.kallsyms]  [k] perf_trace_sched_switch


--
3.10-rc4 implementation (rw-sem context switch profile)

81.91% exim  [kernel.kallsyms]  [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
   - 99.99% schedule
  - 65.26% rwsem_down_write_failed   <--High write lock blocking
 - call_rwsem_down_write_failed
- 79.36% lock_anon_vma_root.clone.27
   + 52.64% unlink_anon_vmas
   + 47.36% anon_vma_clone
+ 12.16% anon_vma_fork
+ 8.00% anon_vma_free
  + 11.96% schedule_timeout
  + 7.66% do_exit
  + 7.61% do_wait
  + 5.49% pipe_wait
  + 1.82% sigsuspend
13.55%   smtpbm  [kernel.kallsyms]  [k] perf_trace_sched_switch
- perf_trace_sched_switch
- __schedule
- schedule
- schedule_timeout
 0.11%rcu_sched  [kernel.kallsyms]  [k] perf_trace_sched_switch


Thanks.

Tim



Re: Performance regression from switching lock to rw-sem for anon-vma tree

2013-06-19 Thread Tim Chen
On Wed, 2013-06-19 at 16:11 -0700, Davidlohr Bueso wrote:
> On Mon, 2013-06-17 at 17:08 -0700, Tim Chen wrote:
> > On Mon, 2013-06-17 at 16:35 -0700, Davidlohr Bueso wrote:
> > > On Tue, 2013-06-18 at 07:20 +0800, Alex Shi wrote:
> > > > On 06/18/2013 12:22 AM, Davidlohr Bueso wrote:
> > > > > After a lot of benchmarking, I finally got the ideal results for aim7,
> > > > > so far: this patch + optimistic spinning with preemption disabled. 
> > > > > Just
> > > > > like optimistic spinning, this patch by itself makes little to no
> > > > > difference, yet combined is where we actually outperform 3.10-rc5. In
> > > > > addition, I noticed extra throughput when disabling preemption in
> > > > > try_optimistic_spin().
> > > > > 
> > > > > With i_mmap as a rwsem and these changes I could see performance
> > > > > benefits for alltests (+14.5%), custom (+17%), disk (+11%), 
> > > > > high_systime
> > > > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > > > users, for fewer users, it made little to no difference.
> > > > 
> > > > A pretty good number. what's the cpu number in your machine? :)
> > > 
> > > 8-socket, 80 cores (ht off)
> > > 
> > > 
> > 
> > David,
> > 
> > I wonder if you are interested to try the experimental patch below.  
> > It tries to avoid unnecessary writes to the sem->count when we are 
> > going to fail the down_write by executing rwsem_down_write_failed_s
> > instead of rwsem_down_write_failed.  It should further reduce the
> > cache line bouncing.  It didn't make a difference for my 
> > workload.  Wonder if it may help yours more in addition to the 
> > other two patches.  Right now the patch is an ugly hack.  I'll merge
> > rwsem_down_write_failed_s and rwsem_down_write_failed into one
> > function if this approach actually helps things.
> > 
> 
> I tried this on top of the patches we've already been dealing with. It
> actually did more harm than good. Only got a slight increase in the
> five_sec workload, for the rest either no effect, or negative. So far
> the best results are still with spin on owner + preempt disable + Alex's
> patches.
> 

Thanks for trying it out.  I'm a little disappointed, as I was expecting
no change in performance in the worst case.

Tim



Re: [PATCH v2 1/2] Make the batch size of the percpu_counter configurable

2013-05-29 Thread Tim Chen
On Wed, 2013-05-29 at 12:26 -0700, Andrew Morton wrote:
> On Wed, 22 May 2013 16:37:18 -0700 Tim Chen  
> wrote:
> 
> > Currently the per cpu counter's batch size for memory accounting is
> > configured as twice the number of cpus in the system.  However,
> > for system with very large memory, it is more appropriate to make it
> > proportional to the memory size per cpu in the system.
> > 
> > For example, for a x86_64 system with 64 cpus and 128 GB of memory,
> > the batch size is only 2*64 pages (0.5 MB).  So any memory accounting
> > changes of more than 0.5MB will overflow the per cpu counter into
> > the global counter.  Instead, for the new scheme, the batch size
> > is configured to be 0.4% of the memory/cpu = 8MB (128 GB/64 /256),
> > which is more inline with the memory size.
> 
> I renamed the patch to "mm: tune vm_committed_as percpu_counter
> batching size".
> 
> Do we have any performance testing results?  They're pretty important
> for a performance-improvement patch ;)
> 

I've done a repeated brk test of 800KB (from the will-it-scale test suite)
with 80 concurrent processes on a 4-socket Westmere machine with a
total of 40 cores.  Without the patch, about 80% of cpu time is spent on
spin-lock contention within the vm_committed_as counter.  With the patch,
there's a 73x speedup on the benchmark and the lock contention drops off
almost entirely.
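
For reference, the sizing heuristic described in the quoted changelog boils
down to something like this user-space sketch (illustrative names and code,
not the actual kernel patch):

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the proposed batch sizing: roughly 0.4% of the memory per
 * cpu, expressed in pages, with the old 2*ncpus value as a floor.
 */
static long compute_batch(uint64_t total_pages, int ncpus)
{
	uint64_t memsized_batch = (total_pages / ncpus) / 256;	/* ~0.4% of mem/cpu */
	long floor = 2 * ncpus;

	return memsized_batch > (uint64_t)floor ? (long)memsized_batch : floor;
}

int main(void)
{
	/* 128 GB of 4 KB pages spread over 64 cpus -> 2048 pages = 8 MB batch */
	uint64_t pages = 128ULL * 1024 * 1024 * 1024 / 4096;

	printf("batch = %ld pages\n", compute_batch(pages, 64));
	return 0;
}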

Tim


