Answering question 2. I have done a comprehensive performance analysis
based on the benchmark application.

Note: The SRU changes how the sys_membarrier syscall is used. The
implementation we are changing to in this SRU never blocks, while the
previous implementation does. This makes performance analysis heavily
workload dependent: on a busy server with many background processes,
the old blocking sys_membarrier call stalls more often than it does on
a quiet server with no background processes.

The following is based on a quiet server with no background processes.

Test parameters
===============
Ubuntu 18.04.4
KVM, 2 vcpus
0.10.1 liburcu
4.15.0-99-generic
Test program "test_urcu[_bp]": http://paste.ubuntu.com/p/5vXVycQjYk/
(only difference is #include <urcu.h> or #include <urcu-bp.h>)
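
For orientation, the paste above is the authoritative source; each
reader and writer thread loops over something like the following sketch
against the public liburcu API (illustrative names and simplified
structure, not the exact benchmark code):

#include <stdlib.h>
#include <urcu.h>               /* or <urcu-bp.h> for test_urcu_bp */

struct test_data { int value; };
static struct test_data *shared_ptr;

/* Reader fast path: the loop body each reader thread spins on, and
 * what nr_reads counts. With urcu.h, each reader thread must call
 * rcu_register_thread() before its first read-side critical section. */
static void reader_iteration(void)
{
        struct test_data *p;

        rcu_read_lock();
        p = rcu_dereference(shared_ptr);
        if (p)
                (void)p->value;         /* consume the protected data */
        rcu_read_unlock();
}

/* Writer slow path: swap the pointer, wait out a grace period, then
 * reclaim. synchronize_rcu() is where sys_membarrier gets involved. */
static void writer_iteration(void)
{
        struct test_data *new, *old;

        new = calloc(1, sizeof(*new));
        old = rcu_xchg_pointer(&shared_ptr, new);
        synchronize_rcu();
        free(old);
}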

No changes to source code
=========================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads   6065490002 nr_writes          237 nr_ops   6065490239
nr_reads   6476219475 nr_writes          186 nr_ops   6476219661
nr_reads   6474789528 nr_writes          183 nr_ops   6474789711
nr_reads   6476326433 nr_writes          188 nr_ops   6476326621
nr_reads   6479298142 nr_writes          179 nr_ops   6479298321
nr_reads   6476429569 nr_writes          186 nr_ops   6476429755
nr_reads   6478019994 nr_writes          191 nr_ops   6478020185
nr_reads   6479117595 nr_writes          183 nr_ops   6479117778
nr_reads   6478302181 nr_writes          185 nr_ops   6478302366
nr_reads   6481003399 nr_writes          191 nr_ops   6481003590

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads    644339902 nr_writes          485 nr_ops    644340387
nr_reads    644092800 nr_writes         1101 nr_ops    644093901
nr_reads    644676446 nr_writes          494 nr_ops    644676940
nr_reads    643845915 nr_writes          500 nr_ops    643846415
nr_reads    645156053 nr_writes          502 nr_ops    645156555
nr_reads    644626421 nr_writes          497 nr_ops    644626918
nr_reads    644710679 nr_writes          495 nr_ops    644711174
nr_reads    644445530 nr_writes          503 nr_ops    644446033
nr_reads    645150707 nr_writes          497 nr_ops    645151204
nr_reads    643681268 nr_writes          496 nr_ops    643681764

Commits c0bb9f and 374530 patched in
====================================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads   4097663510 nr_writes         6516 nr_ops   4097670026
nr_reads   4177088332 nr_writes         4183 nr_ops   4177092515
nr_reads   4153780077 nr_writes         1907 nr_ops   4153781984
nr_reads   4150954044 nr_writes         3942 nr_ops   4150957986
nr_reads   4267855073 nr_writes         2102 nr_ops   4267857175
nr_reads   4131310825 nr_writes         7119 nr_ops   4131317944
nr_reads   4183771431 nr_writes         1919 nr_ops   4183773350
nr_reads   4270944170 nr_writes         4958 nr_ops   4270949128
nr_reads   4123277225 nr_writes         4228 nr_ops   4123281453
nr_reads   4266997284 nr_writes         1723 nr_ops   4266999007


ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads   6530208343 nr_writes         8860 nr_ops   6530217203
nr_reads   6514357222 nr_writes        10568 nr_ops   6514367790
nr_reads   6517420660 nr_writes         9534 nr_ops   6517430194
nr_reads   6510005433 nr_writes        11799 nr_ops   6510017232
nr_reads   6492226563 nr_writes        12517 nr_ops   6492239080
nr_reads   6532405460 nr_writes         6548 nr_ops   6532412008
nr_reads   6514205150 nr_writes         9686 nr_ops   6514214836
nr_reads   6481643486 nr_writes        16167 nr_ops   6481659653
nr_reads   6509268022 nr_writes        10582 nr_ops   6509278604
nr_reads   6523168701 nr_writes         9066 nr_ops   6523177767


Comparing and contrasting with 20.04:
=====================================

Test Parameters:
================
Ubuntu 20.04 LTS
KVM, 2 vcpus
0.11.1 liburcu
5.4.0-29-generic

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads   4270089636 nr_writes         1638 nr_ops   4270091274
nr_reads   4281598850 nr_writes         3008 nr_ops   4281601858
nr_reads   4241230576 nr_writes         3612 nr_ops   4241234188
nr_reads   4230643208 nr_writes         5367 nr_ops   4230648575
nr_reads   4333495124 nr_writes         1354 nr_ops   4333496478
nr_reads   4291295097 nr_writes         3545 nr_ops   4291298642
nr_reads   4232582737 nr_writes         1983 nr_ops   4232584720
nr_reads   4268926719 nr_writes         3363 nr_ops   4268930082
nr_reads   4266736459 nr_writes         4881 nr_ops   4266741340
nr_reads   4313525276 nr_writes         4549 nr_ops   4313529825

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads   6848011482 nr_writes         3171 nr_ops   6848014653
nr_reads   6842990129 nr_writes         4577 nr_ops   6842994706
nr_reads   6862298832 nr_writes         2875 nr_ops   6862301707
nr_reads   6849848255 nr_writes         4292 nr_ops   6849852547
nr_reads   6846387545 nr_writes         4975 nr_ops   6846392520
nr_reads   6860547626 nr_writes         3376 nr_ops   6860551002
nr_reads   6853028794 nr_writes         2784 nr_ops   6853031578
nr_reads   6846021299 nr_writes         3383 nr_ops   6846024682
nr_reads   6833359957 nr_writes         5917 nr_ops   6833365874
nr_reads   6851224193 nr_writes         2432 nr_ops   6851226625

Comparing and contrasting with 14.04:
=====================================

Test Parameters:
================
Ubuntu 14.04.6 LTS
KVM, 2 vcpus
0.7.12 liburcu
3.13.0-170-generic

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu 6 2 10
nr_reads    284080749 nr_writes       790657 nr_ops    284871406
nr_reads    283785838 nr_writes       647058 nr_ops    284432896
nr_reads    273424217 nr_writes      1535098 nr_ops    274959315
nr_reads    283550711 nr_writes      1442548 nr_ops    284993259
nr_reads    282557773 nr_writes       946106 nr_ops    283503879
nr_reads    286811777 nr_writes       837176 nr_ops    287648953
nr_reads    273278986 nr_writes      1738549 nr_ops    275017535
nr_reads    287141686 nr_writes       652772 nr_ops    287794458
nr_reads    287697411 nr_writes       982440 nr_ops    288679851
nr_reads    281468419 nr_writes       830736 nr_ops    282299155

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu_bp 6 2 10
nr_reads    670447719 nr_writes        16731 nr_ops    670464450
nr_reads    670464435 nr_writes         9970 nr_ops    670474405
nr_reads    670235233 nr_writes         4932 nr_ops    670240165
nr_reads    670853867 nr_writes         6845 nr_ops    670860712
nr_reads    670970962 nr_writes          307 nr_ops    670971269
nr_reads    670346111 nr_writes         8161 nr_ops    670354272
nr_reads    669748209 nr_writes         6824 nr_ops    669755033
nr_reads    671242419 nr_writes          249 nr_ops    671242668
nr_reads    670318007 nr_writes         8990 nr_ops    670326997
nr_reads    669872685 nr_writes          269 nr_ops    669872954

Analysis
========

Comparing the two Bionic runs of test_urcu, nr_ops drops from
6065490239 unpatched to 4097670026 patched, roughly a one-third
reduction in raw numbers. However, the patched result is in line with
what you would expect on Focal: 4097670026 vs 4270091274.

For test_urcu_bp, the two Bionic runs show a dramatic difference: nr_ops
goes from 644340387 unpatched to 6530217203 patched, a 10x improvement.
Again these numbers match Focal, at 6848014653 operations.

Compared to Trusty, both variants show a substantial improvement across
the board.

The next question is whether this benchmark is an appropriate
demonstration of performance. Since the SRU changes which sys_membarrier
command is used, we should also profile the syscall itself, as that is
what drives performance in real workloads: the unpatched version blocks
in sys_membarrier, so we would expect it to invoke the syscall far less
often.

Perf Performance Analysis on "sys_enter_membarrier" Tracepoint
==============================================================

No changes to source code
=========================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads   5641721906 nr_writes          932 nr_ops   5641722838
607      syscalls:sys_enter_membarrier
nr_reads   6168632959 nr_writes          248 nr_ops   6168633207
595      syscalls:sys_enter_membarrier
nr_reads   6481069225 nr_writes          185 nr_ops   6481069410
567      syscalls:sys_enter_membarrier
      
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads    644124499 nr_writes          501 nr_ops    644125000
1      syscalls:sys_enter_membarrier
nr_reads    646275413 nr_writes         2287 nr_ops    646277700
1      syscalls:sys_enter_membarrier
nr_reads    644021303 nr_writes          494 nr_ops    644021797
1      syscalls:sys_enter_membarrier
      
Commits c0bb9f and 374530 patched in
====================================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads   4322995476 nr_writes         3320 nr_ops   4322998796
835874      syscalls:sys_enter_membarrier
nr_reads   4210380395 nr_writes         2206 nr_ops   4210382601
883042      syscalls:sys_enter_membarrier
nr_reads   4233636203 nr_writes         3280 nr_ops   4233639483
867184      syscalls:sys_enter_membarrier
      
      
# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads   6539807379 nr_writes         5289 nr_ops   6539812668
10578      syscalls:sys_enter_membarrier
nr_reads   6500401303 nr_writes        13287 nr_ops   6500414590
26574      syscalls:sys_enter_membarrier
nr_reads   6518640060 nr_writes         8780 nr_ops   6518648840
17560      syscalls:sys_enter_membarrier

Analysis
========

Now, this is some interesting data. With the unchanged Bionic source
code, we see 607 sys_membarrier syscalls in 10 seconds for test_urcu,
and only 1 for test_urcu_bp. That single call is effectively zero:
commit [1] 64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6 removed the use of
sys_membarrier from urcu-bp because of the major performance problems
its blocking behaviour caused in LTTng.

[1] 
https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6
(note this was backported to 0.10.1 stable release, and is in Bionic)
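
To see where that lone call comes from, an strace run along these lines
(a hypothetical invocation; it needs an strace recent enough to know the
membarrier syscall name) will print each membarrier call together with
its command argument:

# strace -f -e trace=membarrier ./test_urcu_bp 6 2 10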

Looking at the patched version, the test_urcu call count to
sys_membarrier skyrockets to 835874, a whopping 1377x increase: from
roughly 61 syscalls/sec to 83587 syscalls/sec. This demonstrates that
the patched liburcu no longer blocks in the syscall: each call completes
and leaves kernel space quickly, so it can be issued far more often.

The patches re-enable the use of sys_membarrier in the urcu-bp variant,
and the syscall is now invoked on the order of 10,000 - 20,000 times
over 10 seconds. This is what is behind the massive 10x increase in the
number of operations the test did: readers no longer execute a full
userspace memory barrier on every read-side critical section, because
the writer's non-blocking membarrier syscall imposes the barrier on
their behalf.
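
Conceptually, the patches trade a per-read cost for a per-grace-period
cost. The following is a simplified sketch of that trade-off
(illustrative helper names, not liburcu's actual internals):

/* Unpatched urcu-bp reader fast path: a full memory barrier (e.g.
 * mfence on x86) on every read-side lock/unlock, paid billions of
 * times over a 10 second run. */
static inline void read_lock_no_membarrier(void)
{
        note_reader_active();           /* illustrative helper */
        __sync_synchronize();           /* full memory barrier, every read */
}

/* Patched urcu-bp reader fast path: a compiler barrier only. The
 * writer's synchronize_rcu() instead calls
 * membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0), imposing the real
 * barrier on all of the process's running threads at once. */
static inline void read_lock_with_membarrier(void)
{
        note_reader_active();                   /* illustrative helper */
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier, ~free */
}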

Conclusion
==========

This SRU changes liburcu to use the MEMBARRIER_CMD_PRIVATE_EXPEDITED
command of the sys_membarrier syscall, over the previous
MEMBARRIER_CMD_SHARED command.

MEMBARRIER_CMD_SHARED blocks because it must wait for all threads in
the system to agree on the view of memory.
MEMBARRIER_CMD_PRIVATE_EXPEDITED only requires agreement among the
threads of the calling process, and it is guaranteed never to block.
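
As a minimal sketch of how the two commands are used (raw syscall(2),
since glibc at the time did not ship a membarrier wrapper; constants
from linux/membarrier.h; error handling elided; this mirrors, but is
not, the patched liburcu code):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        int supported = membarrier(MEMBARRIER_CMD_QUERY, 0);

        if (supported < 0)
                return 1;       /* kernel < 4.3: no membarrier at all */

        if (supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED) {
                /* One-time registration is required before first use. */
                membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
                /* Non-blocking: only this process's running threads
                 * observe the barrier, via inter-processor interrupts. */
                membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        } else {
                /* Fallback: system-wide barrier, may block. */
                membarrier(MEMBARRIER_CMD_SHARED, 0);
        }
        return 0;
}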

With the non-blocking behaviour, sys_membarrier completes much more
quickly, and it can be called many more times per second than the
previous blocking implementation.

For most workloads, not getting stuck in a blocking call to
sys_membarrier should improve application performance. While the
benchmark programs do show a one-third drop in operations for the plain
urcu variant, that result is in line with the current state of the art
in Focal.

I believe this SRU is a net benefit to the performance of liburcu.

https://bugs.launchpad.net/bugs/1876230

Title:
  liburcu: Enable MEMBARRIER_CMD_PRIVATE_EXPEDITED to address
  performance problems with MEMBARRIER_CMD_SHARED

Status in liburcu package in Ubuntu:
  Fix Released
Status in liburcu source package in Bionic:
  In Progress

Bug description:
  [Impact]

  In Linux 4.3, a new syscall called "membarrier" was introduced. It
  was designed specifically for use by userspace-rcu (liburcu) to speed
  up the fast path / reader side of the library. The original
  implementation in Linux 4.3 only supported the MEMBARRIER_CMD_SHARED
  subcommand of the membarrier syscall.

  MEMBARRIER_CMD_SHARED executes a memory barrier on all threads from
  all processes running on the system. When it exits, the userspace
  thread which called it is guaranteed that all running threads share
  the same world view in regards to userspace addresses which are
  consumed by readers and writers.

  The problem with MEMBARRIER_CMD_SHARED is that system calls made in
  this fashion can block: the barrier spans every thread in the system,
  and some of those threads may be waiting on blocking operations and
  take time to reach the barrier.

  In Linux 4.14, this was addressed by adding the
  MEMBARRIER_CMD_PRIVATE_EXPEDITED command to the membarrier syscall. It
  only targets threads which share the same mm as the thread calling the
  membarrier syscall, i.e. the threads of the current process, not all
  threads / processes in the system.

  Calls to membarrier with the MEMBARRIER_CMD_PRIVATE_EXPEDITED command
  are guaranteed non-blocking, due to using inter-processor interrupts
  to implement memory barriers.

  Because of this, membarrier calls that use
  MEMBARRIER_CMD_PRIVATE_EXPEDITED are much faster than those that use
  MEMBARRIER_CMD_SHARED.

  Since Bionic uses a 4.15 kernel, all kernel requirements are met, and
  this SRU is to enable support for MEMBARRIER_CMD_PRIVATE_EXPEDITED in
  the liburcu package.

  This brings the performance of the liburcu library back in line with
  where it was in Trusty; the affected user hit these performance
  problems upon upgrading from Trusty to Bionic.

  [Test]

  Testing performance is heavily dependent on the application which
  links against liburcu, and the workload which it executes.

  A test package is available in the following ppa:
  https://launchpad.net/~mruffell/+archive/ubuntu/sf276198-test

  For the sake of testing, we can use the benchmarks provided in the
  liburcu source code. Download a copy of the source code for liburcu
  either from the repos or from github:

  $ pull-lp-source liburcu bionic
  # OR
  $ git clone https://github.com/urcu/userspace-rcu.git
  $ git checkout v0.10.1 # version in bionic

  Build the code:

  $ ./bootstrap
  $ ./configure
  $ make

  Go into the tests/benchmark directory

  $ cd tests/benchmark

  From there, you can run benchmarks for the four main usages of
  liburcu: urcu, urcu-bp, urcu-signal and urcu-mb.

  On an 8 core machine, with 6 reader threads, 2 writer threads and a
  10 second runtime, execute:

  $ ./test_urcu 6 2 10
  $ ./test_urcu_bp 6 2 10
  $ ./test_urcu_signal 6 2 10
  $ ./test_urcu_mb 6 2 10

  Results:

  ./test_urcu 6 2 10
  0.10.1-1: 17612527667 reads, 268 writes, 17612527935 ops
  0.10.1-1ubuntu1: 14988437247 reads, 810069 writes, 14989247316 ops

  $ ./test_urcu_bp 6 2 10
  0.10.1-1: 1177891079 reads, 1699523 writes, 1179590602 ops
  0.10.1-1ubuntu1: 13230354737 reads, 575314 writes, 13230930051 ops

  $ ./test_urcu_signal 6 2 10
  0.10.1-1: 20128392417 reads, 6859 writes, 20128399276 ops
  0.10.1-1ubuntu1: 20501430707 reads, 6890 writes, 20501437597 ops

  $ ./test_urcu_mb 6 2 10
  0.10.1-1: 627996563 reads, 5409563 writes, 633406126 ops
  0.10.1-1ubuntu1: 653194752 reads, 4590020 writes, 657784772 ops

  The SRU only changes behaviour for urcu and urcu-bp, since they are
  the only "flavours" of liburcu which the patches change. From a pure
  ops standpoint:

  $ ./test_urcu 6 2 10
  17612527935 ops
  14989247316 ops

  $ ./test_urcu_bp 6 2 10
  1179590602 ops
  13230930051 ops

  We see that in this particular benchmark workload, test_urcu incurs
  extra overhead with MEMBARRIER_CMD_PRIVATE_EXPEDITED, which is
  explained by the extra impact it has on the slow path, and by the
  larger number of writes performed during my benchmark run.

  The real winner in this benchmark workload is test_urcu_bp, which sees
  a 10x performance increase with MEMBARRIER_CMD_PRIVATE_EXPEDITED. Some
  of this may be down to the 3x fewer writes it performed during my
  benchmark run.

  Again, these benchmarks are indicative only and quite noisy.
  Performance really depends on the application which links against
  liburcu and on its workload.

  [Regression Potential]

  This SRU changes the behaviour of the following libraries which
  applications link against: -lurcu and -lurcu-bp. Behaviour is not
  changed in the rest: -lurcu-qsbr, -lurcu-signal and -lurcu-mb.

  On Bionic, liburcu will call the membarrier syscall in urcu and urcu-
  bp. This does not change. What is changing is the semantics of that
  syscall, from MEMBARRIER_CMD_SHARED to
  MEMBARRIER_CMD_PRIVATE_EXPEDITED. The changed code is all run in
  kernel space and resides in the kernel. These commits simply change
  the parameters which are supplied to the membarrier syscall from
  liburcu.

  I have run the testsuite that comes with the Bionic source code, and
  "make regtest", "make short_bench" and "make long_bench" all pass. You
  will want to run these on a cloud instance somewhere, since they take
  multiple hours.

  If a regression were to occur, applications linked against -lurcu and
  -lurcu-bp would be affected. The homepage: https://liburcu.org/ offers
  a list of the major applications that use liburcu: Knot DNS, Netsniff-
  ng, Sheepdog, GlusterFS, gdnsd and LTTng.

  [Scope]

  The two commits which are being SRU'd are:

  commit c0bb9f693f926595a7cb8b4ce712cef08d9f5d49
  Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
  Date: Thu Dec 21 13:42:23 2017 -0500
  Subject: liburcu: Use membarrier private expedited when available
  Link: 
https://github.com/urcu/userspace-rcu/commit/c0bb9f693f926595a7cb8b4ce712cef08d9f5d49

  commit 3745305bf09e7825e75ee5b5490347ee67c6efdd
  Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
  Date: Fri Dec 22 10:57:59 2017 -0500
  Subject: liburcu-bp: Use membarrier private expedited when available
  Link: 
https://github.com/urcu/userspace-rcu/commit/3745305bf09e7825e75ee5b5490347ee67c6efdd

  Both cherry-pick directly onto 0.10.1 in Bionic, and both are
  originally from 0.11.0, meaning that Eoan, Focal and Groovy already
  have the patches.

  [Other]

  If you are interested in how the membarrier syscall works, you can
  read their commits in the Linux kernel:

  commit 5b25b13ab08f616efd566347d809b4ece54570d1
  Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
  Date:   Fri Sep 11 13:07:39 2015 -0700
  Subject: sys_membarrier(): system-wide memory barrier (generic, x86)
  Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b25b13ab08f616efd566347d809b4ece54570d1

  commit 22e4ebb975822833b083533035233d128b30e98f
  Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
  Date:   Fri Jul 28 16:40:40 2017 -0400
  Subject: membarrier: Provide expedited private command
  Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22e4ebb975822833b083533035233d128b30e98f

  Additionally, blog posts from LTTng:
  
https://lttng.org/blog/2018/01/15/membarrier-system-call-performance-and-userspace-rcu/

  And Phoronix:
  
https://www.phoronix.com/scan.php?page=news_item&px=URCU-Membarrier-Performance
