On Thu, 21 Nov 2024, Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> On 2024-10-21 19:35, Paul E. McKenney wrote:
>> On Mon, Oct 21, 2024 at 03:53:04PM -0400, Olivier Dion wrote:
[...]
>> How much of the added "Volatile access" overhead is due to the volatile
>> load and how much to the cmm_ptr_eq?  Many use cases do not need to
>> compare pointers, except maybe against NULL.  Or against a sentinel.
>> In both cases, an equality comparison means no dereferencing, so no
>> problems.
>
> Olivier will prepare benchmarks without the cmm_ptr_eq() so we can isolate
> the overhead contribution of volatile vs atomic builtins more
> specifically.

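For readers who want the two variants side by side, here is a rough
sketch of what is being compared.  It is an illustration only, not the
actual liburcu code: the names deref_volatile(), deref_atomic(),
struct node, item and gp are hypothetical, and the use of
__ATOMIC_CONSUME for the atomic variant is an assumption (GCC and Clang
currently promote consume to acquire, which would be consistent with an
ldar on AArch64).

```c
/* Hypothetical payload and RCU-protected pointer, for illustration only. */
struct node {
	long value;
};

static struct node item = { 42 };
static struct node *gp = &item;

/*
 * Variant V (assumed shape): load the pointer through a
 * volatile-qualified lvalue, so the compiler emits exactly one
 * plain load and cannot cache or refetch it.
 */
static inline struct node *deref_volatile(struct node **p)
{
	return *(struct node * volatile *)p;
}

/*
 * Variant A (assumed shape): load the pointer with the GCC/Clang
 * atomic builtin.  Consume ordering is an assumption here; current
 * compilers treat it as acquire.
 */
static inline struct node *deref_atomic(struct node **p)
{
	return __atomic_load_n(p, __ATOMIC_CONSUME);
}
```

Both helpers return the same pointer; only the load instruction the
compiler is allowed to emit differs.
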
Here is the micro-benchmark without pointer comparison.  Tight loop of
rcu_dereference() run 1 000 000 000 times:

Hardware: ARM Cortex-A57

Overview:

| Implementation | Instructions   | Cycles         | Branch misses | Task clock (ms) | Insn/cycle |
|----------------+----------------+----------------+---------------+-----------------+------------|
| Volatile (V)   | 10 006 366 281 |  6 011 214 706 |        21 168 |        3 159.60 |       1.66 |
| Atomic (A)     | 10 020 098 136 | 21 081 007 289 |        46 091 |       11 039.38 |       0.48 |
|----------------+----------------+----------------+---------------+-----------------+------------|
| Δ (A / V - 1)  |         0.14 % |       250.69 % |      117.74 % |        249.39 % |   -71.08 % |

Volatile:

  0000000000000860 <func>:
   860:   90000100        adrp    x0, 20000 <__libc_start_main@GLIBC_2.34>
   864:   91012001        add     x1, x0, #0x48
   868:   f9402400        ldr     x0, [x0, #72]   ;; rcu_dereference()
   86c:   f9400000        ldr     x0, [x0]
   870:   f9000420        str     x0, [x1, #8]
   874:   d65f03c0        ret

          3,159.60 msec task-clock                #    0.999 CPUs utilized
                 3      context-switches          #    0.949 /sec
                 0      cpu-migrations            #    0.000 /sec
                42      page-faults               #   13.293 /sec
     6,011,214,706      cycles                    #    1.903 GHz
    10,006,366,281      instructions              #    1.66  insn per cycle
   <not supported>      branches
            21,168      branch-misses

       3.161819264 seconds time elapsed

       3.161902000 seconds user
       0.000000000 seconds sys

Atomic:

  0000000000000860 <func>:
   860:   90000100        adrp    x0, 20000 <__libc_start_main@GLIBC_2.34>
   864:   91012000        add     x0, x0, #0x48
   868:   c8dffc01        ldar    x1, [x0]        ;; rcu_dereference()
   86c:   f9400021        ldr     x1, [x1]
   870:   f9000401        str     x1, [x0, #8]
   874:   d65f03c0        ret

         11,039.38 msec task-clock                #    1.000 CPUs utilized
                20      context-switches          #    1.812 /sec
                 0      cpu-migrations            #    0.000 /sec
                43      page-faults               #    3.895 /sec
    21,081,007,289      cycles                    #    1.910 GHz
    10,020,098,136      instructions              #    0.48  insn per cycle
   <not supported>      branches
            46,091      branch-misses

      11.042103521 seconds time elapsed

      11.041847000 seconds user
       0.000000000 seconds sys

[...]

-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com