Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> writes:

<snip>
>
> > Subject: [PATCH v8 1/3] doc: add optimizations using C11 atomic built-ins
> >
> > Add information about possible optimizations using C11 atomic built-ins.
> >
> > Signed-off-by: Phil Yang <phil.y...@arm.com>
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
>
> Thanks for the changes, they look good now.
>
> David wanted to change 'built-ins' to 'builtins', otherwise
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Will do in the next version. Thanks.

>
> > ---
> >  doc/guides/prog_guide/writing_efficient_code.rst | 59 +++++++++++++++++++++++-
> >  1 file changed, 58 insertions(+), 1 deletion(-)
> >
> > diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
> > index 849f63e..53a1ca1 100644
> > --- a/doc/guides/prog_guide/writing_efficient_code.rst
> > +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> > @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
> >  Locks and Atomic Operations
> >  ---------------------------
> >
> > -Atomic operations imply a lock prefix before the instruction,
> > +This section describes some key considerations when using locks and
> > +atomic operations in the DPDK environment.
> > +
> > +Locks
> > +~~~~~
> > +
> > +On x86, atomic operations imply a lock prefix before the instruction,
> >  causing the processor's LOCK# signal to be asserted during execution of the following instruction.
> >  This has a big impact on performance in a multicore environment.
> >
> > @@ -176,6 +182,57 @@ It can often be replaced by other solutions like per-lcore variables.
> >  Also, some locking techniques are more efficient than others.
> >  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
> >
> > +Atomic Operations: Use C11 Atomic Built-ins
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +DPDK generic rte_atomic operations are implemented by __sync built-ins.
> > +These __sync built-ins result in full barriers on aarch64, which are
> > +unnecessary in many use cases. They can be replaced by __atomic
> > +built-ins that conform to the C11 memory model and provide finer memory order control.
> > +
> > +So replacing the rte_atomic operations with __atomic built-ins might
> > +improve performance for aarch64 machines.
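
For readers of the thread, here is a minimal sketch of the replacement the
quoted paragraph describes, using a statistics counter as the example. It is
only an illustration under stated assumptions: `struct pkt_stats` and the
`stats_*` helpers are hypothetical names, not DPDK APIs, and the only real
interfaces used are the GCC/Clang `__sync` and `__atomic` built-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statistics block for illustration; not a DPDK structure. */
struct pkt_stats {
	uint64_t rx_packets;
};

/* Legacy style: the __sync built-in implies a full barrier. */
static inline uint64_t
stats_inc_sync(struct pkt_stats *s, uint64_t n)
{
	assert(s != NULL);
	return __sync_add_and_fetch(&s->rx_packets, n);
}

/*
 * C11-style built-in: the increment is still atomic, but RELAXED
 * requests no ordering against surrounding memory operations.
 */
static inline uint64_t
stats_inc_relaxed(struct pkt_stats *s, uint64_t n)
{
	assert(s != NULL);
	return __atomic_add_fetch(&s->rx_packets, n, __ATOMIC_RELAXED);
}

/* Atomic read of the counter, again without any ordering requirement. */
static inline uint64_t
stats_read(struct pkt_stats *s)
{
	assert(s != NULL);
	return __atomic_load_n(&s->rx_packets, __ATOMIC_RELAXED);
}
```

Both increment helpers return the updated counter value; they differ only in
the ordering they request, which is the point the patch text makes about
aarch64.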
> > +
> > +Some typical optimization cases are listed below:
> > +
> > +Atomicity
> > +^^^^^^^^^
> > +
> > +Some use cases require atomicity alone, the ordering of the memory
> > +operations does not matter. For example, the packet statistics counters
> > +need to be incremented atomically but do not need any particular memory ordering.
> > +So, RELAXED memory ordering is sufficient.
> > +
> > +One-way Barrier
> > +^^^^^^^^^^^^^^^
> > +
> > +Some use cases allow for memory reordering in one way while requiring
> > +memory ordering in the other direction.
> > +
> > +For example, the memory operations before the spinlock lock are allowed
> > +to move to the critical section, but the memory operations in the
> > +critical section are not allowed to move above the lock. In this case,
> > +the full memory barrier in the compare-and-swap operation can be replaced with ACQUIRE memory order.
> > +On the other hand, the memory operations after the spinlock unlock are
> > +allowed to move to the critical section, but the memory operations in
> > +the critical section are not allowed to move below the unlock. So the
> > +full barrier in the store operation can use RELEASE memory order.
> > +
> > +Reader-Writer Concurrency
> > +^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> > +
> > +The payload or the data that the writer wants to communicate to the
> > +reader, can be written with RELAXED memory order. However, the guard
> > +variable should be written with RELEASE memory order. This ensures that
> > +the store to guard variable is observable only after the store to payload is observable.
> > +
> > +Correspondingly, on the reader side, the guard variable should be read
> > +with ACQUIRE memory order. The payload or the data the writer
> > +communicated, can be read with RELAXED memory order.
> > +This ensures that,
> > +if the store to guard variable is observable, the store to payload is also observable.
> > +
> >  Coding Considerations
> >  ---------------------
> >
> > --
> > 2.7.4
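
As a reference for the one-way barrier and reader-writer cases in the quoted
patch, the following is a rough sketch of how they might look with the
`__atomic` built-ins. Everything prefixed `sketch_` is a hypothetical name made
up for this example (it is not rte_spinlock or any other DPDK API), and the
single-slot mailbox is an assumed, simplified stand-in for a real guard/payload
pair shared between a writer and a reader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical spinlock for illustration; not rte_spinlock_t. */
struct sketch_spinlock {
	int locked; /* 0 = free, 1 = held */
};

static inline bool
sketch_spinlock_trylock(struct sketch_spinlock *sl)
{
	int expected = 0;

	/* ACQUIRE on success: critical-section accesses cannot move above the lock. */
	return __atomic_compare_exchange_n(&sl->locked, &expected, 1,
			false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
sketch_spinlock_lock(struct sketch_spinlock *sl)
{
	while (!sketch_spinlock_trylock(sl))
		; /* spin; a real lock would add a pause or yield hint here */
}

static inline void
sketch_spinlock_unlock(struct sketch_spinlock *sl)
{
	/* RELEASE: critical-section accesses cannot move below the unlock. */
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

/* Hypothetical single-slot mailbox: 'guard' tells the reader 'payload' is valid. */
struct sketch_mailbox {
	uint32_t payload;
	uint32_t guard; /* 0 = empty, 1 = payload valid */
};

/* Writer side: RELAXED payload store, then RELEASE store to the guard. */
static inline void
sketch_mailbox_publish(struct sketch_mailbox *mb, uint32_t value)
{
	__atomic_store_n(&mb->payload, value, __ATOMIC_RELAXED);
	__atomic_store_n(&mb->guard, 1, __ATOMIC_RELEASE);
}

/* Reader side: ACQUIRE load of the guard, then RELAXED payload load. */
static inline bool
sketch_mailbox_poll(struct sketch_mailbox *mb, uint32_t *out)
{
	assert(out != NULL);
	if (__atomic_load_n(&mb->guard, __ATOMIC_ACQUIRE) == 0)
		return false; /* nothing published yet */
	*out = __atomic_load_n(&mb->payload, __ATOMIC_RELAXED);
	return true;
}
```

In this sketch the writer would call `sketch_mailbox_publish()` on one lcore
while the reader polls with `sketch_mailbox_poll()` on another; the
RELEASE/ACQUIRE pair on `guard` is what makes the payload visible once the
guard is seen as set. The mailbox is one-shot and never reset, which keeps the
example short but is not how a real queue would behave.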