Hi Honnappa, Thanks for your comments and suggestions. I Will update it in the next version.
Hi John, I would greatly appreciate it if you kindly give me some feedback on this proposal. Thanks, Phil > Hi Phil, > Few comments. > > <snip> > > > Subject: [PATCH v7 1/3] doc: add generic atomic deprecation section > I think we should change this as follows: > "doc: add optimizations using C11 atomic built-ins" > > > > > Add deprecating the generic rte_atomic_xx APIs to C11 atomic built-ins > guide > > and examples. > I think same here, something like following would help: > "Add information about possible optimizations using C11 atomic built-ins" > > > > > Signed-off-by: Phil Yang <phil.y...@arm.com> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> > > --- > > doc/guides/prog_guide/writing_efficient_code.rst | 64 > > +++++++++++++++++++++++- > > 1 file changed, 63 insertions(+), 1 deletion(-) > > > > diff --git a/doc/guides/prog_guide/writing_efficient_code.rst > > b/doc/guides/prog_guide/writing_efficient_code.rst > > index 849f63e..16d6188 100644 > > --- a/doc/guides/prog_guide/writing_efficient_code.rst > > +++ b/doc/guides/prog_guide/writing_efficient_code.rst > > @@ -167,7 +167,13 @@ but with the added cost of lower throughput. > > Locks and Atomic Operations > > --------------------------- > > > > -Atomic operations imply a lock prefix before the instruction, > > +This section describes some key considerations when using locks and > > +atomic operations in the DPDK environment. > > + > > +Locks > > +~~~~~ > > + > > +On x86, atomic operations imply a lock prefix before the instruction, > > causing the processor's LOCK# signal to be asserted during execution of > the > > following instruction. > > This has a big impact on performance in a multicore environment. > > > > @@ -176,6 +182,62 @@ It can often be replaced by other solutions like per- > > lcore variables. > > Also, some locking techniques are more efficient than others. > > For instance, the Read-Copy-Update (RCU) algorithm can frequently > replace > > simple rwlocks. > > > > +Atomic Operations: Use C11 Atomic Built-ins > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +DPDK generic rte_atomic operations are implemented by `__sync built-ins > > +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_. > We can skip the hyper-link. Google search on "__sync built-in functions" > leads to this page easily. > > > +These __sync built-ins result in full barriers on aarch64, which are > > +unnecessary in many use cases. They can be replaced by `__atomic > > +built-ins > > > +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ > Same here, we can skip the hyper-link. > > > +that conform to the C11 memory model and provide finer memory order > > control. > > + > > +So replacing the rte_atomic operations with __atomic built-ins might > > +improve performance for aarch64 machines. > > + > > +Some typical optimization cases are listed below: > > + > > +Atomicity > > +^^^^^^^^^ > > + > > +Some use cases require atomicity alone, the ordering of the memory > > +operations does not matter. For example the packets statistics in > > +``virtio_xmit()`` function of ``vhost`` example application. It just > > +updates the number of transmitted packets, no subsequent logic > depends > Instead of referring to the code here, it is better to keep it generic. May be > something like below? > "For example, the packet statistics counters need to be incremented > atomically but do not need any particular memory ordering. So, RELAXED > memory ordering is sufficient". > > > +on these counters. So the RELAXED memory ordering is sufficient. > > + > > +One-way Barrier > > +^^^^^^^^^^^^^^^ > > + > > +Some use cases allow for memory reordering in one way while requiring > > +memory ordering in the other direction. > > + > Spinlock is a good example, making is little more generic would be helpful. > > > +For example, the memory operations before the ``rte_spinlock_lock()`` > Instead of referring to "rte_spinlock_lock" can you just say spinlock lock > > +can move to the critical section, but the memory operations in the > ^^^ are allowed to > > +critical section cannot move above the lock. In this case, the full > ^^^^^^ are not allowed to > > +memory barrier in the compare-and-swap operation can be replaced to > > ^^ with > > +ACQUIRE. On the other hand, the memory operations after the > ^^^^^^^^ ACQUIRE memory order. > > +``rte_spinlock_unlock()`` can move to the critical section, but the > ^^^ are allowed to > > +memory operations in the critical section cannot move below the unlock. > > ^^^^^^ are not allowed to > > +So the full barrier in the STORE operation can be replaced with RELEASE. > I think this above statement can be replaced with something like below: > > "So the full barrier in the store operation can use RELEASE memory order." > > > + > > +Reader-Writer Concurrency > > +^^^^^^^^^^^^^^^^^^^^^^^^^ > > + > > +Lock-free reader-writer concurrency is one of the common use cases in > DPDK. > > + > > +The payload or the data that the writer wants to communicate to the > > +reader, can be written with RELAXED memory order. However, the guard > > +variable should be written with RELEASE memory order. This ensures that > > +the store to guard variable is observable only after the store to payload > > is > > observable. > > +Refer to ``rte_hash_cuckoo_insert_mw()`` for an example. > We can remove just this line above. > > > + > > +Correspondingly, on the reader side, the guard variable should be read > > +with ACQUIRE memory order. The payload or the data the writer > > +communicated, can be read with RELAXED memory order. This ensures > that, > > +if the store to guard variable is observable, the store to payload is also > > observable. > > +Refer to rte_hash ``search_one_bucket_lf()`` for an example. > Same here, suggest removing just the above line. > > > + > > Coding Considerations > > --------------------- > > > > -- > > 2.7.4