On 08/10/2018, 08:27, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:

    -----Original Message-----
    > Date: Sun, 7 Oct 2018 21:09:25 +0000
    > From: Ola Liljedahl <ola.liljed...@arm.com>
    > To: Jerin Jacob <jerin.ja...@caviumnetworks.com>, Jan Viktorin
    >  <vikto...@rehivetech.com>, "Gavin Hu (Arm Technology China)"
    >  <gavin...@arm.com>
    > CC: "dev@dpdk.org" <dev@dpdk.org>, "tho...@monjalon.net"
    >  <tho...@monjalon.net>
    > Subject: Re: [dpdk-dev] [PATCH] eal/armv7: add support for rte pause
    > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >
    > External Email
    >
    > On 07/10/2018, 08:32, "Jerin Jacob" <jerin.ja...@caviumnetworks.com> wrote:
    >
    >     Add support for rte_pause() implementation for armv7.
    >
    >     Signed-off-by: Jerin Jacob <jerin.ja...@caviumnetworks.com>
    >     ---
    >
    >     The reference implementation for Linux's cpu_relax() for armv7 is at
    >     https://elixir.bootlin.com/linux/latest/source/arch/arm/include/asm/processor.h#L100
    >
    >     ---
    >      lib/librte_eal/common/include/arch/arm/rte_pause_32.h | 4 +++-
    >      1 file changed, 3 insertions(+), 1 deletion(-)
    >
    >     diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_32.h b/lib/librte_eal/common/include/arch/arm/rte_pause_32.h
    >     index d4768c7a9..9b856e0cf 100644
    >     --- a/lib/librte_eal/common/include/arch/arm/rte_pause_32.h
    >     +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_32.h
    >     @@ -9,11 +9,13 @@
    >      extern "C" {
    >      #endif
    >
    >     -#include <rte_common.h>
    >     +#include <rte_atomic.h>
    >     +
    >      #include "generic/rte_pause.h"
    >
    >      static inline void rte_pause(void)
    >      {
    >     +rte_compiler_barrier();
    > The compiler barrier is not mandated by the DPDK documentation for
    > rte_pause():
    > http://doc.dpdk.org/api/rte__pause_8h.html

    We can add that explicitly if required, to be in line with other archs, just
    like the Linux kernel's cpu_relax().
I think the documentation should specify this compiler barrier if it is needed 
for correct behaviour.


    >
    > You have to go all the way to the source and GCC documentation to
    > discover that for GCC, rte_pause calls _mm_pause() which in turn is
    > implemented using __builtin_ia32_pause().
    > https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html
    > void __builtin_ia32_pause (void)
    > Generates the pause machine instruction with a compiler memory barrier.

    Yes. IMO, it makes sense to have a compiler memory barrier to make sure it
    waits semantically, at least WRT current rte_pause() usage.
Current *non-C11* usage. But more and more code in DPDK uses the C11 memory 
model.


    >
    > If you are using C11 atomic operations, e.g. for polling a location, the
    > atomic operations will be able to provide the required semantics (e.g. don't
    > merge atomic loads from different iterations of a loop, optionally provide
    > acquire and/or release (or stronger) ordering). A compiler barrier here
    > interferes with the (possibly weaker) barriers from the atomic operations. We
    > could use a C11 version of rte_pause() that doesn't have the compiler barrier.
    > But actually, we want support for WFE; x86 also has something similar now,
    > MONITOR/MWAIT.
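
To illustrate the point about C11 atomics in the quoted paragraph above, here is a minimal sketch of a polling loop that relies only on the atomic load for its semantics. The function name wait_until_nonzero() is hypothetical and not a DPDK API; the acquire ordering is one possible choice, not something the thread prescribes.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not part of DPDK): spin until *flag becomes nonzero. */
static inline uint32_t
wait_until_nonzero(_Atomic uint32_t *flag)
{
	uint32_t v;

	/*
	 * Each iteration performs its own atomic load, so the compiler may not
	 * merge the loads or hoist them out of the loop. No separate compiler
	 * barrier is required for the loop to observe a store by another thread.
	 */
	while ((v = atomic_load_explicit(flag, memory_order_acquire)) == 0)
		; /* a barrier-free pause or WFE-based wait could go here */

	return v;
}
```

The acquire load orders later accesses after the observed store; a caller needing only the value could use memory_order_relaxed instead.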

    If it is WFE, then who will wake the core up from the power-saving state? SEV
    from the other thread?
SEV/WFE is the ARMv7 way of waiting for an event, but the wake-up is very crude
(SEV broadcasts an event to *all* cores). ARMv8 introduces a new way where the
waiting thread uses SEVL/WFE/LDXR/WFE to wait for a specific location (in
practice, a cache line) to be updated, and whichever thread writes the location
will automatically notify any waiters (no SEV needed). See code example in
other email thread.
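
For readers without the other thread at hand, the following is a rough sketch of the ARMv8 SEVL/WFE/LDXR sequence described above. It is not the code from that thread. The name wait_while_equal32() is hypothetical, and the non-AArch64 branch is only an assumed fallback so the sketch compiles elsewhere. The "memory" clobbers are used here to tell the compiler that the asm reads memory; they also act as a compiler barrier, which a production version might want to avoid.

```c
#include <stdint.h>

/* Hypothetical helper (not a DPDK API): return once *addr differs from old. */
static inline void
wait_while_equal32(uint32_t *addr, uint32_t old)
{
#if defined(__aarch64__)
	uint32_t v;

	/* SEVL sets the local event register so the first WFE falls through. */
	__asm__ volatile("sevl" : : : "memory");
	do {
		/* Sleep until an event arrives (none is pending after round one). */
		__asm__ volatile("wfe" : : : "memory");
		/*
		 * LDXR reads the value and arms the exclusive monitor; a write to
		 * the monitored location by another core clears the monitor, which
		 * generates the event that wakes the next WFE.
		 */
		__asm__ volatile("ldxr %w0, [%1]"
				 : "=&r"(v) : "r"(addr) : "memory");
	} while (v == old);
#else
	/* Assumed fallback for other architectures: plain polling. */
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) == old)
		;
#endif
}
```

LDXR gives no ordering by itself; code that needs acquire semantics would have to add them, for example by using LDAXR instead.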


    What would be a C11 version of rte_pause()?
A function that stalls the CPU for some ten(s) of cycles. No implicit or
explicit (compiler) barriers. E.g. ISB on ARM which, unlike NOP, actually
stalls the pipeline for 10-20 cycles (but ISB will also have HW barrier
semantics). But as I wrote above, using WFE would be better (at least it has
been better in the internal benchmarks I have done/seen). Much better to focus
our efforts on how to make use of WFE for C11 code.
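
As a sketch only, a pause without a compiler barrier along the lines described above might look like the following. The name pause_relaxed() is hypothetical, and the x86 branch is an assumption added so the example builds on non-ARM hosts. The asm statements deliberately omit the "memory" clobber, so the compiler is not told to treat them as memory barriers.

```c
/* Hypothetical barrier-free pause (not the DPDK rte_pause() implementation). */
static inline void
pause_relaxed(void)
{
#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
	/* ISB stalls the pipeline; it still has hardware barrier semantics. */
	__asm__ volatile("isb");
#elif defined(__x86_64__) || defined(__i386__)
	/* Assumed x86 counterpart: the PAUSE instruction, no memory clobber. */
	__asm__ volatile("pause");
#else
	/* Other architectures: no-op in this sketch. */
#endif
}
```

Because there is no "memory" clobber, a loop that polls a plain (non-atomic) variable around this call could still be optimised into an infinite loop; it is meant to be paired with C11 atomic loads.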


    >
    > -- Ola
    >
    >
    >      }
    >
    >      #ifdef __cplusplus
    >     --
    >     2.19.0
    >
    >
    >
    > IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.

